The results behind the reliability layer.

QSI has been validated across science, math, medicine, code, and general knowledge — thousands of evaluations spanning open and frontier models. Here is what holds.

13+ models covered, open to frontier

0.86–0.96 AUC separating correct from incorrect

5 domains validated end-to-end

thousands of independent evaluations

Cross-model coverage

One detector, 13 models.

The same governance layer separates correct from incorrect answers across 5 frontier and 8 open-weight models. QSI reads the mistakes of whichever model it is watching; it is not tuned to any single model.

[Chart: AUC by model, frontier vs. open-weight. AUC measures how well QSI separates correct from incorrect answers; 0.5 = chance.]
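For readers less familiar with the metric, here is a minimal sketch of how an AUC like the one above can be computed from per-answer scores. It uses the pairwise (Mann-Whitney) definition: the fraction of correct/incorrect answer pairs in which the correct answer gets the higher score. The function name `auc` and the example scores and labels are hypothetical, made up for illustration; this is not QSI's scoring method or its data.

```python
def auc(scores, labels):
    """Pairwise AUC: chance that a correct answer outscores an incorrect one.

    scores: detector confidence per answer (higher = more likely correct).
    labels: 1 if the answer was actually correct, 0 if it was incorrect.
    Ties between a correct and an incorrect answer count as half a win.
    """
    correct = [s for s, y in zip(scores, labels) if y]
    incorrect = [s for s, y in zip(scores, labels) if not y]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((c > i) + 0.5 * (c == i) for c in correct for i in incorrect)
    return wins / (len(correct) * len(incorrect))


# Hypothetical example: six answers, three correct and three incorrect.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
example_labels = [1, 1, 0, 1, 0, 0]
print(auc(example_scores, example_labels))
```

A detector that scores every answer identically lands at 0.5, which is the chance line on the chart; a detector that ranks every correct answer above every incorrect one reaches 1.0.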

Category coverage

Where weaker models go wrong.

QSI catches the mistakes weaker and specialized models make — error rates of 40–80% on hard items, surfaced before they reach users.

| Domain | Hard-item error rate | What QSI is catching |
|---|---|---|
| Science | 52% | Graduate-level reasoning where confident-sounding answers are often wrong. |
| Mathematics | 61% | Multi-step problems where a single slip invalidates the result. |
| Medicine | 44% | High-stakes factual recall where a wrong answer must not reach a user. |
| Code | 73% | Subtle logic and edge-case errors that pass a quick read but fail in production. |
| General knowledge | 40% | Broad factual questions spanning everyday and specialist domains. |

Error rates illustrate how often weaker/specialized models are wrong on hard items — the failures QSI surfaces. Figures are indicative of the coverage range, not a single benchmark.

Want the numbers for your models and your domains?

We run QSI against your traffic and share the separation we get on your data.
