An audit of seven LLMs across four regulated-knowledge domains: 95–100% audited accuracy in 25 of 28 model-domain results (lowest 77.84%), every number reproducible from the run records.
Point a test harness at a body of written rules. Generate a fresh question set from it.
Give a model the source evidence and have it answer, then audit each answer against that source — with a different vendor's
model reviewing every contested verdict, so nothing confirms itself.
Below is the full matrix: every model, every domain, baseline and audited, low scores included.
The run records are downloadable so you can check the work yourself.
I'm sharing this because I want technical criticism — particularly on evaluator bias, the unanswerable
questions, and whether this is a sound way to audit retrieval-grounded answers. Tell me where it's wrong.
Results
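Baseline = the model with standard BM25 retrieval, fetching its own evidence. Audited = accuracy after the audit pipeline, in which the model is given the correct source passage and its answer is audited against it.
Both columns are percentages, with malformed questions excluded symmetrically from both. Δ = Audited − Baseline, in percentage points: the gap a team could recover.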
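The column definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the `Graded` record and its field names are assumptions made for the example. It shows the one property that matters for the comparison, namely that malformed items are dropped once, so both columns share the same denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graded:
    """One graded question (hypothetical record shape, for illustration only)."""
    question_id: str
    baseline_correct: bool  # model fetched its own evidence via BM25
    audited_correct: bool   # model was given the correct source passage
    malformed: bool         # bad question or broken ground truth


def score(items: list[Graded]) -> dict[str, float]:
    """Return baseline %, audited %, and delta in percentage points."""
    # Symmetric exclusion: filter once so both columns use the same denominator.
    kept = [g for g in items if not g.malformed]
    if not kept:
        raise ValueError("no scorable questions left after exclusion")
    n = len(kept)
    baseline = 100.0 * sum(g.baseline_correct for g in kept) / n
    audited = 100.0 * sum(g.audited_correct for g in kept) / n
    return {
        "scored": n,
        "excluded": len(items) - n,  # reported, not hidden
        "baseline": round(baseline, 2),
        "audited": round(audited, 2),
        "delta": round(audited - baseline, 2),
    }
```

Filtering before either column is computed is the design choice that makes the exclusion symmetric: an item cannot be removed from one arm and kept in the other.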
Nothing is cherry-picked: the full matrix is shown, including a 70B model that scores far below the rest on OSHA and FDA, so the numbers are not a selected subset.
OSHA — 29 CFR workplace-safety regulations
| Model under test | Baseline (%) | Audited (%) | Δ (pts) |
|---|---|---|---|
| Claude Opus 4.8 | 93.88 | 99.90 | +6.02 |
| GPT-5.5 | 93.85 | 99.80 | +5.95 |
| Gemini 2.5 Pro | 92.79 | 99.60 | +6.81 |
| Qwen 2.5 72B | 92.75 | 99.60 | +6.85 |
| Gemma 27B | 88.35 | 98.69 | +10.34 |
| Qwen 2.5 7B | 87.94 | 97.29 | +9.35 |
| Llama 3.1 70B | 63.54 | 77.84 | +14.30 |
~992–998 scored questions per model after malformed-question exclusion.
FDA — prescription drug labels
| Model under test | Baseline (%) | Audited (%) | Δ (pts) |
|---|---|---|---|
| Claude Opus 4.8 | 94.96 | 100.00 | +5.04 |
| GPT-5.5 | 94.47 | 99.90 | +5.43 |
| Gemini 2.5 Pro | 93.60 | 99.70 | +6.10 |
| Qwen 2.5 72B | 93.13 | 99.49 | +6.36 |
| Gemma 27B | 91.29 | 98.99 | +7.70 |
| Qwen 2.5 7B | 89.09 | 95.66 | +6.57 |
| Llama 3.1 70B | 64.84 | 80.85 | +16.01 |
~987–1,000 scored questions per model after malformed-question exclusion.
SEC — 10-K financial filings
| Model under test | Baseline (%) | Audited (%) | Δ (pts) |
|---|---|---|---|
| Claude Opus 4.8 | 89.67 | 99.18 | +9.51 |
| Gemini 2.5 Pro | 90.12 | 99.09 | +8.97 |
| GPT-5.5 | 87.53 | 99.08 | +11.55 |
| Llama 3.1 70B | 86.75 | 97.69 | +10.94 |
| Qwen 2.5 72B | 90.85 | 96.64 | +5.79 |
| Gemma 27B | 81.67 | 95.67 | +14.00 |
| Qwen 2.5 7B | 87.12 | 91.66 | +4.54 |
~900–992 scored questions per model (API calls dropped during a mid-run PC restart were excluded symmetrically from both arms).
IRS — U.S. tax code
| Model under test | Baseline (%) | Audited (%) | Δ (pts) |
|---|---|---|---|
| Claude Opus 4.8 | 94.36 | 100.00 | +5.64 |
| GPT-5.5 | 93.04 | 99.70 | +6.66 |
| Gemini 2.5 Pro | 93.99 | 99.40 | +5.41 |
| Gemma 27B | 92.51 | 98.68 | +6.17 |
| Qwen 2.5 72B | 86.76 | 98.19 | +11.43 |
| Llama 3.1 70B | 87.69 | 97.70 | +10.01 |
| Qwen 2.5 7B | 86.58 | 96.67 | +10.09 |
~988–999 scored questions per model after malformed-question exclusion.
How the audit works
For a given corpus of written rules, the harness:
1. Generates a fresh question set from the source text, including intentionally unanswerable questions, to catch a model that bluffs rather than declines.
2. Gives the model the correct source passage and records its answer, then audits that answer against the same passage and the expected answer.
3. Sends only contested verdicts (roughly 4% of answers) to a different vendor's frontier model for an independent second opinion. Vendor rotation is enforced in code, so no model ever reviews its own output. Nothing confirms itself.
4. Removes malformed test items (bad questions, broken ground truth) from the denominator, symmetrically from both the baseline and audited columns, and shows the count rather than hiding it.
5. Outputs a full run record: every question, the model's answer, the expected answer, the grading rationale, the failures, the removed items, timestamps, and model names.
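The contested-verdict routing and vendor rotation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the harness's implementation: the vendor and model names are placeholders, round-robin is only one possible rotation policy, and a verdict is marked contested with a plain boolean flag because the criteria for contesting are not described here.

```python
# Hypothetical vendor -> reviewer-model table; all names are placeholders.
REVIEWERS = {
    "vendor_a": "model-a",
    "vendor_b": "model-b",
    "vendor_c": "model-c",
}


def pick_reviewer(vendor_under_test: str, turn: int) -> tuple[str, str]:
    """Pick a reviewer from any vendor except the one being tested."""
    candidates = sorted(v for v in REVIEWERS if v != vendor_under_test)
    if not candidates:
        raise ValueError("no reviewer available from a different vendor")
    vendor = candidates[turn % len(candidates)]  # simple round-robin rotation
    return vendor, REVIEWERS[vendor]


def route(verdicts: list[tuple[str, bool]], vendor_under_test: str) -> dict[str, tuple[str, str]]:
    """Assign a cross-vendor reviewer to each contested verdict only.

    `verdicts` is a list of (answer_id, contested) pairs.
    """
    assignments: dict[str, tuple[str, str]] = {}
    turn = 0
    for answer_id, contested in verdicts:
        if not contested:
            continue  # uncontested verdicts are not sent for a second opinion
        assignments[answer_id] = pick_reviewer(vendor_under_test, turn)
        turn += 1
    return assignments
```

Because the tested vendor is removed from the candidate list before any selection happens, the rule that no model reviews its own output holds by construction and does not depend on the rotation order.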
Known limitations
- The audited number is bounded by the model under test. The harness measures truth; it does not manufacture it. A model that can't reason to the answer stays wrong, and the score says so.
- Audited accuracy reflects an oracle-grounded check, not a live serving uplift. It is a measurement of the accuracy a team is leaving on the table, not a claim about a deployed RAG stack.
- The contested-verdict reviewer is itself an LLM. The architecture's job is to make imperfect reviewers produce reliable verdicts; cross-vendor rotation is how it limits any single model's blind spot.
Download a run record
Each is the complete, unedited output of one real run — questions, answers, expected answers,
grading rationale, failures, and removed items. Open one and check the work yourself.
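One way to check a run record is to recompute the headline number from it. The sketch below assumes the record is a JSON Lines file in which each row has a `verdict` field (`"correct"` or `"incorrect"`) and an optional `removed` flag. Those field names and the file layout are hypothetical, so adjust them to whatever the downloaded record actually contains.

```python
import json
from typing import Iterable


def recompute_accuracy(lines: Iterable[str]) -> tuple[int, int, float]:
    """Return (correct, scored, accuracy %) from JSON Lines rows.

    Assumed (hypothetical) row fields: "verdict" and optional "removed".
    """
    scored = 0
    correct = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("removed", False):
            continue  # removed items stay out of the denominator
        scored += 1
        if row["verdict"] == "correct":
            correct += 1
    if scored == 0:
        raise ValueError("run record contains no scored questions")
    return correct, scored, round(100.0 * correct / scored, 2)
```

With a record saved locally, passing an open file handle to `recompute_accuracy` gives the count and percentage to compare against the published table.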