VERI★TROOPER Patent pending

An audit of seven LLMs across four regulated-knowledge domains — 80.8–99.2% audited accuracy, every number reproducible from the run records.

Point a test harness at a body of written rules. Generate a fresh question set from it. Give a model the source evidence and have it answer, then audit each answer against that source — with a different vendor's model reviewing every contested verdict, so nothing confirms itself.

Below is the full matrix: every model, every domain, baseline and audited, low scores included. The run records are downloadable so you can check the work yourself.

Results

Baseline = the model fetching its own evidence with standard BM25 retrieval. Audited = headline accuracy after the pipeline. Malformed questions are excluded symmetrically from both columns. Δ = Audited minus Baseline, in percentage points: the gap a team could recover. The full matrix is shown, including a weak 70B model, so nothing is cherry-picked.
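The column definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's own code: the `Item` record and its field names are hypothetical, and the toy data is made up purely to show the arithmetic. The point it demonstrates is that malformed items are dropped once, so Baseline and Audited share a single denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical per-question record; field names are assumptions.
    qid: str
    baseline_correct: bool
    audited_correct: bool
    malformed: bool = False  # bad question or broken ground truth


def score(items: list[Item]) -> dict:
    """Return baseline %, audited %, and delta in percentage points."""
    # Drop malformed items once, so both columns use the same denominator.
    scored = [it for it in items if not it.malformed]
    if not scored:
        raise ValueError("no scorable items left after exclusion")
    n = len(scored)
    baseline = 100.0 * sum(it.baseline_correct for it in scored) / n
    audited = 100.0 * sum(it.audited_correct for it in scored) / n
    return {
        "n_scored": n,
        "n_removed": len(items) - n,  # shown, not hidden
        "baseline": round(baseline, 2),
        "audited": round(audited, 2),
        "delta": round(audited - baseline, 2),
    }


# Toy data: four scorable questions and one malformed one.
toy_items = [
    Item("q1", True, True),
    Item("q2", False, True),
    Item("q3", True, True),
    Item("q4", False, False),
    Item("q5", False, False, malformed=True),
]
toy_result = score(toy_items)
print(toy_result)
```

On the toy data the malformed item leaves both columns, so the baseline is 2 of 4 and the audited figure is 3 of 4. Removing it from only one column would shift Δ without any change in model behaviour, which is why the exclusion has to be symmetric.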

OSHA — 29 CFR workplace-safety regulations

Model under test	Baseline (%)	Audited (%)	Δ (pp)
Claude Opus 4.8	94.89	99.00	+4.11
GPT-5.5	92.40	98.80	+6.40
Gemini 2.5 Pro	93.20	98.00	+4.80
Qwen 2.5 72B	93.20	98.30	+5.10
Gemma 3 27B	89.50	98.40	+8.90
Qwen 2.5 7B	89.60	97.00	+7.40
Llama 3.1 70B	79.90	95.80	+15.90

999–1,000 scored questions per model after malformed-question exclusion.

FDA — prescription drug labels

Model under test	Baseline (%)	Audited (%)	Δ (pp)
Claude Opus 4.8	93.41	98.60	+5.19
GPT-5.5	90.02	98.20	+8.18
Gemini 2.5 Pro	90.92	96.81	+5.89
Qwen 2.5 72B	91.32	98.30	+6.98
Gemma 3 27B	87.13	96.81	+9.68
Qwen 2.5 7B	87.23	94.51	+7.28
Llama 3.1 70B	69.26	80.84	+11.58

1,001–1,002 scored questions per model after malformed-question exclusion.

SEC — 10-K financial filings

Model under test	Baseline (%)	Audited (%)	Δ (pp)
Claude Opus 4.8	86.69	96.30	+9.61
Gemini 2.5 Pro	88.10	97.40	+9.30
GPT-5.5	82.60	96.10	+13.50
Llama 3.1 70B	87.08	95.52	+8.44
Qwen 2.5 72B	89.53	94.76	+5.23
Gemma 3 27B	82.79	94.96	+12.17
Qwen 2.5 7B	85.65	90.89	+5.24

912–1,000 scored questions per model (API calls dropped during a mid-run PC restart were excluded symmetrically from both columns).

IRS — U.S. tax code

Model under test	Baseline (%)	Audited (%)	Δ (pp)
Claude Opus 4.8	93.01	99.20	+6.19
GPT-5.5	91.02	99.20	+8.18
Gemini 2.5 Pro	92.71	97.11	+4.40
Gemma 3 27B	88.63	97.99	+9.36
Qwen 2.5 72B	86.33	98.20	+11.87
Llama 3.1 70B	86.03	97.41	+11.38
Qwen 2.5 7B	86.12	94.87	+8.75

994–1,002 scored questions per model after malformed-question exclusion.

How the audit works

For a given body of written rules, the harness:

Generates a fresh question set from the source text — including intentionally unanswerable questions, to catch a model that bluffs rather than declines.
Gives the model the correct source passage and records its answer, then audits that answer against the same passage and the expected answer.
Sends only contested verdicts (a median of 4% of answers — rising to about 35% on the weakest model we have tested) to a different vendor's frontier model for an independent second opinion. Vendor rotation is enforced in code, so no model ever reviews its own output. Nothing confirms itself.
Removes malformed test items (bad questions, broken ground truth) from the denominator — symmetrically, from both the baseline and audited columns — and shows the count rather than hiding it.
Outputs a full evidence package: every question, the model's answer, the expected answer, the grading rationale, the failures, the removed items, timestamps, and model names. The package also includes a failure-pattern analysis that groups the errors by cause, along with concrete engineering recommendations (what to retrain, re-chunk, or re-ground) to close the gap. Current runs add a confidence interval and a PASS / CONDITIONAL / FAIL disposition, role-specific documents, a machine-readable record, and a cryptographic signature plus trusted timestamp with a standalone verifier.
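The cross-vendor review step in the list above could be enforced along these lines. This is a sketch under stated assumptions, not the harness's implementation: the `VENDOR_OF` map, the `Verdict` record, the model identifiers, and the "first eligible reviewer" policy are all hypothetical stand-ins for whatever the real pipeline uses.

```python
from dataclasses import dataclass

# Illustrative vendor map; the harness's real model registry is an assumption here.
VENDOR_OF = {
    "claude-opus": "anthropic",
    "gpt": "openai",
    "gemini": "google",
}


@dataclass(frozen=True)
class Verdict:
    # Hypothetical first-pass audit result for one question.
    qid: str
    model_under_test: str
    correct: bool
    contested: bool  # first-pass audit could not settle the grade


def pick_reviewer(model_under_test: str, reviewers: list[str]) -> str:
    """Return the first reviewer whose vendor differs from the model under test."""
    # Models missing from the map (e.g. open-weight ones) have no vendor to clash with.
    own_vendor = VENDOR_OF.get(model_under_test)
    for reviewer in reviewers:
        if VENDOR_OF[reviewer] != own_vendor:
            return reviewer
    raise RuntimeError(f"no cross-vendor reviewer for {model_under_test}")


def route(verdicts: list[Verdict], reviewers: list[str]) -> dict[str, str]:
    """Map each contested verdict's qid to a cross-vendor reviewer; skip the rest."""
    return {
        v.qid: pick_reviewer(v.model_under_test, reviewers)
        for v in verdicts
        if v.contested
    }


demo_verdicts = [
    Verdict("q1", "gpt", correct=True, contested=False),
    Verdict("q2", "gpt", correct=False, contested=True),
    Verdict("q3", "claude-opus", correct=False, contested=True),
]
demo_routes = route(demo_verdicts, ["gpt", "claude-opus", "gemini"])
print(demo_routes)
```

Raising an error when no eligible reviewer exists, rather than falling back to a same-vendor model, is what makes the "nothing confirms itself" rule a hard constraint instead of a preference.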

Known limitations

The audited number is bounded by the model under test. The harness measures truth; it does not manufacture it. A model that can't reason to the answer stays wrong, and the score says so.
Audited accuracy reflects an oracle-grounded check, in which the model is handed the correct source passage, not a live serving uplift. The gap to baseline is measured accuracy a team is leaving on the table; it is not a claim about a deployed RAG stack.
The contested-verdict reviewer is itself an LLM. The architecture's job is to make imperfect reviewers produce reliable verdicts; cross-vendor rotation is how it limits any single model's blind spot.
Each result is point-in-time and reported with a confidence interval; a small sample or a thin question category is flagged rather than over-claimed.
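To show why a small sample or a thin question category deserves a flag, here is one common way to put a confidence interval on an accuracy figure. The page does not say which interval method the harness uses; the Wilson score interval below is an assumption chosen for illustration, and the counts in the example are made up rather than taken from a run record.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative counts only: the same 99% accuracy at two sample sizes.
large_low, large_high = wilson_interval(990, 1000)
small_low, small_high = wilson_interval(99, 100)
print(f"n=1000: {large_low:.4f} to {large_high:.4f}")
print(f"n=100:  {small_low:.4f} to {small_high:.4f}")
```

The same observed accuracy yields a noticeably wider interval at 100 questions than at 1,000, which is the reason a thin category is better reported with its interval than as a bare percentage.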

Wondering why a team couldn’t just build this in-house? Here’s the honest answer →

Download a run record

Each is the complete, unedited output of one real run — questions, answers, expected answers, grading rationale, failures, and removed items. Open one and check the work yourself.

IRS · Claude Opus 4.8 · 93.01% → 99.20% · +6.19pp recovered
SEC 10-K · GPT-5.5 · 82.60% → 96.10% · +13.50pp recovered
OSHA · Gemma 3 27B · 89.50% → 98.40% · +8.90pp recovered · open-weight
OSHA · Llama 3.1 70B · 79.90% → 95.80% · +15.90pp recovered

Every record holds ~1,000 questions drawn from the source material, scored baseline (BM25 retrieval) vs. audited, with contested verdicts cross-checked by a different vendor's model and malformed items excluded symmetrically. Every figure is reproducible from the timestamped logs inside. Each record is cryptographically signed and ships with a standalone checker, so you can verify the seal as well as the work.