VERITROOPER Patent pending

An audit of seven LLMs across four regulated-knowledge domains: audited accuracy of 95–100% in 25 of the 28 model-domain pairs (lowest: 77.84%), every number reproducible from the run records.

Point a test harness at a body of written rules. Generate a fresh question set from it. Give a model the source evidence and have it answer, then audit each answer against that source — with a different vendor's model reviewing every contested verdict, so nothing confirms itself.
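The answer-then-audit loop can be sketched roughly as follows. This is a minimal illustration, not the harness itself: `Model`, `Verdict`, `normalize`, `pick_reviewer`, and `audit` are hypothetical names, the first-pass check is a naive normalized string match, and "contested" is assumed to mean "the first pass did not match". The one rule taken from the description above is that the reviewer must come from a different vendor than the model under test.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Model:
    name: str
    vendor: str
    ask: Callable[[str], str]  # prompt -> reply; hypothetical stand-in for an API client


@dataclass
class Verdict:
    correct: bool
    contested: bool
    reviewed_by: Optional[str] = None


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(text.lower().split())


def pick_reviewer(under_test: Model, pool: List[Model]) -> Model:
    """Return the first model in the pool whose vendor differs from the model under test."""
    for candidate in pool:
        if candidate.vendor != under_test.vendor:
            return candidate
    raise ValueError("no reviewer from a different vendor is available")


def audit(question: str, evidence: str, expected: str,
          under_test: Model, pool: List[Model]) -> Verdict:
    """Answer from the supplied evidence, then audit the answer."""
    answer = under_test.ask(f"Evidence:\n{evidence}\n\nQuestion: {question}")
    # Assumption: an exact normalized match is uncontested and needs no review.
    if normalize(answer) == normalize(expected):
        return Verdict(correct=True, contested=False)
    # Anything else is treated as contested and goes to a cross-vendor reviewer.
    reviewer = pick_reviewer(under_test, pool)
    reply = reviewer.ask(
        "Using only the evidence, is the answer supported? "
        "Reply SUPPORTED or UNSUPPORTED.\n"
        f"Evidence:\n{evidence}\nQuestion: {question}\nAnswer: {answer}"
    )
    supported = reply.strip().upper().startswith("SUPPORTED")
    return Verdict(correct=supported, contested=True, reviewed_by=reviewer.name)


# Stub models so the sketch runs without network access.
tested = Model("model-a", "vendor-a", lambda prompt: "30 days")
same_vendor = Model("model-a2", "vendor-a", lambda prompt: "UNSUPPORTED")
other_vendor = Model("model-b", "vendor-b", lambda prompt: "SUPPORTED")

verdict = audit("What is the deadline?", "(evidence text)", "thirty days",
                tested, [same_vendor, other_vendor])
```

In this toy run the answer "30 days" does not string-match "thirty days", so the verdict is contested and routed past the same-vendor model to the other vendor's reviewer. A real harness would need a stricter definition of "contested" and a more robust reply parser than a prefix check.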

Below is the full matrix: every model, every domain, baseline and audited, low scores included. The run records are downloadable so you can check the work yourself.

I'm sharing this because I want technical criticism — particularly on evaluator bias, the unanswerable questions, and whether this is a sound way to audit retrieval-grounded answers. Tell me where it's wrong.

Results

Baseline = the model with standard BM25 retrieval, fetching its own evidence. Audited = headline accuracy after the pipeline, with malformed questions excluded symmetrically from both columns. Δ = Audited minus Baseline, in percentage points: the gap a team could recover. Nothing is cherry-picked. The full matrix is shown, including Llama 3.1 70B's weak OSHA and FDA results, so the numbers cannot be dismissed as a selected subset.
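As a rough illustration of how the three columns relate, the sketch below computes Baseline, Audited, and Δ over the same set of questions, dropping malformed questions from both arms before either percentage is taken. The `Row` structure and the `score` function are hypothetical names for this example. The page does not specify the real run-record layout or rounding, so both are assumptions here.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Row:
    question_id: str
    baseline_correct: bool  # model with its own BM25-retrieved evidence
    audited_correct: bool   # verdict after the audit pipeline


def score(rows: Iterable[Row], malformed_ids: Set[str]) -> Tuple[float, float, float, int]:
    """Return (baseline %, audited %, delta in pp, scored question count)."""
    # Symmetric exclusion: a malformed question is removed from BOTH arms,
    # so the two percentages share one denominator.
    kept = [r for r in rows if r.question_id not in malformed_ids]
    if not kept:
        raise ValueError("no scored questions left after exclusion")
    n = len(kept)
    baseline = round(100.0 * sum(r.baseline_correct for r in kept) / n, 2)
    audited = round(100.0 * sum(r.audited_correct for r in kept) / n, 2)
    return baseline, audited, round(audited - baseline, 2), n


# Toy data: five questions, one of them flagged as malformed.
rows = [
    Row("q1", True, True),
    Row("q2", False, True),
    Row("q3", True, True),
    Row("q4", True, True),
    Row("q5", False, False),  # malformed, excluded from both arms
]
result = score(rows, malformed_ids={"q5"})
print(result)  # (75.0, 100.0, 25.0, 4)
```

Because both percentages share one denominator, Δ cannot be inflated by removing questions from only the arm where they were answered incorrectly.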

OSHA — 29 CFR workplace-safety regulations

Model under test    Baseline (%)    Audited (%)    Δ (pp)
Claude Opus 4.8     93.88           99.90          +6.02
GPT-5.5             93.85           99.80          +5.95
Gemini 2.5 Pro      92.79           99.60          +6.81
Qwen 2.5 72B        92.75           99.60          +6.85
Gemma 27B           88.35           98.69          +10.34
Qwen 2.5 7B         87.94           97.29          +9.35
Llama 3.1 70B       63.54           77.84          +14.30

~992–998 scored questions per model after malformed-question exclusion.

FDA — prescription drug labels

Model under test    Baseline (%)    Audited (%)    Δ (pp)
Claude Opus 4.8     94.96           100.00         +5.04
GPT-5.5             94.47           99.90          +5.43
Gemini 2.5 Pro      93.60           99.70          +6.10
Qwen 2.5 72B        93.13           99.49          +6.36
Gemma 27B           91.29           98.99          +7.70
Qwen 2.5 7B         89.09           95.66          +6.57
Llama 3.1 70B       64.84           80.85          +16.01

~987–1,000 scored questions per model after malformed-question exclusion.

SEC — 10-K financial filings

Model under test    Baseline (%)    Audited (%)    Δ (pp)
Claude Opus 4.8     89.67           99.18          +9.51
Gemini 2.5 Pro      90.12           99.09          +8.97
GPT-5.5             87.53           99.08          +11.55
Llama 3.1 70B       86.75           97.69          +10.94
Qwen 2.5 72B        90.85           96.64          +5.79
Gemma 27B           81.67           95.67          +14.00
Qwen 2.5 7B         87.12           91.66          +4.54

~900–992 scored questions per model (API calls dropped by a mid-run PC restart were excluded symmetrically from both arms).

IRS — U.S. tax code

Model under test    Baseline (%)    Audited (%)    Δ (pp)
Claude Opus 4.8     94.36           100.00         +5.64
GPT-5.5             93.04           99.70          +6.66
Gemini 2.5 Pro      93.99           99.40          +5.41
Gemma 27B           92.51           98.68          +6.17
Qwen 2.5 72B        86.76           98.19          +11.43
Llama 3.1 70B       87.69           97.70          +10.01
Qwen 2.5 7B         86.58           96.67          +10.09

~988–999 scored questions per model after malformed-question exclusion.

How the audit works

For a given corpus of written rules, the harness:

Known limitations

Download a run record

Each is the complete, unedited output of one real run — questions, answers, expected answers, grading rationale, failures, and removed items. Open one and check the work yourself.

IRS · Claude Opus 4.8: 94.36% → 100.00% · perfect run
SEC 10-K · GPT-5.5: 87.53% → 99.08% · +11.55pp recovered
OSHA · Gemma 3 27B: 88.35% → 98.69% · +10.34pp · open-weight

(Download links are placeholders in this mockup — the packaged run records aren't wired up yet.)
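Once a record is in hand, checking a headline number could look something like the sketch below. The record format is not published on this page, so everything about the layout is an assumption made for illustration: a JSON object with an `items` list (each with hypothetical `id` and `verdict` fields) and a `removed_items` list of excluded question ids. The function name `recompute_accuracy` is likewise hypothetical.

```python
import json
from typing import Any, Dict


def recompute_accuracy(record: Dict[str, Any]) -> float:
    """Recompute accuracy (%) from a run record, skipping removed items.

    Assumes a hypothetical layout; adjust field names to the real record.
    """
    removed = set(record.get("removed_items", []))
    scored = [item for item in record["items"] if item["id"] not in removed]
    if not scored:
        raise ValueError("record contains no scored items")
    correct = sum(1 for item in scored if item["verdict"] == "correct")
    return round(100.0 * correct / len(scored), 2)


# Inline toy record standing in for a downloaded file.
sample_record = json.loads("""
{
  "items": [
    {"id": "q1", "verdict": "correct"},
    {"id": "q2", "verdict": "correct"},
    {"id": "q3", "verdict": "incorrect"},
    {"id": "q4", "verdict": "correct"},
    {"id": "q5", "verdict": "incorrect"}
  ],
  "removed_items": ["q5"]
}
""")

accuracy = recompute_accuracy(sample_record)
print(accuracy)  # 75.0
```

The useful property is that the figure is derived only from per-question verdicts and the removed-item list, so a reader can confirm both the numerator and the denominator independently.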
