deep-dive
Cheap Checks Before Expensive Ones: The Two-Part Judge Pattern
The two-part judge pattern in ~40 lines: a deterministic filter first, multi-model consensus second. It generalizes to RAG, extraction, and code review.
Every team shipping an AI feature eventually hits the same wall: the model says something confident and wrong, and now you need to know whether to trust any given output. The usual answers are “prompt it harder” or “use a bigger model.” Both are expensive, and neither actually verifies anything.
There’s a better pattern, and it’s old enough that it predates LLMs by decades: cheap deterministic filter first, expensive probabilistic check second. I built it into a Postgres health tool, but the shape is general. The whole thing is about 40 lines.
The setup: make the model commit
You can’t verify free text. So the first move is forcing the model to emit a structured claim with the evidence it’s relying on baked in:
class Finding(BaseModel):
check_id: str
claim: str
cited_metrics: dict[str, Any] = {} # {"shared_buffers": "32MB"} — the values it's citing
severity: Severity
That cited_metrics field is the hinge. The model isn’t just asserting something; it’s telling you
which numbers it based the assertion on. Now you have something to check.
Part 1: the free check
If the model cited a number, and you already have the real number, you don’t need AI to compare them.
You need ==.
def ground_finding(finding, real_data: dict) -> str:
if not finding.cited_metrics:
return "unverifiable"
for name, cited in finding.cited_metrics.items():
actual = real_data.get(name)
if actual is None:
continue
if _numbers_differ(cited, actual):
return "ungrounded" # the model cited a value that contradicts reality
return "grounded"
The model claims shared_buffers = 256MB; you measured 32MB; the numbers differ; the finding is
ungrounded. Caught for nothing — no tokens, no latency, no API call. This single function kills the
most dangerous class of error (fabricated specifics) at the cheapest possible price.
One warning from experience: be careful how you compare. My first version used substring matching and decided “2” matched “200” because “2” is inside “200.” It was manufacturing false confidence — stamping fabricated numbers as verified. Numeric comparison must be numeric. Test it with adversarial pairs (“2” vs “200”, “5” vs “5GB”) or you’ll ship a verifier that lies.
Part 2: the expensive check, for what’s left
Grounding only works when there’s a number to check. Judgment calls (“this config is too aggressive for your write pattern”) have no single number. So those go to a vote — and the key is using more than one model:
def consensus_verdict(votes: list[dict]) -> tuple[str, str]:
fp = [v for v in votes if v["verdict"] == "false_positive"]
tp = [v for v in votes if v["verdict"] == "true_positive"]
if len(fp) >= 2: # 2+ models agree it's bogus -> suppress
return "false_positive", best_confidence(fp)
if len(tp) >= 2 and not fp: # 2+ agree it's real, none object -> confirm
return "true_positive", "MEDIUM"
return "uncertain", "LOW" # no consensus -> keep it, flag it
Two design choices carry the whole thing:
Bias to keep. A finding is only suppressed when two models agree it’s false. One model can’t bury it. Pick your asymmetry on purpose: a false alarm wastes someone’s thirty seconds; a suppressed real problem wastes someone’s outage. Optimize against the expensive failure.
A jury beats a judge. Two different models (different vendors, even) catch different failure modes. Where they disagree, you learn something — that’s exactly the finding to surface as “uncertain” rather than guess.
What you’re really looking at is a division of labor. Part 1 hands the facts to a deterministic check, because facts have a right answer and a rule can’t hallucinate one. Part 2 hands the judgment to the model, because judgment is the thing it’s actually good at. I make the longer case for that split, and why most AI features get it backwards, in a separate piece.
The seam where it pays off
The orchestration, and the one line that matters most:
for f in findings:
f.ground_status = ground_finding(f, real_data)
if f.ground_status == "ungrounded":
f.verdict = "false_positive" # the free check already decided
continue # <- do NOT call the LLM jury. We're done.
f.verdict = consensus_verdict(await poll_models(f))
That continue is the entire pattern in one statement. A fabricated finding never reaches the
expensive jury. You spend zero tokens voting on a claim a string comparison already refuted. The cheap
filter clears the easy 80%; the expensive jury only convenes for the genuinely hard 20%.
Why this generalizes
Swap “Postgres findings” for whatever you’re building. A support bot’s quoted price or policy grounded against the actual knowledge-base record. An agent’s tool-call arguments validated against the tool’s real schema before anything executes. An LLM-as-judge eval score checked against whether the criteria it cited are actually in the rubric. The structure holds:
- Make the model emit claims with citations, not prose.
- Check the citations with the cheapest mechanism that works — exact match, regex, a lookup. Free is the goal.
- Send only the survivors to the expensive check, and make that check a multi-model vote, not a single oracle.
- Bias toward keeping/surfacing rather than silently suppressing.
The instinct in AI products right now is to throw more model at every problem. That gets you more fluent confidence, the same lack of verification, and a bigger bill. The deterministic filter is the unsexy part nobody blogs about, and it’s doing most of the work. (Sorry — it’s true.)
Full implementation, tests, and a line-by-line tutorial are open source. The grounding function is thirty lines. The hard part wasn’t writing it. The hard part was remembering that the cheapest check should always go first.