Cheap Checks Before Expensive Ones: The Two-Part Judge Pattern

Every team shipping an AI feature eventually hits the same wall: the model says something confident and wrong, and now you need to know whether to trust any given output. The usual answers are “prompt it harder” or “use a bigger model.” Both are expensive, and neither actually verifies anything.

There’s a better pattern, and it’s old enough that it predates LLMs by decades: cheap deterministic filter first, expensive probabilistic check second. I built it into a Postgres health tool, but the shape is general. The whole thing is about 40 lines.

The setup: make the model commit

You can’t verify free text. So the first move is forcing the model to emit a structured claim with the evidence it’s relying on baked in:

class Finding(BaseModel):
    check_id: str
    claim: str
    cited_metrics: dict[str, Any] = {}   # {"shared_buffers": "32MB"} — the values it's citing
    severity: Severity

That cited_metrics field is the hinge. The model isn’t just asserting something; it’s telling you which numbers it based the assertion on. Now you have something to check.

Part 1: the free check

If the model cited a number, and you already have the real number, you don’t need AI to compare them. You need ==.

def ground_finding(finding, real_data: dict) -> str:
    if not finding.cited_metrics:
        return "unverifiable"
    for name, cited in finding.cited_metrics.items():
        actual = real_data.get(name)
        if actual is None:
            continue
        if _numbers_differ(cited, actual):
            return "ungrounded"     # the model cited a value that contradicts reality
    return "grounded"

The model claims shared_buffers = 256MB; you measured 32MB; the numbers differ; the finding is ungrounded. Caught for nothing — no tokens, no latency, no API call. This single function kills the most dangerous class of error (fabricated specifics) at the cheapest possible price.

One warning from experience: be careful how you compare. My first version used substring matching and decided “2” matched “200” because “2” is inside “200.” It was manufacturing false confidence — stamping fabricated numbers as verified. Numeric comparison must be numeric. Test it with adversarial pairs (“2” vs “200”, “5” vs “5GB”) or you’ll ship a verifier that lies.

Part 2: the expensive check, for what’s left

Grounding only works when there’s a number to check. Judgment calls (“this config is too aggressive for your write pattern”) have no single number. So those go to a vote — and the key is using more than one model:

def consensus_verdict(votes: list[dict]) -> tuple[str, str]:
    fp = [v for v in votes if v["verdict"] == "false_positive"]
    tp = [v for v in votes if v["verdict"] == "true_positive"]
    if len(fp) >= 2:                       # 2+ models agree it's bogus -> suppress
        return "false_positive", best_confidence(fp)
    if len(tp) >= 2 and not fp:            # 2+ agree it's real, none object -> confirm
        return "true_positive", "MEDIUM"
    return "uncertain", "LOW"              # no consensus -> keep it, flag it

Two design choices carry the whole thing:

Bias to keep. A finding is only suppressed when two models agree it’s false. One model can’t bury it. Pick your asymmetry on purpose: a false alarm wastes someone’s thirty seconds; a suppressed real problem wastes someone’s outage. Optimize against the expensive failure.

A jury beats a judge. Two different models (different vendors, even) catch different failure modes. Where they disagree, you learn something — that’s exactly the finding to surface as “uncertain” rather than guess.

What you’re really looking at is a division of labor. Part 1 hands the facts to a deterministic check, because facts have a right answer and a rule can’t hallucinate one. Part 2 hands the judgment to the model, because judgment is the thing it’s actually good at. I make the longer case for that split, and why most AI features get it backwards, in a separate piece.

The seam where it pays off

The orchestration, and the one line that matters most:

for f in findings:
    f.ground_status = ground_finding(f, real_data)
    if f.ground_status == "ungrounded":
        f.verdict = "false_positive"     # the free check already decided
        continue                         # <- do NOT call the LLM jury. We're done.
    f.verdict = consensus_verdict(await poll_models(f))

That continue is the entire pattern in one statement. A fabricated finding never reaches the expensive jury. You spend zero tokens voting on a claim a string comparison already refuted. The cheap filter clears the easy 80%; the expensive jury only convenes for the genuinely hard 20%.

Why this generalizes

Swap “Postgres findings” for whatever you’re building. A support bot’s quoted price or policy grounded against the actual knowledge-base record. An agent’s tool-call arguments validated against the tool’s real schema before anything executes. An LLM-as-judge eval score checked against whether the criteria it cited are actually in the rubric. The structure holds:

Make the model emit claims with citations, not prose.
Check the citations with the cheapest mechanism that works — exact match, regex, a lookup. Free is the goal.
Send only the survivors to the expensive check, and make that check a multi-model vote, not a single oracle.
Bias toward keeping/surfacing rather than silently suppressing.

The instinct in AI products right now is to throw more model at every problem. That gets you more fluent confidence, the same lack of verification, and a bigger bill. The deterministic filter is the unsexy part nobody blogs about, and it’s doing most of the work. (Sorry — it’s true.)

Full implementation, tests, and a line-by-line tutorial are open source. The grounding function is thirty lines. The hard part wasn’t writing it. The hard part was remembering that the cheapest check should always go first.