Goldfish
Trustworthy AI · intermediate · 51 min

Tutorial 2 — Build an LLM-as-Judge: Catching Hallucinations Before They Ship

Build the LLM-as-judge verification layer: a cheap deterministic grounding check plus a multi-model consensus vote — including the substring-grounding bug in its natural habitat.

Matt Yonkovit

Build-along — part 2 of the evolution (after Tutorial 1; the arc starts at Tutorial 0).

The problem: the LLM hallucinates facts. It will cite a number that’s flatly wrong, and prose gives you no way to catch it. This is v0’s most dangerous flaw, because it fails toward confident wrongness.

By the end: a two-part judge that grounds every cited number against the real telemetry (a free, deterministic heuristic) and juries the judgment calls across multiple models. That’s the heuristics-vs-LLM division of labor, made into code.

Prereqs: starts from tutorial-2-start (the end of Tutorial 1); finished state is tutorial-2-judge-complete. Real code, real prompts, and the best bug the review caught.

Get something to run against

To see the judge actually catch a hallucination, you need a database with real problems for the model to (sometimes) lie about. Grab the sample dataset, a deliberately misconfigured Postgres instance:

# Download ../sample/pg-healthcheck-sample-data.zip, unzip, then:
docker compose -f docker-compose.demo.yml up -d
pg-healthcheck audit --host localhost --port 5432 --db demo --user postgres --password demo_password --quick

What’s intentionally broken: ../sample/README.md. What the output looks like, including a finding the judge marks ungrounded and suppresses: ../sample/EXAMPLE-REPORT.md.

The uncomfortable question

Tutorial 1 got us structured findings. The analyzer now emits a Finding with a claim and the cited_metrics it’s based on, instead of a wall of prose. Progress.

But a finding is still just a claim. The model said “shared_buffers is 32MB, too low.” Cool. Is it? Did the model read the actual setting, or did it pattern-match its way to a plausible-sounding number? You don’t know. And “the AI said so” is not a sentence I want to put in front of a DBA who’s about to change a production config.

The instinct is to reach for a smarter model. Don’t — a smarter model hallucinates more convincingly, which is worse. You fix it with a process — the same way newsrooms do. You verify the claim against the source, and you get a second opinion before you run the story.

That’s the judge. Two parts, and the order matters:

  1. Grounding — a dirt-cheap, deterministic check: does the number the model cited actually appear in the data we collected? No LLM. No tokens. Microseconds.
  2. Consensus — for the claims that survive grounding, a vote among multiple models, where a finding only gets killed if the jury agrees.

Cheap check first, expensive check second. Hold that thought, because it’s the whole lesson.

Part 1: grounding, or “did you actually read the gauge?”

The trick that makes grounding possible is the thing we built in Tutorial 1: the model had to put its cited numbers in the finding. So now we can check them. Every audit already collected the real telemetry into each check’s details. Grounding just compares the two.

def ground_finding(finding: Finding, details_by_check: dict[str, dict]) -> str:
    """Returns 'grounded' | 'ungrounded' | 'unverifiable'. Pure, deterministic, zero LLM cost."""
    if not finding.cited_metrics:
        return "unverifiable"
    details = details_by_check.get(finding.check_id)
    if not details:
        return "unverifiable"

    saw_match = False
    for name, cited in finding.cited_metrics.items():
        if name not in details:
            continue
        outcome = _compare_metric(_norm(cited), _norm(details[name]))
        if outcome == "contradict":
            return "ungrounded"          # the model cited a number that contradicts the data
        if outcome == "match":
            saw_match = True
    return "grounded" if saw_match else "unverifiable"

The killer demo: the model claims shared_buffers is 256MB. The collected data says 32MB. _compare_metric sees two different numbers, returns "contradict", and the finding is flagged ungrounded, for free, before a single token is spent on it. The model literally made up a number, and a thirty-line pure function caught it.
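To make the three outcomes concrete, here is a minimal runnable sketch of that demo. The Finding dataclass, the check id memory_config, and the _norm and _compare_metric bodies are stand-ins invented for illustration (the tutorial's own comparator is covered in the next section); only ground_finding follows the listing above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Finding:
    # Hypothetical shape: just the fields ground_finding reads.
    check_id: str
    claim: str
    cited_metrics: dict[str, str] = field(default_factory=dict)


def _norm(value) -> str:
    # Stand-in normalizer (assumption): stringify, trim, lowercase.
    return str(value).strip().lower()


def _compare_metric(cited_n: str, actual_n: str) -> str:
    # Stand-in comparator (assumption): compare the first integer in each value.
    if cited_n == actual_n:
        return "match"
    c, a = re.search(r"\d+", cited_n), re.search(r"\d+", actual_n)
    if c and a:
        return "match" if int(c.group()) == int(a.group()) else "contradict"
    return "indeterminate"


def ground_finding(finding: Finding, details_by_check: dict[str, dict]) -> str:
    if not finding.cited_metrics:
        return "unverifiable"
    details = details_by_check.get(finding.check_id)
    if not details:
        return "unverifiable"
    saw_match = False
    for name, cited in finding.cited_metrics.items():
        if name not in details:
            continue
        outcome = _compare_metric(_norm(cited), _norm(details[name]))
        if outcome == "contradict":
            return "ungrounded"
        if outcome == "match":
            saw_match = True
    return "grounded" if saw_match else "unverifiable"


# Telemetry the audit collected (made-up values for the demo).
details_by_check = {"memory_config": {"shared_buffers": "32MB"}}

hallucinated = Finding("memory_config", "shared_buffers is 256MB", {"shared_buffers": "256MB"})
honest = Finding("memory_config", "shared_buffers is 32MB, too low", {"shared_buffers": "32MB"})
opinion = Finding("memory_config", "memory settings look conservative")  # cites nothing

results = [ground_finding(f, details_by_check) for f in (hallucinated, honest, opinion)]
print(results)  # ['ungrounded', 'grounded', 'unverifiable']
```

Note that the opinion-style finding is not rejected. It simply cannot be checked this way, which is why a second stage exists.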

This is lifted, in spirit, straight from the sibling migration-scanner project, which grounds regex matches by checking whether they’re inside comments or string literals. Same idea, different domain: a cheap deterministic filter that kills obvious garbage before the expensive model ever sees it.

The bug that is the entire point of this tutorial

The first version of the matcher was lazy. It compared values with substring containment: “does the cited string appear in the actual string, or vice versa?” Reads fine. Passes the happy-path test ("128MB" is a substring of "128MB (8% of RAM)", great).

Then the capstone review traced the adversarial cases and the floor fell out:

  • cited "2" vs actual "200" → "2" is a substring of "200" → grounded
  • cited "1" vs actual "100" → grounded
  • cited "5" vs actual "5GB" → grounded

Read that again. A finding citing max_connections = 2 against a real value of 200 passed as verified. The anti-hallucination layer was manufacturing false confidence — confidently stamping fabricated numbers as grounded. That is so much worse than no check at all, because now you trust the wrong thing.
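The failure is easy to reproduce. The function below is a reconstruction of the lazy idea, not the project's original code; the name naive_compare and its "contradict" fallback are assumptions made for the sketch.

```python
def naive_compare(cited: str, actual: str) -> str:
    # Substring containment in either direction: the lazy first attempt.
    return "match" if cited in actual or actual in cited else "contradict"


cases = [
    ("128MB", "128MB (8% of RAM)"),  # happy path: genuinely the same value
    ("2", "200"),                    # fabricated, yet "2" is inside "200"
    ("1", "100"),
    ("5", "5GB"),
]
naive_results = {pair: naive_compare(*pair) for pair in cases}
for (cited, actual), outcome in naive_results.items():
    print(f"cited {cited!r} vs actual {actual!r}: {outcome}")
```

Every one of these prints "match", including the three fabricated values. The happy-path case and the adversarial cases are indistinguishable to a substring test.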

The fix is numeric-aware comparison: if both values are numbers, they have to be equal to match; different numbers contradict; non-numeric differences return "indeterminate" and get punted to the consensus vote instead of guessed at.

def _compare_metric(cited_n: str, actual_n: str) -> str:
    if cited_n == actual_n:
        return "match"
    cited_num, actual_num = _extract_number(cited_n), _extract_number(actual_n)
    if cited_num is not None and actual_num is not None:
        return "match" if cited_num == actual_num else "contradict"
    return "indeterminate"   # don't guess — let Part 2 decide

So what’s the lesson, beyond “write better matchers”? The most dangerous bug in a verification system is the one that fails toward false confidence and passes every happy-path test. Nothing crashed. The one test we had was green. It would have shipped, and it would have made the tool less trustworthy while looking more trustworthy. The adversarial test cases — "2" vs "200" — are now permanent residents of the test suite. Write the mean tests.

Part 2: consensus, or “never let one model be the judge”

Grounding is binary and dumb (on purpose). Plenty of real findings are unverifiable — the model made a judgment call that no single number confirms (“autovacuum is too conservative for this write pattern”). You can’t ground that. But you can put it to a vote.

def consensus_verdict(votes: list[dict]) -> tuple[str, str]:
    """Bias-to-keep: a finding is suppressed ONLY when 2+ providers agree it's false."""
    valid = [v for v in votes if v.get("verdict") in _VALID_VERDICTS]
    if not valid:
        return "uncertain", "LOW"
    fp = [v for v in valid if v["verdict"] == "false_positive"]
    tp = [v for v in valid if v["verdict"] == "true_positive"]
    if len(fp) >= 2:
        return "false_positive", best_confidence(fp)
    if len(tp) >= 2 and not fp:
        return "true_positive", "MEDIUM"
    if len(valid) == 1:
        v = valid[0]
        if v["verdict"] == "false_positive":
            return "uncertain", "LOW"    # a lone dissent is advisory: it cannot suppress
        return v["verdict"], v.get("confidence") or "LOW"
    return "uncertain", "LOW"

Two design choices worth internalizing:

Bias-to-keep. A finding is only suppressed (marked false-positive) when at least two providers agree it’s bogus. One model’s “nah, that’s fine” can never bury a finding on its own. The asymmetry is deliberate: a noisy false positive costs a DBA thirty seconds of eye-rolling; a suppressed real problem costs them a 3 a.m. page. We optimize against the expensive failure.

A jury, not a judge. Configure a second provider (judge_provider_2 in .env) and the same finding gets reviewed by, say, GPT-4o and Claude. They have to agree to act. One vote is advisory — it gets recorded, but it can’t suppress. (Run with one provider and you still get grounding plus an advisory opinion. Run with two and you get a real jury.)
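Both choices can be checked by feeding the function a few hand-written vote lists. In this sketch, consensus_verdict follows the listing above, while _VALID_VERDICTS and best_confidence are guesses at helpers the tutorial doesn't show (a three-value verdict set, and "take the highest confidence among the votes").

```python
# Assumed helper values: not shown in the tutorial.
_VALID_VERDICTS = ("true_positive", "false_positive", "uncertain")
_CONFIDENCE_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def best_confidence(votes: list[dict]) -> str:
    # Highest recognized confidence among the votes; "LOW" if none is recognized.
    known = [v.get("confidence") for v in votes if v.get("confidence") in _CONFIDENCE_RANK]
    return max(known, key=_CONFIDENCE_RANK.__getitem__, default="LOW")


def consensus_verdict(votes: list[dict]) -> tuple[str, str]:
    valid = [v for v in votes if v.get("verdict") in _VALID_VERDICTS]
    if not valid:
        return "uncertain", "LOW"
    fp = [v for v in valid if v["verdict"] == "false_positive"]
    tp = [v for v in valid if v["verdict"] == "true_positive"]
    if len(fp) >= 2:
        return "false_positive", best_confidence(fp)
    if len(tp) >= 2 and not fp:
        return "true_positive", "MEDIUM"
    if len(valid) == 1:
        v = valid[0]
        if v["verdict"] == "false_positive":
            return "uncertain", "LOW"
        return v["verdict"], v.get("confidence") or "LOW"
    return "uncertain", "LOW"


fp_high = {"verdict": "false_positive", "confidence": "HIGH"}
fp_med = {"verdict": "false_positive", "confidence": "MEDIUM"}
tp_high = {"verdict": "true_positive", "confidence": "HIGH"}

lone_dissent = consensus_verdict([fp_high])         # one "bogus" vote cannot suppress
jury_agrees = consensus_verdict([fp_med, fp_high])  # two agree, so the finding is suppressed
split_jury = consensus_verdict([tp_high, fp_high])  # disagreement keeps the finding
garbage = consensus_verdict([{"verdict": "banana"}])

print(lone_dissent, jury_agrees, split_jury, garbage)
```

The first case is the asymmetry in action: a single HIGH-confidence false_positive vote still comes back as uncertain/LOW, so the finding stays in the report.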

Each model’s vote comes from a prompt that hands it the claim and the actual evidence, then forces a JSON verdict — parsed by the same never-raise-never-lie parser pattern from Tutorial 1, because model output is still untrusted no matter how official the prompt sounds.
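The prompt and parser aren't reproduced here, so the following is only a rough sketch of the pattern. The prompt wording, the names JUDGE_PROMPT, build_prompt and parse_vote, and the choice to return a vote with verdict None on failure (so consensus_verdict filters it out) are all assumptions, not the project's code.

```python
import json

# Tuples rather than sets, so an unhashable value from the model can't raise on "in".
_VALID_VERDICTS = ("true_positive", "false_positive", "uncertain")
_VALID_CONFIDENCE = ("LOW", "MEDIUM", "HIGH")

# Hypothetical prompt wording, for illustration only.
JUDGE_PROMPT = """You are reviewing one finding from a Postgres health check.
Claim: {claim}
Evidence (collected telemetry, JSON): {evidence}
Reply with JSON only:
{{"verdict": "true_positive" | "false_positive" | "uncertain", "confidence": "LOW" | "MEDIUM" | "HIGH"}}
"""


def build_prompt(claim: str, evidence: dict) -> str:
    return JUDGE_PROMPT.format(claim=claim, evidence=json.dumps(evidence, sort_keys=True))


def parse_vote(raw) -> dict:
    """Never raise, never invent a verdict: bad output becomes a vote with verdict None."""
    try:
        start, end = raw.index("{"), raw.rindex("}") + 1
        data = json.loads(raw[start:end])
    except (ValueError, TypeError, AttributeError):
        return {"verdict": None, "confidence": None, "error": "unparseable"}
    if not isinstance(data, dict) or data.get("verdict") not in _VALID_VERDICTS:
        return {"verdict": None, "confidence": None, "error": "invalid verdict"}
    confidence = data.get("confidence")
    if confidence not in _VALID_CONFIDENCE:
        confidence = "LOW"  # unknown confidence is downgraded, not trusted
    return {"verdict": data["verdict"], "confidence": confidence}


prompt = build_prompt("shared_buffers is 32MB, too low", {"shared_buffers": "32MB"})
good = parse_vote('Sure! {"verdict": "false_positive", "confidence": "HIGH"}')
chatty = parse_vote("I think this one is probably fine.")
made_up = parse_vote('{"verdict": "definitely_wrong", "confidence": "HIGH"}')
print(good, chatty, made_up)
```

Returning verdict None rather than a default verdict keeps the parser from voting on the model's behalf. A broken reply then counts as no vote at all.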

The seam where it all pays off

The orchestrator, and the four most important lines in the project:

async def judge(self, findings, details_by_check):
    for f in findings:
        f.ground_status = ground_finding(f, details_by_check)
        if f.ground_status == "ungrounded":
            f.verdict = "false_positive"           # caught deterministically
            f.confidence = "HIGH"
            continue                                # ← do NOT call the LLM. We already know.
        if not self._providers:
            f.verdict = "unverified"                # no jury configured: grounding result only
            continue
        votes = await judge_finding(f, self._evidence_for(f, details_by_check), self._providers)
        f.verdict, f.confidence = consensus_verdict(votes)

That continue is the “cheap before expensive” principle as executable code. A finding that cites a fabricated number never reaches Part 2. We don’t pay a single token to have a jury of models vote on a claim we already disproved with a deterministic comparison. The cheap deterministic filter does the bulk-clearing; the expensive probabilistic jury only convenes for the cases that actually need judgment.

Verify it the way the test suite does:

git checkout tutorial-2-judge-complete
pytest tests/unit/test_judge_grounding.py tests/unit/test_judge_consensus.py tests/unit/test_judge_orchestrator.py -v

Watch test_ungrounded_is_caught_deterministically_without_llm in particular. It asserts the verdict and then calls provider.analyze.assert_not_awaited(), proving zero tokens were spent on the hallucinated finding. That assertion is the whole philosophy in one line.
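If you haven't used AsyncMock's await assertions before, here is roughly what a test of that shape can look like. MiniJudge is a trimmed stand-in written for this sketch, with an injected grounding function and a placeholder where the real vote would go. It is not the project's orchestrator or its actual test.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock


class MiniJudge:
    """Trimmed stand-in for the orchestrator: grounding first, providers second."""

    def __init__(self, providers, ground):
        self._providers = providers
        self._ground = ground  # injected so the test controls the grounding result

    async def judge(self, findings, details_by_check):
        for f in findings:
            f.ground_status = self._ground(f, details_by_check)
            if f.ground_status == "ungrounded":
                f.verdict, f.confidence = "false_positive", "HIGH"
                continue  # short-circuit: no provider is called
            for provider in self._providers:
                await provider.analyze(f.claim)
            f.verdict, f.confidence = "uncertain", "LOW"  # placeholder for the real vote


def run_judge(ground_status: str):
    provider = SimpleNamespace(analyze=AsyncMock())
    finding = SimpleNamespace(claim="shared_buffers is 256MB")
    judge = MiniJudge([provider], ground=lambda f, details: ground_status)
    asyncio.run(judge.judge([finding], {}))
    return finding, provider


def test_ungrounded_skips_the_llm():
    finding, provider = run_judge("ungrounded")
    assert finding.verdict == "false_positive"
    provider.analyze.assert_not_awaited()  # zero provider calls, so zero tokens


def test_grounded_still_reaches_the_jury():
    _, provider = run_judge("grounded")
    provider.analyze.assert_awaited_once()


test_ungrounded_skips_the_llm()
test_grounded_still_reaches_the_jury()
```

The second test matters as much as the first: it shows the mock would have recorded an await if the short-circuit were missing, so the not-awaited assertion isn't passing by accident.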

How it got built (the loop that caught the bug)

Same harness rhythm as Tutorial 1: spec, then a task-by-task plan, then a fresh subagent per task, then a separate reviewer. The grounding, consensus, and orchestrator were each built and tested in isolation — and every isolated test passed.

The substring-confidence bug survived every one of those per-task reviews. It only died when a wide capstone review traced the matcher against adversarial inputs the unit tests never tried. The takeaway that outlives this repo: unit tests verify that a function does what you thought; an adversarial review checks whether what you thought was right. For a verification system, that second pass isn’t optional. The thing whose entire job is “don’t be fooled” is exactly the thing you must try hardest to fool.

What you’ve got, and what’s next

You now have an audit layer that grounds every cited number against reality for free, juries the judgment calls across multiple models, and refuses to spend tokens on claims it already disproved. The AI can no longer hand a DBA a confident lie.

Tutorial 3 gives the tool a memory: findings that remember themselves across audits (so the model stops re-diagnosing the same issue every run), and incremental distillation that skips re-analyzing checks whose data hasn’t budged. Check out tutorial-3-memory-complete when you’re ready to make it remember.