Goldfish

essay

I Asked an LLM to Diagnose My Database. Then I Asked Another LLM if It Was Lying.

The origin story of the judge: the model invented a shared_buffers value. The fix was not a bigger model — it was cheap grounding plus a second-opinion jury.

Matt Yonkovit · 5 min read

So I built a tool that points an LLM at a PostgreSQL database and asks it what’s wrong. Hundred-point inspection, scores every category, writes up the findings. Very 2026. Very “AI-powered.”

And then I watched it confidently tell me my shared_buffers was set to 256MB and that I should bump it.

It was 32MB. The model made the number up.

Nobody selling you an AI tool says this out loud: the model doesn’t know your database. It knows what databases usually look like. So when it’s not sure, it doesn’t say “I’m not sure.” It fills in a plausible number and says it with the same confidence it uses for the things it actually got right. And in a tool whose entire job is telling a DBA what to change in production, that’s not a quirk. That’s the whole product failing quietly.

I’ve been doing databases for 20 years. I’ve cleaned up after enough “trust me” config changes to know that a tool which is confidently wrong 10% of the time is worse than no tool at all, because you stop checking. So I had a rule for this project: the AI is not allowed to tell you something is wrong unless it can prove it. Let me walk you through how that actually works, because the answer turned out to be more interesting than “use a better model.”

You can’t verify prose

First problem: the original version had the model write its analysis as paragraphs. Nice, readable paragraphs. Completely unverifiable paragraphs.

You cannot fact-check a vibe. If the model says “your memory settings look suboptimal for this workload,” what do you even check that against? Nothing. There’s no claim, no number, no anchor. It’s mush, and mush is unfalsifiable by design.

So step one was making the model commit. Instead of prose, it now has to emit a structured finding: a specific claim, the specific metric values it’s citing, a severity, a recommendation. “shared_buffers is 32MB, too low for this workload” with {"shared_buffers": "32MB"} attached. Now there’s a number on the table. And a number on the table is a number you can check.

The cheap check that costs nothing

The part I’m weirdly proud of: once the model cites a number, you don’t need AI to check it. You already collected the real value. So you just… compare them.

The model says shared_buffers is 256MB. Your audit measured 32MB. Those don’t match. The finding is flagged as ungrounded (fabricated), and it never reaches a human. Total cost: a string comparison. Zero tokens. Microseconds. The most expensive hallucination in the report gets caught by the cheapest possible code.

That’s the first lesson, and it generalizes way past databases: before you spend money asking a model to check something, see if a dumb deterministic function can check it for free. The cheap filter does the bulk-clearing. You’d be amazed how much garbage a thirty-line function kills before the expensive machinery ever wakes up.

(There’s a great bug story here. The first version of that comparison used substring matching, and it happily decided that a cited “2” matched a real “200” — because “2” is inside “200.” So the anti-hallucination layer was manufacturing false confidence. The fix was numeric-aware comparison, and the adversarial test cases — “2” versus “200” — are now permanent residents of the test suite. The thing whose job is to not be fooled is exactly the thing you have to try hardest to fool. Write the mean tests.)

The second opinion

Cheap grounding catches fabricated numbers. But plenty of findings are judgment calls — “autovacuum is too conservative for this write pattern” — where there’s no single number to check against. You can’t ground a judgment. So for those, I borrowed the oldest trick in journalism: get a second source.

The surviving findings go to a panel. Configure two LLM providers — say GPT-4o and Claude — and each one gets the claim plus the actual evidence and votes: true positive, false positive, or uncertain. And the rule that matters: a finding only gets killed if two of them agree it’s bogus. One model’s “nah, that’s fine” can’t bury anything on its own.

That asymmetry is deliberate, and it’s a product decision, not a technical one. A false alarm costs a DBA thirty seconds of eye-rolling. A suppressed real problem costs them a 3 a.m. page when the thing they were never warned about falls over. So the system is biased to keep findings. When in doubt, it shows you the finding and tells you it’s uncertain. It does not quietly decide on its own that you didn’t need to know.

What this actually buys you

Put it together and the flow is: the model proposes findings → a free deterministic check kills the fabricated numbers → the survivors go to a multi-model jury → only consensus suppresses. The cheap check does most of the work; the expensive jury only convenes for the genuinely ambiguous calls. (A finding caught as fabricated never even reaches the jury — why pay three models to vote on a claim you already disproved with a string compare?)

What does that buy you in practice? When the tool tells you something’s wrong, it has either checked the number against your real data or gotten two independent models to agree. It’s not a confident guess anymore. It’s a verified claim or an explicitly-flagged uncertain one. The DBA gets to trust the red items and skim the gray ones, instead of fact-checking everything by hand, which, let’s be honest, means they’d fact-check nothing.

The uncomfortable takeaway

The reason I’m writing this up is that “pipe data to an LLM and print the output” is the default architecture for approximately every AI feature shipping right now. And it’s the wrong one for anything where being wrong has a cost. The fix isn’t a bigger model; a bigger model just lies more fluently. The fix is treating the model like what it is: a brilliant, fast, occasionally-confabulating component that you wire up with verification, the same way you’d wire up any other dependency you don’t fully trust.

Ground what you can ground for free. Get a second opinion on the rest. Bias toward showing the human the finding. That’s the whole split — let the cheap engine state the facts and the expensive one weigh them — and it earns its own post. None of that is AI magic. It’s just… engineering. Which, it turns out, is still the job.

The whole thing is open source, and there’s a build-along tutorial that walks the judge code line by line (including that substring bug in its natural habitat). Go make your AI prove it isn’t lying.

— The HOSS