How I Built a 100-Check Postgres Tool With an AI Harness (and What the Reviews Caught)

Most of what I write about AI is “here’s a pattern for putting a model inside your app.” This one’s different. This is about building an app with an AI agent driving — and specifically, about the part everyone skips when they talk about AI coding: how you keep it from confidently shipping garbage.

The project is a PostgreSQL health-check tool. Hundred checks, scores your database, and (the interesting part) uses LLMs to analyze the results without hallucinating numbers at you. I built it with Claude Code. Not “vibe-coded in an afternoon.” Built, with a process. And the process is the actual story, because it’s the difference between AI that produces a demo and AI that produces something you’d run against production.

The loop

Here’s the deal: an agent will happily write you a thousand lines that look right and aren’t. The defense isn’t a smarter agent. It’s structure around it. The loop I used, every feature:

Brainstorm into a spec. Before any code, nail down what “done” means and which approach wins. For this tool the pivotal call was making findings structured objects instead of prose — and writing that decision down so it couldn’t quietly drift later.
Spec into a plan. Turn the spec into a task-by-task plan with the exact failing tests and code for each step. Not “add validation” — the actual test, the actual function.
A fresh agent per task. Each task gets its own subagent with a clean context: write the failing test, watch it fail, implement, watch it pass, commit. Then I throw that context away.
A separate agent reviews. A different subagent reads the diff against the spec and for quality, not trusting the implementer’s “it’s done.”

Why fresh agents? Context hygiene. The thing that makes agents go sideways is a polluted context: 40 files of half-remembered state bleeding into the next edit. By giving each task an isolated agent and keeping my context as the clean coordinator, I could reason about the whole build without drowning in any one piece. I’d get back a six-line “here’s what I did, tests pass” and verify it in one command.

That’s the meta-lesson that maps onto the in-app patterns, by the way. Same move both places: isolate the unit, define its interface, verify it independently. Good architecture and good agent-wrangling are the same discipline wearing different hats.

The reviews earned their keep (proof, not vibes)

The part that’ll convince you the review step isn’t ceremony: These are real bugs a separate review caught that every individual task’s tests had passed:

The verifier that manufactured false confidence. The anti-hallucination layer compares the model’s cited numbers against reality. The first version compared with substring matching — and decided a cited “2” matched a real “200,” because “2” is inside “200.” So the thing whose entire job was catching fabricated numbers was certifying them. No test failed. It would have shipped, looking more trustworthy while being less. The review traced the adversarial case; the fix went in with mean tests that live in the suite forever.

The hash that crashes on hardened servers. A finding’s ID used SHA-1. Fine — except on a FIPS-enabled host (exactly the locked-down box a DBA runs Postgres on), bare SHA-1 raises. The fix was one flag (usedforsecurity=False) and a comment so nobody “upgrades” it later. A test would never have caught it; it takes someone thinking about where this actually runs.

The feature that was invisible to half the users. The whole findings system worked… in the CLI. The web report path built its data separately and silently never loaded the findings. Every per-task test was green because each unit was correct in isolation. Only a wide review looking across the read paths caught that an entire surface dropped the feature.

A bug in my own plan. The plan specified a SQL query for “find the previous audit” that broke when two audits shared a timestamp. The implementer agent caught it and fixed it — a good reminder that the agent executing your plan is also a reviewer of your plan, if you let it be.

An overstated claim. The cost-optimization saved ~64% of LLM calls on an unchanged re-run. The review noticed one layer wasn’t actually gated, so the honest number was “64% on two of three layers,” not “cut your bill.” We wrote down the boundary instead of rounding up.

Notice the pattern in those five. The per-task reviews, narrow and focused on one unit, caught nothing, because each unit was correct. The bugs lived in the seams: between what a model returns and what code trusts, between the CLI path and the web path, between the plan and reality, between the claim and the mechanism. Per-task reviews verify that a unit does what you intended. A wide review verifies that what you intended was right, and that the units actually connect. Both layers. Every time.

The receipts are in the repo

The thing I’m proudest of isn’t the code — it’s that you can see how it was built. Every phase has a build log: the real prompts, the real decisions, and the real bugs the reviews caught, written down as they happened. Not a cleaned-up retrospective. The actual trail. There are git tags you can check out to stand at each stage, and build-along tutorials that walk the code line by line.

That’s deliberate. “AI-assisted development” is mostly discussed as either magic or fraud, and it’s neither. It’s a discipline with a specific shape: spec before code, plans with real tests, isolated execution, and adversarial review aimed exactly at the seams. When you run that loop, an agent can build something solid. When you skip it and just prompt-and-pray, you get a demo that falls over the first time someone trusts it.

What to take from this

Two takeaways, one for each hat.

If you’re using AI to build: the agent is a fast, capable, occasionally-wrong contributor. Treat it like one. Specs, tests, isolated tasks, and a review step that hunts the seams — that’s not bureaucracy, that’s the thing standing between “impressive demo” and “I’d run this in prod.”

If you’re putting AI in your app: same discipline, different target. Ground what you can ground for free, get a second opinion on the rest, remember what you learned, and never trust one model’s word as final. (The division-of-labor logic behind those four moves is its own essay, here.) The tool in this post does all four, and it’s open source so you can steal every pattern.

Go read a build log. Watch the bugs get caught in real time. It’s more honest than any launch post — me included.

— The HOSS