Tutorial 0 — We Vibe-Coded a Database Doctor. Does It Actually Work?
We vibe-coded a 100-point Postgres health check. Does it actually work? The quickstart, the two engines, and an honest gut-check of where the naive version fails.
The problem: We built an AI tool that checks a database’s health. It produces a confident report. But “confident” and “correct” are not the same word, and we have no idea which one we shipped.
By the end: You’ll have the tool running against a real (broken) database, you’ll understand the two engines inside it (deterministic heuristics and a probabilistic LLM), and you’ll see exactly where the naive version breaks. That’s the on-ramp for everything in Tutorials 1–3.
This is the start of a four-part series. Parts 1–3 evolve the tool; this part explains what the tool is and why it needs evolving. Don’t skip it. The rest won’t land without it.
What we’re building
A “100-point inspection” for PostgreSQL. Point it at a running database, it collects metrics over a window, scores the database across ~100 checks in ten categories (memory, queries, vacuum, security, replication, and so on), and hands you a graded report card with recommendations. Think of it as a yearly physical for your Postgres, except it takes two minutes and doesn’t make you wait in a paper gown.
That’s the goal. Now let’s get it running, because this whole series is hands-on.
Quickstart — what you actually need
Three things, and only one of them is fussy:
1. Python 3.11+. Clone the repo and install it in a virtualenv:
git clone git@github.com:EnterpriseDB/clownfish-healthcheck.git
cd clownfish-healthcheck
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
2. Docker (for the practice patient). You don’t want to learn this against your production database. The repo ships a deliberately broken Postgres. Grab the sample data and start it:
docker compose -f docker-compose.demo.yml up -d
3. An LLM API key (optional, but this is the interesting half). Drop one provider’s settings in a .env file:
# .env (pick ONE provider)
llm_provider=anthropic # or openai, gemini, ollama
llm_api_key=YOUR_KEY_HERE # not needed for ollama (it's local)
llm_model=claude-sonnet-4-20250514
One thing worth knowing up front: the tool works without an LLM at all. No key, and you still get the full heuristic scorecard. You just don’t get the AI analysis on top. Hold onto that fact, because it’s the first clue about how the two engines divide the work.
Run it:
pg-healthcheck audit --host localhost --port 5432 --db demo --user postgres --password demo_password --quick
You get a report. A score, a grade, ten categories, a pile of findings. It works! We vibe-coded a database doctor in a weekend and it spits out a real report card. Pop the champagne.
Now let’s ruin the mood.
The two engines (this is the whole point)
Look closely at that report and you’ll notice it was produced by two completely different kinds of machinery, and confusing them is the original sin of AI tooling.
Engine one: heuristics. The 100 checks are deterministic rules. “shared_buffers under 25% of RAM? Flag it.” “An extra superuser role? Flag it.” These are encoded expertise: twenty years of “things a DBA yells about” turned into if statements. They are free, instant, perfectly reproducible, auditable, and they physically cannot lie. You never wonder why a heuristic fired; you can read the rule.
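To make "encoded expertise turned into if statements" concrete, here is a minimal sketch of what one such rule might look like. The `Finding` class, the `check_shared_buffers` function, the check ID, and the 8 GB RAM figure are all hypothetical names and values chosen for illustration; the real tool's internals may look quite different. Only the 25% threshold comes from the rule quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Result of one deterministic check (hypothetical shape)."""
    check_id: str
    passed: bool
    detail: str


def check_shared_buffers(shared_buffers_mb: int, total_ram_mb: int) -> Finding:
    """Flag shared_buffers when it is below 25% of total RAM."""
    threshold_mb = total_ram_mb * 0.25
    passed = shared_buffers_mb >= threshold_mb
    detail = (
        f"shared_buffers={shared_buffers_mb}MB, "
        f"threshold={threshold_mb:.0f}MB (25% of {total_ram_mb}MB RAM)"
    )
    return Finding("memory.shared_buffers", passed, detail)


# Assumed inputs: 32MB of shared_buffers on a machine with 8GB of RAM.
finding = check_shared_buffers(shared_buffers_mb=32, total_ram_mb=8192)
print(finding.passed, "|", finding.detail)
```

Same inputs, same answer, every time, and the reason it fired is sitting right there in `detail`.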
Engine two: the LLM. On top of the checks, the tool sends results to a language model to analyze, prioritize, and explain — to turn “23 checks scored below 7” into “your memory config and connection limits are fighting each other; fix the memory one first.” That’s synthesis a rule can’t do, written in language a human will actually read.
So — do both have a place? Yes. And here’s the take this whole series is built on:
It’s not heuristics versus LLMs. It’s a division of labor. Heuristics own the facts — anything you can encode deterministically, you should, because it’s cheaper and it can’t hallucinate. The LLM owns the judgment and the narrative — the contextual synthesis that’s genuinely hard to encode. The mistake is using the expensive, fallible engine for a job the cheap, reliable one should own.
Comparing two numbers is a job for ==, not a GPU. You bring in the model to weigh tradeoffs and explain them, the part that actually needs a brain. Use each for what it’s good at.
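One way to picture that division of labor: the facts are computed in code first, and the model only ever sees findings that are already settled. The sketch below is an assumption about how such a prompt could be assembled, not the tool's actual prompt. The `build_prompt` name, the dictionary keys, and the 0–10 score scale are hypothetical; the "scored below 7" cutoff echoes the example in the text.

```python
def build_prompt(findings: list[dict]) -> str:
    """Build an LLM prompt from already-computed findings (hypothetical format)."""
    # Heuristics decided what is wrong; the model is only asked to prioritize.
    failing = [f for f in findings if f["score"] < 7]
    lines = [f"- {f['check']}: {f['fact']} (score {f['score']}/10)" for f in failing]
    return (
        "These facts were computed by deterministic checks. "
        "Do not restate or alter the numbers.\n"
        "Rank the issues by priority and explain how they interact.\n"
        + "\n".join(lines)
    )


findings = [
    {"check": "memory.shared_buffers", "fact": "shared_buffers is 32MB", "score": 3},
    {"check": "security.superusers", "fact": "1 superuser role", "score": 10},
]
prompt = build_prompt(findings)
print(prompt)
```

The passing check never reaches the model at all, so there is nothing for it to get wrong and nothing to pay for.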
Does it really work? (no)
The vibe-coded version commits exactly that sin: it hands the LLM the facts too. It dumps raw metrics at the model and says “analyze this,” in prose. And a probabilistic engine asked to recite facts will, sooner or later, make one up. Watch:
- It hallucinates. I’ve watched this version confidently report shared_buffers at 256MB and recommend lowering it. The real value was 32MB. The model pattern-matched a plausible number and stated it with total confidence, in a tool whose whole job is telling a DBA what to change in prod.
- It has amnesia. Run the audit again tomorrow and it re-discovers every problem from scratch and re-explains it in fresh words. It can’t tell you what changed, which is the only thing a returning user cares about.
- It’s not free. Every run pays full token freight to re-analyze checks whose data didn’t move an inch since last time.
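The hallucination in the first bullet is exactly the kind of thing a cheap rule can catch: compare the number the model cited with the number the collector measured. Here is a rough sketch of that comparison, assuming both values arrive as size strings. `parse_size_mb` and `claim_matches` are made-up helper names, and this is not necessarily how the later tutorials implement it.

```python
import re

# Conversion factors to megabytes for the units this sketch understands.
_UNITS_TO_MB = {"kb": 1 / 1024, "mb": 1, "gb": 1024}


def parse_size_mb(text: str) -> float:
    """Convert a size such as '256MB' or '1 GB' to megabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(kb|mb|gb)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS_TO_MB[unit.lower()]


def claim_matches(claimed: str, measured: str, tolerance_mb: float = 0.0) -> bool:
    """Return True when the model's cited value agrees with the collected metric."""
    return abs(parse_size_mb(claimed) - parse_size_mb(measured)) <= tolerance_mb


# The example from the text: the model said 256MB, the database said 32MB.
print(claim_matches("256MB", "32MB"))  # False
print(claim_matches("32 MB", "32MB"))  # True
```

No model call, no tokens, and the made-up 256MB gets caught before it reaches a DBA.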
So: it runs. It produces output. But you cannot trust the output, it doesn’t remember anything, and it costs more than it should. By the only definition that matters (would a DBA bet a production change on it?), it does not actually work. It’s a confident intern with no memory and a habit of making up numbers.
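The amnesia and the cost problem share a root: nothing records what a check looked at last time. One possible shape for a fix, sketched here under assumptions, is to fingerprint each check's input data and only re-analyze the checks whose fingerprint moved. The function names, check IDs, and metric keys below are hypothetical and are not taken from the repo.

```python
import hashlib
import json


def fingerprint(metrics: dict) -> str:
    """Stable hash of one check's input data; key order does not affect it."""
    canonical = json.dumps(metrics, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def checks_needing_analysis(previous: dict[str, str], current: dict[str, dict]) -> list[str]:
    """Return IDs of checks whose data is new or changed since the previous run."""
    return sorted(
        check_id
        for check_id, metrics in current.items()
        if previous.get(check_id) != fingerprint(metrics)
    )


yesterday = {
    "memory.shared_buffers": {"shared_buffers_mb": 32, "total_ram_mb": 8192},
    "security.superusers": {"superuser_count": 2},
}
today = {
    "memory.shared_buffers": {"total_ram_mb": 8192, "shared_buffers_mb": 32},  # same data
    "security.superusers": {"superuser_count": 3},  # changed since yesterday
}

# Fingerprints saved from the previous run, compared against today's metrics.
stored = {check_id: fingerprint(m) for check_id, m in yesterday.items()}
print(checks_needing_analysis(stored, today))  # ['security.superusers']
```

The list that comes back doubles as the answer to "what changed?", and everything not on it can reuse yesterday's analysis instead of paying for a new one.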
That’s not a failure of the idea. It’s a failure of the division of labor. We gave the LLM jobs that belong to heuristics. Fixing that is the rest of this series.
How it evolves
Each crack in the vibe-coded version becomes a tutorial. This is the arc:
| # | Tutorial | The problem it solves |
|---|---|---|
| 1 | Foundations | The LLM’s output is unverifiable prose. Make it emit checkable, structured claims. |
| 2 | The Judge ⭐ | The LLM hallucinates facts. Verify its numbers with a free heuristic, then jury the judgment calls across models. This one is the division-of-labor thesis in code. |
| 3 | Memory & Cost | It forgets and overspends. Make findings remember themselves, and stop paying to re-analyze unchanged data. |
Notice the shape: every fix is about pulling misassigned work back to the right engine. Tutorial 2 is the clearest case — its Part 1 is a pure heuristic (does the cited number match reality?) and its Part 2 is the LLM (only for what heuristics can’t settle). The evolution isn’t “add more AI.” It’s “use AI for less, and verify the rest.”
Ready? Check out where the build begins and head to Tutorial 1:
git checkout tutorial-1-start # the vibe-coded v0, before the evolution
The doctor’s about to go to medical school.