Goldfish
Trustworthy AI · intermediate · 51 min

Tutorial 1 — Foundations: Structured Output + a Multi-Provider LLM Harness

Replace free-form LLM prose with structured Finding objects over a multi-provider harness — the checkable spine the rest of the build depends on.

Matt Yonkovit

Build-along — part 1 of the evolution. New here? Start with Tutorial 0: it explains what we’re building, the heuristics-vs-LLM split, and why the vibe-coded version doesn’t really work. This part begins the fix.

The problem: the v0 tool’s LLM analysis is free-form prose. You can’t fact-check a vibe: there’s no claim, no cited number, nothing to verify against.

By the end: the analyzer emits structured Finding objects (a claim plus the exact metrics it cites) instead of prose. That’s the checkable spine Tutorials 2 and 3 build on.

Prereqs: the Tutorial 0 quickstart (Python 3.11+, the repo, optionally an LLM key). Starts from tutorial-1-start; finished state is tutorial-1-foundations-complete. The prompts and bugs below are the real ones, not reenactments.

Get something to run against

Building the code is half of it. To watch it actually catch problems, point it at a database that has problems. Grab the sample dataset — a Postgres deliberately misconfigured for exactly this:

# Download ../sample/pg-healthcheck-sample-data.zip, unzip, then:
docker compose -f docker-compose.demo.yml up -d
pg-healthcheck audit --host localhost --port 5432 --db demo --user postgres --password demo_password --quick

What’s intentionally broken (and why): ../sample/README.md. What the output looks like, including the judge marking findings grounded vs ungrounded: ../sample/EXAMPLE-REPORT.md.

You’ve shipped this app before

You’ve seen this app a hundred times. It collects some data, ships it off to an LLM, gets back a wall of prose, and prints it. “AI-powered.” Ship it.

I’ve run databases for 20 years, and I’ll tell you exactly what happens next: the model says your shared_buffers is 256MB and recommends bumping it. Except it’s actually 32MB. The model made the number up. Nobody notices, because prose is unfalsifiable. There’s nothing to check it against. You’ve built a very expensive random-recommendation generator.

The fix isn’t a better model. The fix is structure. Before you can verify what an AI tells you, the AI has to make claims you can check: a specific assertion, tied to a specific number, tied to a specific source. Free-flowing prose can’t be judged. A structured finding can.

That’s what we build here. Two pieces:

  1. A multi-provider LLM interface so you’re not married to one vendor.
  2. A Finding object — the structured claim that becomes the spine of everything later.

No judge yet. No memory yet. Just the foundation that makes those possible. (Walk before you run. The HOSS has watched too many teams sprint straight into a wall.)

Decision #1: keep the provider interface dumb

First instinct when you want structured JSON out of an LLM: reach for the provider’s fancy “JSON mode” or tool-calling API, and bake it into your provider abstraction. Resist it.

The whole interface (pg_healthcheck/llm/base.py) is this short:

from abc import ABC, abstractmethod

class LLMProvider(ABC):
    @abstractmethod
    async def analyze(self, system_prompt: str, user_prompt: str) -> str:
        """Send a prompt to the LLM and return the response text."""
        ...

That’s it. Text in, text out. OpenAI, Anthropic, Gemini, Ollama. They all implement that one method and nothing else. A factory picks the right one from config:

class LLMProviderFactory:
    @staticmethod
    def create(provider, api_key=None, model=None, base_url=None) -> LLMProvider | None:
        if provider == "openai":
            from pg_healthcheck.llm.openai_provider import OpenAIProvider
            return OpenAIProvider(api_key=api_key, model=model or "gpt-4o", base_url=base_url)
        elif provider == "anthropic":
            ...

Why so minimal? Because every provider speaks text, but they all disagree about everything else. JSON mode, tool schemas, function calling. Each one’s a little different, and three of the four change it every other release. If your abstraction bakes in one vendor’s structured-output gimmick, you’ve got a leaky abstraction that breaks on the next SDK bump.

So we get structure a different way: we ask for JSON in the prompt, and we parse it ourselves in one place. The parsing becomes a single, testable seam instead of four vendor-specific paths. (More on why that seam is the most important code in this whole tutorial in a minute.)
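A minimal sketch of that seam, under stated assumptions: CannedProvider, ask_for_findings, the check_id value, and the wording of JSON_FORMAT_INSTRUCTIONS are all hypothetical stand-ins (the repo’s real instruction text isn’t shown here). It only illustrates the shape: the provider stays text-in, text-out, and the JSON handling lives in exactly one function.

```python
import asyncio
import json
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Text in, text out: the same one-method interface shown above."""

    @abstractmethod
    async def analyze(self, system_prompt: str, user_prompt: str) -> str:
        ...


class CannedProvider(LLMProvider):
    """Hypothetical stand-in for a vendor client; it returns fixed text."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.last_system_prompt = ""

    async def analyze(self, system_prompt: str, user_prompt: str) -> str:
        self.last_system_prompt = system_prompt  # kept so we can inspect it
        return self.reply


# Assumed wording; the repo's actual instruction text may differ.
JSON_FORMAT_INSTRUCTIONS = (
    "\n\nRespond with ONLY a JSON array. Each item must have: "
    "check_id, claim, cited_metrics, severity, recommendation."
)


async def ask_for_findings(provider: LLMProvider, category_prompt: str, telemetry: str) -> list:
    """The single seam: append the JSON instruction, call the provider, parse the text."""
    system_prompt = category_prompt + JSON_FORMAT_INSTRUCTIONS
    text = await provider.analyze(system_prompt, telemetry)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # simplistic on purpose; a hardened parser would try harder
    return data if isinstance(data, list) else []


reply = (
    '[{"check_id": "memory.shared_buffers", '
    '"claim": "shared_buffers is 32MB, too low", '
    '"cited_metrics": {"shared_buffers": "32MB"}}]'
)
provider = CannedProvider(reply)
findings = asyncio.run(
    ask_for_findings(provider, "You audit PostgreSQL memory settings.", "shared_buffers=32MB")
)
print(findings[0]["cited_metrics"])  # {'shared_buffers': '32MB'}
```

Swapping CannedProvider for a real vendor class changes nothing in ask_for_findings, which is the point of keeping the interface this small.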

The Claude prompt that locked this in. When I scoped this with Claude Code, the design note was explicit: “Providers stay text-based — analyze(system, user) -> str is left untouched. Structured output is achieved by appending a JSON-schema instruction to the system prompt and parsing the model’s text in the analyzer. This avoids a provider-interface change and makes the parsing robustness a single testable seam.” That one sentence saved a provider refactor.

Decision #2: the Finding is the spine

Now the structured claim. The model (pg_healthcheck/models.py), trimmed to what matters for this tutorial:

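import hashlib
import re
from typing import Any

from pydantic import BaseModel, model_validator

# Category and Severity are defined elsewhere in the repo (not shown here).

class Finding(BaseModel):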
    finding_id: str = ""
    audit_id: str = ""
    check_id: str
    category: Category
    claim: str
    cited_metrics: dict[str, Any] = {}        # {"shared_buffers": "32MB"} — the values it cites
    severity: Severity = Severity.MEDIUM
    recommendation: str = ""
    source: str = "llm"

    @staticmethod
    def make_id(check_id: str, claim: str) -> str:
        key = re.sub(r"\s+", " ", claim.strip().lower())[:120]
        # SHA-1 as a content key, NOT for security — usedforsecurity=False avoids FIPS failures.
        return hashlib.sha1(f"{check_id}|{key}".encode(), usedforsecurity=False).hexdigest()[:16]

    @model_validator(mode="after")
    def _ensure_id(self) -> "Finding":
        if not self.finding_id:
            self.finding_id = self.make_id(self.check_id, self.claim)
        return self

Look at cited_metrics. That’s the whole game. When the model says “shared_buffers is 32MB, too low,” it has to put {"shared_buffers": "32MB"} right there in the finding. Now there’s a number we can check against reality. We’re not doing the checking yet. But we made it possible. You can’t verify a claim that never committed to a number.
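To make “a number we can check” concrete, here is a toy comparison. It is not the grounding logic from Tutorial 2 (which isn’t shown here); mismatched_metrics and the sample settings are hypothetical, and the sketch assumes cited values and real values are directly comparable strings.

```python
def mismatched_metrics(cited: dict, actual: dict) -> dict:
    """Return {name: (cited_value, actual_value)} for every cited value that differs."""
    return {
        name: (value, actual.get(name))
        for name, value in cited.items()
        if actual.get(name) != value
    }


# Hypothetical telemetry, echoing the shared_buffers story from earlier.
actual_settings = {"shared_buffers": "32MB", "work_mem": "4MB"}

honest = {"shared_buffers": "32MB"}
made_up = {"shared_buffers": "256MB"}

print(mismatched_metrics(honest, actual_settings))   # {}
print(mismatched_metrics(made_up, actual_settings))  # {'shared_buffers': ('256MB', '32MB')}
```

With prose there is nothing to pass in as the first argument. With a Finding, there is.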

And make_id — the finding’s identity is a hash of (check_id, normalized claim). Same problem on the same check across two audits? Same id. Useless today, but in Tutorial 3 it’s the hook the whole memory system hangs on: “have we seen this exact finding before?” Build the spine with the joints already in place.
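A quick way to see the identity behavior, with make_id lifted out as a plain function (same body as the method above). The check_id strings are made up for illustration.

```python
import hashlib
import re


def make_id(check_id: str, claim: str) -> str:
    # Normalize: trim, lowercase, collapse whitespace runs, cap at 120 characters.
    key = re.sub(r"\s+", " ", claim.strip().lower())[:120]
    # SHA-1 as a content key, not for security.
    return hashlib.sha1(f"{check_id}|{key}".encode(), usedforsecurity=False).hexdigest()[:16]


first = make_id("memory.shared_buffers", "shared_buffers is 32MB, too low")
# Same claim with different casing and whitespace: normalization makes it identical.
second = make_id("memory.shared_buffers", "  Shared_Buffers is 32MB,\n too LOW ")
# Same claim text under a different check: a different finding.
other = make_id("memory.work_mem", "shared_buffers is 32MB, too low")

print(first == second)  # True
print(first == other)   # False
print(len(first))       # 16
```

Two consequences worth knowing: a reworded claim hashes to a different id, and two claims that only differ after the first 120 normalized characters hash to the same one.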

(Yes, there are extra fields on the real model — verdict, ground_status, first_seen_audit. Those are stubs with safe defaults that Tutorials 2 and 3 fill in. One model, populated in layers.)

The build caught a real one here. The first version used a plain hashlib.sha1(...). The code review flagged it: on a FIPS-enabled host (exactly the kind of locked-down box a DBA runs Postgres on), a bare SHA-1 call can raise, depending on how the host’s crypto policy is configured. The fix is usedforsecurity=False (it’s a content key, not a crypto hash) plus a comment so nobody “helpfully upgrades” it later and breaks every finding’s id. Small thing. Would’ve been a support ticket from your most security-conscious user. This is exactly the kind of thing the review step exists to catch.

Decision #3: parsing untrusted output is the dangerous part

This is where most “ask for JSON” tutorials wave their hands. They show you json.loads(response) and move on. In production that throws within the hour, because models wrap JSON in ```json fences, add a chatty preamble, or just… return something else.

The analyzer asks each category for a JSON array of findings, then runs the response through a parser whose entire job is to never raise and never lie:

@staticmethod
def _parse_findings_json(text: str) -> list[dict]:
    if not text:
        return []
    t = text.strip()

    # 1. Try the whole thing as JSON first (clean array, or reject an envelope object).
    try:
        data = json.loads(t)
        return data if isinstance(data, list) else []
    except Exception:
        pass

    # 2. Strip markdown fences LINE-BY-LINE.
    lines = t.splitlines()
    if lines and lines[0].strip().startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]
    t = "\n".join(lines).strip()
    try:
        data = json.loads(t)
        return data if isinstance(data, list) else []
    except Exception:
        pass

    # 3. Last resort: grab the outermost [...] from prose-wrapped output.
    start, end = t.find("["), t.rfind("]")
    if start == -1 or end == -1 or end < start:
        return []
    try:
        data = json.loads(t[start:end + 1])
    except Exception:
        return []
    return data if isinstance(data, list) else []

Three stages, each more desperate than the last, every one wrapped so a bad response becomes [] instead of a 500. The payoff: a tool that degrades to “no findings this category” when the model burps, instead of crashing the whole audit.
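To poke at the three stages outside the repo, here is a compact standalone restatement. The helper _as_list and the sample inputs are mine, not the project’s; the logic is intended to behave the same as the method above. The fence marker is built with string multiplication only to keep literal triple backticks out of the listing.

```python
import json

FENCE = "`" * 3  # a markdown code fence marker


def _as_list(text: str):
    """Parsed list, [] for valid JSON that is not a list, None if not JSON at all."""
    try:
        data = json.loads(text)
    except Exception:
        return None
    return data if isinstance(data, list) else []


def parse_findings_json(text: str) -> list:
    if not text:
        return []
    t = text.strip()

    # Stage 1: the whole response is JSON.
    parsed = _as_list(t)
    if parsed is not None:
        return parsed

    # Stage 2: drop an opening and a closing fence line, then retry.
    lines = t.splitlines()
    if lines and lines[0].strip().startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    t = "\n".join(lines).strip()
    parsed = _as_list(t)
    if parsed is not None:
        return parsed

    # Stage 3: outermost [...] inside surrounding prose.
    start, end = t.find("["), t.rfind("]")
    if start == -1 or end == -1 or end < start:
        return []
    return _as_list(t[start:end + 1]) or []


item = '{"check_id": "c1", "claim": "x"}'
samples = {
    "clean": f"[{item}]",
    "fenced": f"{FENCE}json\n[{item}]\n{FENCE}",
    "prose": f"Sure! Here are the findings: [{item}] Hope that helps.",
    "envelope": '{"findings": []}',
    "garbage": "not json at all",
    "empty": "",
}
for name, raw in samples.items():
    print(f"{name}: {len(parse_findings_json(raw))} finding(s)")
```

The first three samples each yield one finding; the envelope object, the garbage string, and the empty string all come back as an empty list rather than an exception.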

The bug the review caught here is my favorite of the whole project. The first version stripped fences with t.strip("`"). Looks fine! Except str.strip(chars) strips a character set, not a fence token, so a finding whose claim mentioned a SQL `LIKE` would have its trailing backtick eaten, corrupting the JSON. No test failed. Nothing crashed. It would have just silently mangled findings that happened to quote SQL. That’s the worst kind of bug: invisible. The line-by-line fence strip above is the fix.
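The character-set behavior is easy to reproduce in isolation. The project’s exact failing input isn’t shown, so the strings below are illustrative; strip_fences is a hypothetical helper that mirrors the line-by-line approach from the parser.

```python
FENCE = "`" * 3  # a markdown code fence marker


def strip_fences(text: str) -> str:
    """Remove a fence only when it occupies the first or last line."""
    lines = text.strip().splitlines()
    if lines and lines[0].strip().startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines).strip()


# A value with no fence at all, ending in an inline backtick.
unfenced = "filter rows with `LIKE`"
print(unfenced.strip("`"))     # filter rows with `LIKE   <- trailing backtick eaten
print(strip_fences(unfenced))  # filter rows with `LIKE`  <- untouched

# A genuinely fenced payload.
fenced = f"{FENCE}json\n[1, 2]\n{FENCE}"
print(repr(fenced.strip("`")))      # 'json\n[1, 2]\n'  <- language tag left behind
print(repr(strip_fences(fenced)))   # '[1, 2]'
```

str.strip removes any run of the listed characters from both ends, whether or not they form a fence, which is why it fails in both directions here: it damages text that had no fence and leaves the language tag behind on text that did.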

Wiring it together

The analyzer loops the audit’s categories, builds system_prompt = CATEGORY_PROMPT + JSON_FORMAT_INSTRUCTIONS, calls the one provider.analyze(...), parses, and constructs Findings — skipping any item missing a check_id or claim, mapping severity strings to the enum, defaulting to MEDIUM on junk. No provider configured? Returns []. One category’s call throws? Log it, skip it, keep going. The engine then stores the findings and renders a ## Findings section in the report.
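A rough sketch of those rules, not the repo’s analyzer. The Severity members other than MEDIUM, the function names, the category names, and the plain-dict findings are assumptions for illustration; the real code builds Finding models.

```python
import asyncio
import logging
from enum import Enum

log = logging.getLogger("analyzer_sketch")


class Severity(Enum):
    # Only MEDIUM is confirmed by the tutorial; the other members are assumed.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def to_severity(raw) -> Severity:
    """Map a model-supplied string to the enum, defaulting to MEDIUM on junk."""
    try:
        return Severity(str(raw).strip().lower())
    except ValueError:
        return Severity.MEDIUM


def build_findings(items: list) -> list[dict]:
    """Keep only items that carry both a check_id and a claim."""
    findings = []
    for item in items:
        if not isinstance(item, dict):
            continue
        if not item.get("check_id") or not item.get("claim"):
            continue
        findings.append({
            "check_id": item["check_id"],
            "claim": item["claim"],
            "cited_metrics": item.get("cited_metrics") or {},
            "severity": to_severity(item.get("severity")),
        })
    return findings


async def analyze_all(categories, analyze_one) -> list[dict]:
    """One category failing is logged and skipped; the audit keeps going."""
    findings = []
    for category in categories:
        try:
            findings.extend(await analyze_one(category))
        except Exception:
            log.exception("analysis failed for %s; skipping", category)
    return findings


built = build_findings([
    {"check_id": "c1", "claim": "shared_buffers is 32MB, too low", "severity": "HIGH"},
    {"check_id": "c2", "claim": ""},   # empty claim: skipped
    {"claim": "no check id"},          # missing check_id: skipped
    {"check_id": "c3", "claim": "autovacuum is off", "severity": "catastrophic"},
])
print([(f["check_id"], f["severity"].name) for f in built])
# [('c1', 'HIGH'), ('c3', 'MEDIUM')]


async def fake_analyze(category: str) -> list[dict]:
    # Stand-in for "call the provider, parse, build" with one category failing.
    if category == "replication":
        raise RuntimeError("provider timeout")
    return build_findings([{"check_id": f"{category}.demo", "claim": f"{category} looks off"}])


results = asyncio.run(analyze_all(["memory", "replication", "vacuum"], fake_analyze))
print(len(results))  # 2
```

The failing category shows up in the log with its traceback, and the other two still produce findings.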

That’s the foundation. Run the tests:

git checkout tutorial-1-foundations-complete
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/unit/test_finding_model.py tests/unit/test_analyzer_findings.py -v

You’ll see the parser tests do the unglamorous work: empty string → [], fenced JSON → parsed, "not json at all" → [], an item with an empty claim → skipped. That test file is the contract for “the model can return garbage and we stay standing.”

How this got built (the actual harness loop)

Because the other half of this repo is “how do you build this with an agent,” the real loop, not a sanitized one:

  1. Brainstorm → spec. Decided Approach A (Finding as a first-class object) over two alternatives, wrote it to a design spec, committed it.
  2. Plan. Turned the spec into a task-by-task plan with the exact failing tests and code — Finding model, DB table, prompt, analyzer, engine wiring, report rendering.
  3. Execute, one task per fresh subagent. Each task: write the failing test → watch it fail → implement → watch it pass → commit. Then a separate reviewer subagent checked the diff against the spec and for quality. That review is what caught the FIPS hash and the backtick-strip bug — neither of which any test would have flagged.

The lesson that generalizes past this repo: per-task reviews verify units; a wide review verifies the seams. The two nastiest bugs lived in the seam between “what the model returns” and “what your code trusts.” Be paranoid exactly there.

What you’ve got, and what’s next

You now have structured, identity-stable Findings coming out of any of four LLM providers, parsed from untrusted text without crashing. There’s nothing checking whether the findings are true yet. Verifying that is the entire point of Tutorial 2, where we ground every cited number against the real telemetry (for free, no tokens) and then put a multi-model jury on the survivors.

Check out tutorial-2-judge-complete when you’re ready to make the AI prove it isn’t lying.