Goldfish
Trustworthy AI · advanced · 64 min

Tutorial 3 — Memory & Cost: Stop Re-Diagnosing (and Re-Paying for) the Same Problems

Add cross-run memory and incremental skipping so the tool stops re-diagnosing the same issues and an unchanged re-run costs a fraction of the first.

Matt Yonkovit

Build-along — part 3 of the evolution, the finish line (after Tutorial 2; the arc starts at Tutorial 0).

The problem: the tool has amnesia and overspends — it re-diagnoses every problem from scratch on every run, can’t tell you what changed, and pays full token price for data that didn’t move. By the end: findings that remember their lineage across audits, prior findings injected so the model stops re-diagnosing, and incremental skipping that cuts the cost of an unchanged re-run by ~64%. Prereqs: starts from tutorial-3-start (the end of Tutorial 2); finished state is tutorial-3-memory-complete.

Get something to run against

Memory and cost only show up across repeated runs, so you need a database to audit more than once. Grab the sample dataset — a deliberately-misconfigured Postgres:

# Download ../sample/pg-healthcheck-sample-data.zip, unzip, then:
docker compose -f docker-compose.demo.yml up -d
pg-healthcheck audit --host localhost --port 5432 --db demo --user postgres --password demo_password --quick   # run it twice
pg-healthcheck compare <id-a> <id-b>                                                   # see what changed

What’s intentionally broken: ../sample/README.md. Example output: ../sample/EXAMPLE-REPORT.md.

The dumbest thing your AI tool does

Tutorials 1 and 2 gave us structured, verified findings. Run an audit, get a trustworthy report. Run it again tomorrow, and watch what happens: the tool re-collects the same metrics, sends them to the model, and the model dutifully re-discovers that shared_buffers is still 32MB and re-explains, in fresh prose, the exact same thing it told you yesterday. You pay full token price for the privilege.

Two problems, one root cause. The tool has no memory. So:

  1. It can’t tell you what changed since last time, which is the only thing a returning user actually cares about.
  2. It burns money re-analyzing checks whose data is byte-for-byte identical to the last run.

This tutorial fixes both, and the fixes share a foundation: a finding that knows its own history. We borrowed the shape of this from a sibling agent-memory project (internally, “benta”) — the good idea there is that memory isn’t a log, it’s a set of facts that supersede each other over time.

Findings that remember themselves

Remember make_id from Tutorial 1? A finding’s identity is hash(check_id + normalized claim). The same problem on the same check produces the same id, run after run. That’s the hook. If a finding’s id showed up in a previous audit, it’s the same finding, and we can carry its history forward.

reconcile_findings does exactly that, and it makes one design choice I want to defend:

def reconcile_findings(new_findings, prior_findings, current_audit_id):
    prior_by_id = {p["finding_id"]: p for p in _active(prior_findings)}
    new_ids = {f.finding_id for f in new_findings}
    claimed_checks = set()

    for f in new_findings:
        f.last_seen_audit = current_audit_id
        prior = prior_by_id.get(f.finding_id)
        if prior:                                          # persistent: seen before
            f.first_seen_audit = prior.get("first_seen_audit") or prior["audit_id"]
            continue
        f.first_seen_audit = current_audit_id              # new this run
        if f.check_id not in claimed_checks:               # changed: same check, new claim
            f.supersedes = [p["finding_id"] for p in prior_by_id.values()
                            if p["check_id"] == f.check_id and p["finding_id"] not in new_ids]
            if f.supersedes:
                claimed_checks.add(f.check_id)
    return new_findings

The defense: we never rewrite old audit snapshots. A naive memory system would reach back and flip last week’s finding to status = "resolved". Don’t. An audit is a photograph of a moment; mutating it later is lying about history. Instead, lineage is forward-recorded — the new finding carries first_seen_audit (how long has this been broken?) and supersedes (what older finding did this replace?). “Resolved” is computed when you compare two audits, never persisted by vandalizing the past.

(The review caught a sharp edge here too: the first version let two new findings on the same check both claim to supersede the same prior one — two children, one parent, ambiguous lineage. The claimed_checks set fixes it: only the first new finding on a check inherits the supersession. Small, but the kind of thing that makes a “what changed” diff lie if you skip it.)

Now the second half of memory. Before generating findings, the engine fetches the prior audit’s findings and injects them into the prompt: “you found these last time; tell me what still applies, don’t re-derive from scratch.” The model stops reinventing the wheel every run. It’s a few lines, and it’s also a quiet prompt-injection-hygiene lesson — the prior claims get their newlines stripped before they go into the prompt, because a finding’s text is data, and data doesn’t get to forge new instructions.

The compare payoff

With lineage recorded, compare stops being a score diff and becomes a finding-level story:

Findings Changes:
  Resolved: 2   Changed: 1   New: 3   Persistent: 8

classify_against_prior computes those buckets fresh from any two audits — persistent (same id), changed (same check, different claim), new (check had nothing before), resolved (prior finding whose check produced nothing this time). That’s the report a returning DBA actually reads. Not “you’re a B+ again,” but “the vacuum problem you fixed is gone, and three new things showed up.”

Now make it cheap

This is the part that turns a nice idea into a number your finance team likes. If a check’s collected data is identical to last audit, why would you pay a model to re-analyze it? You wouldn’t.

def changed_check_ids(current_details, prior_details) -> set[str]:
    changed = set()
    for cid, details in current_details.items():
        if cid not in prior_details or _details_hash(details) != _details_hash(prior_details[cid]):
            changed.add(cid)
    return changed

Hash every check’s details (sorted keys, so ordering doesn’t create false diffs). Anything new or changed goes in the set. Then the analyzer skips any category whose checks are all unchanged — it reuses the prior findings instead of calling the model. And it’s conservative on purpose: a single changed check in a category sends the whole category back to the LLM, so a real change can never be silently skipped.

The judge plays along. A finding carried forward from last audit already has a verdict, so the judge skips re-voting on it:

for f in findings:
    if f.verdict and f.verdict != "unverified":
        continue   # already judged last run; don't re-spend tokens
    ...

Put those two skips together and run an audit twice against an unchanged database. The test that proves it:

assert second_run_calls < first_run_calls   # 2nd run = identical telemetry → strictly fewer calls

End-to-end, that unchanged re-run dropped from 31 LLM calls to 11 — about 64% fewer. (The exact figure lives in the build log; the test itself just guarantees the second run is always cheaper.) So what does 64% mean on your Tuesday morning? The boring scheduled re-audit that finds nothing new costs a third of what it used to, and the spend concentrates on the checks that actually changed. Token discipline isn’t a vibe here — there’s a test that fails if an unchanged re-run ever costs as much as the first.

An honest boundary

Full disclosure, because the review insisted on it: the per-category prose layer still runs ungated, so the residual 11 calls are mostly that. The findings and judge layers are gated; the prose layer is the next slice. I’m telling you that instead of rounding “64% fewer on two of three layers” up to “cuts your LLM bill,” because an overstated lesson teaches worse than an honest partial one. (That’s also why it’s written down in the build log. Future me doesn’t get to forget.)

A summary that can’t lie

Last piece. The executive summary used to be free-form prose the model wrote however it liked — which means it could drift right back into hallucinating numbers, the exact problem we spent Tutorial 2 killing. So the narrative distiller flips it: build a deterministic skeleton from the verified findings (counts by severity, the new/changed/resolved deltas, the top issues), and hand the model only those facts with a “rewrite this, invent nothing” instruction.

async def distill_narrative(self, audit_result, findings, prior_findings=None):
    skeleton = self._narrative_skeleton(audit_result, findings, prior_findings or [])
    if self._provider is None:
        return skeleton                       # no model? the skeleton is already a real summary
    return await self._provider.analyze(_NARRATIVE_SYSTEM, skeleton)

The model polishes the prose; it never sources the facts. And with no provider configured at all, you still get a usable summary — the deterministic skeleton stands on its own. The AI is a writing assistant here, not a source of truth. That’s the right job for it.

Run it

git checkout tutorial-3-memory-complete
pytest tests/unit/test_memory.py tests/unit/test_distill_narrative.py -v
pytest tests/unit/test_engine.py -k "across_audits or incremental" -v

test_engine_tracks_findings_across_audits proves a finding seen in audit 2 carries first_seen_audit from audit 1. test_engine_incremental_skips_llm_on_unchanged_rerun proves the second run is cheaper. Those two tests are memory and cost, respectively, made real.

You built it

Three tutorials ago, the tool piped data to a model and trusted the output. Now it emits structured findings, grounds every cited number against reality for free, juries the judgment calls across multiple models, remembers what it found last time, tells you what changed, and refuses to pay to re-discover the unchanged. That’s not “AI-powered” as a sticker. It’s AI used like you’d use any other component you don’t fully trust: with verification, memory, and a budget.

If you want the story of how the whole thing got built — the spec, the plans, the fresh-subagent- per-task loop, and every bug the reviews caught (there were good ones) — that’s in the build logs under docs/build-log/, and it’s the subject of the companion write-up. Go break something.