Goldfish

field-notes

Engineering War Story: The Day We Found Our Rescan Could Eat Your Findings

A confession about a bug in our own tool, and the two database quirks that made fixing it harder than it looked.

Matt Yonkovit · 5 min read

I’m going to do the thing vendors never do and tell you about a bug we shipped. Not a typo — a genuine data-loss hazard, sitting in the core of the assessment tool, in the exact operation users run most. We caught it in a pre-release hardening pass before it bit anyone, but it was real, and the fix taught us two things about databases that are worth more than the embarrassment costs me.

Here’s the setup. When you re-scan a project in Swordfish, the tool replaces the project’s findings: out with the old, in with the new. The code did the obvious thing.

# delete the old findings, then insert the new ones
DELETE FROM findings WHERE project_id = ?
# ... then, separately ...
create_findings_bulk(new_findings)

Read that and the bug is right there, but it’s the kind of obvious you only see once someone points at it. There’s no transaction around those two operations. The DELETE commits. Then the insert runs. If anything goes wrong between the delete and the end of the insert — the process crashes, the worker gets killed mid-operation, a single bad row throws — you’ve deleted the project’s findings and not replaced them. A rescan that fails partway doesn’t leave you where you started. It leaves you with nothing. The previous good scan is gone.

For an assessment tool, that’s about the worst failure mode there is. The whole value proposition is “we tell you what’s in your codebase.” Losing that on a routine re-run is unacceptable, full stop.

The obvious fix, and why it wasn’t

“Just wrap it in a transaction.” Right. Obviously. So we did — and immediately ran into the fact that “wrap it in a transaction” means two different things on the two databases we support, and both have a trap.

SQLite doesn’t auto-begin a transaction before DDL. Our migration system (separate code, same class of fix) ran multiple statements and we wrapped them in a transaction so a failed migration would roll back cleanly. Except on the first rollback test, the rolled-back changes… weren’t rolled back. The schema change survived. It turns out SQLite’s default driver mode auto-begins a transaction before DML (INSERT/UPDATE/DELETE) but not before DDL (CREATE/ALTER) — so a CREATE TABLE was quietly running in autocommit mode and surviving the rollback we thought wrapped it. The fix was to issue an explicit BEGIN for SQLite so DDL actually participates in the transaction. If you’ve never been bitten by this, you will be the first time you assume “BEGIN; …; ROLLBACK” protects a CREATE TABLE. It doesn’t, not by default.

PostgreSQL aborts the entire transaction on one bad statement. The findings insert processes rows in bulk, with a fallback: if a chunk fails, retry row by row and salvage the good rows. That logic is fine outside a transaction. Inside a Postgres transaction it’s a disaster, because in Postgres a single failed statement poisons the whole transaction — every subsequent statement errors with “current transaction is aborted” until you roll back. So our nice row-by-row salvage, wrapped naively in a transaction, would turn “one bad row” into “all rows lost,” which is the exact opposite of salvage. The fix was a SAVEPOINT per chunk/row: a bad row rolls back to its savepoint without poisoning the transaction, the good rows commit, and the dropped-row count gets surfaced instead of swallowed.

What it actually took

The real fix wasn’t one line. It was a proper transaction() primitive in the database layer that does the right thing on both backends: an explicit BEGIN on SQLite so DDL is covered, a dedicated pooled connection with a real transaction on Postgres, savepoints around the salvageable bulk operations, and rollback-on-anything. Then the finding-replace and the migrations both route through it, so a failure leaves your data exactly as it was before. We added tests that force a failure mid-insert and assert the prior findings survive intact — because the only way to trust a data-integrity fix is to reproduce the data loss and watch it not happen.

Why I’m telling you this

Two reasons. The first is that these two quirks — SQLite not auto-beginning on DDL, Postgres aborting the whole transaction on one error — are exactly the kind of behavioral difference this entire tool exists to catch in your migrations. We got caught by the same class of cross-database behavioral trap we warn customers about. There’s no better proof that this stuff is real and sneaky than the fact that the people who built a tool to find it still managed to ship an instance of it. Databases are not interchangeable, and the differences hide in the operations you’d never think to test.

The second reason is trust. A tool that handles your source code and your findings should be honest about its own failure modes, and a vendor that pretends its software never had a serious bug is a vendor who either isn’t looking or won’t tell you. We looked, we found a bad one, we killed it before release, and I’d rather you hear that from me than wonder. The dog food tastes like everyone else’s. We just check it for glass before we serve it.

Next, and last in the lessons series: why a tool that reads your entire codebase has no business being a cloud service that keeps it.


Swordfish is an open-source (Apache-2.0) assessment harness for migrating Oracle, MySQL, SQL Server, Sybase, and DB2 to PostgreSQL — it shows you what’s in your codebase, what needs to change, and hands scoped tasks to the copilot you already use. Source: github.com/EnterpriseDB/swordfish-migrations