The Stuff That Isn't in the Code: Inferred Knowledge and Why Migrations Lose It

Your application knows things it never wrote down. The migration is where it forgets.

Matt Yonkovit · 6 min read

There’s a comment in a codebase I worked on years ago that I think about more than is healthy. It said “-- do not remove, breaks billing.” That’s it. No ticket number, no name, no explanation of what breaks or why. Just a tiny prayer left by someone who’d been burned, taped to a line of SQL like a “wet floor” sign over a hole in the ground.

That comment is the good case. At least somebody warned you. The terrifying version is the same load-bearing logic with no sign at all, because the person who understood it assumed everyone else understood it too. Then they left. And now it’s just a number in a WHERE clause, doing something critical for a reason nobody can name.

This is inferred knowledge, and it’s the single most underestimated risk in any database migration. The schema is documented in the catalog. The code is at least present, even when it’s ugly. But the knowledge about what the code means — that lives in people’s heads, in Slack history nobody can search anymore, in the muscle memory of the one engineer who’s been there nine years. None of it ships with the repo. And a migration is precisely the operation that snaps every one of those invisible threads at once.

The four flavors of knowledge that isn’t there

I’ve started sorting these into buckets, because naming the monster helps you hunt it. There are roughly four kinds.

Magic values with social meaning. The status = 3 from my last post. The region IDs that skip 4. The account_type IN ('A','C','F') where everyone “knows” F means “former employee, comp account, do not bill.” These aren’t in a lookup table. They’re conventions, and conventions are exactly what a migration tool, an ORM, and a coding agent all treat as opaque literals. The value ports perfectly. The meaning doesn’t come along.

Behavioral assumptions baked into the engine. Code that relies on how a specific database behaves, not on anything in the SQL standard. Oracle treating empty string as NULL. MySQL before 8.0 handing back GROUP BY results already sorted by the grouping columns, no ORDER BY required, because that’s what the old implementation happened to do. SQL Server’s MONEY type and its fixed four decimal places of rounding. The code never says “I depend on this.” It just quietly does, and it’s been right for a decade because the engine never changed underneath it. Change the engine and the assumption is suddenly, silently false.
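To make the empty-string case concrete, here is a small sketch using Python’s built-in sqlite3 module. SQLite is only a stand-in here: like PostgreSQL, it keeps '' and NULL distinct. The customers table and its columns are made up for illustration.

```python
import sqlite3

# SQLite stands in for an engine that keeps '' and NULL distinct,
# which is also how PostgreSQL behaves. Table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, middle_name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", ""), ("Bob", None)],
)

# On Oracle, Ann's '' would have been stored as NULL, so this filter
# would match both rows. Here it matches only Bob.
missing = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE middle_name IS NULL"
).fetchone()[0]

# On Oracle this comparison would match nothing. Here it matches Ann.
empty = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE middle_name = ''"
).fetchone()[0]

print(missing, empty)  # prints: 1 1
```

Same data, same SQL, different answer. Nothing in the query text tells you which answer the original author was counting on.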

Ordering and timing nobody enforced. The report that’s correct only because a batch job runs at 2am before the aggregation job at 3am. The query that returns rows “in the right order” with no ORDER BY, because the old storage engine happened to hand them back by insertion order and some downstream code now depends on it. None of that is written as a constraint. It’s enforced by cron, by habit, and by the fact that nothing has perturbed it yet. A migration perturbs everything.

Defensive scar tissue. The isolation level somebody bumped to SERIALIZABLE in 2015 after a race condition ate an order, with no comment explaining why. The retry loop wrapped around one specific call. The WITH (NOLOCK) sprinkled across reports because a DBA got tired of lock contention. Every one of those is a fix for a real bug that happened to a real person, and every one of them encodes a hard-won lesson that is completely invisible to anyone reading the code fresh — human or machine.

Why a tool can’t just recover it

Here’s the uncomfortable truth I have to be honest about, because it shapes how we built Swordfish: you cannot automatically recover inferred knowledge, because by definition it isn’t in the inputs. No model, no matter how big, can read status = 3 and know that 3 means cancelled, that there’s a fourth status everyone forgot about, or that the original intent was “cancelled OR refunded” and the code has quietly had a bug since 2019. The information required to answer those questions does not exist in the codebase. It exists in Dave. And we cannot ship Dave.

Anyone who tells you their AI “understands your business logic” is doing one of two things. They’re either pattern-matching against common conventions and getting lucky on the easy 70% — which is genuinely useful, right up until the 30% where your business is weird, and every business is weird in its own way. Or they’re hallucinating an explanation that sounds authoritative, which is strictly worse than saying nothing, because now there’s a confident, wrong comment in your migrated code that the next engineer will trust.

So we made a deliberate choice. Swordfish doesn’t pretend to know what status = 3 means. What it does instead is the thing a tool actually can do: it finds the places where inferred knowledge is most likely to be hiding and hands you a checklist.

What you can actually do about it

The move isn’t recovery. It’s surfacing candidates for human review — making the invisible at least visible, so a person who might still know the answer gets asked the question before it’s too late.

Concretely, that means flagging the patterns that correlate with hidden meaning: magic numbers and string literals sitting in WHERE clauses and CASE expressions; queries that depend on implicit ordering (a result set consumed positionally with no ORDER BY); engine-specific behavioral traps from the catalog (the empty-string, collation, and rounding stuff), because those are exactly where a behavioral assumption is probably load-bearing; and non-default isolation levels and locking hints, because those are scar tissue, and scar tissue marks an old wound worth understanding.
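As a rough illustration of two of those flags (implicit ordering and scar tissue), here is a minimal Python sketch. It is not how Swordfish does it; it is a handful of regexes, not a SQL parser, and the names flag_statement and SCAR_PATTERNS are invented for this example. It also assumes READ COMMITTED is the default isolation level, which is not true everywhere: MySQL’s InnoDB defaults to REPEATABLE READ.

```python
import re

# Hypothetical heuristics for illustration only. Regexes will miss cases
# and raise false alarms that a real SQL parser would not.
SCAR_PATTERNS = {
    "locking hint": re.compile(r"WITH\s*\(\s*NOLOCK\s*\)", re.IGNORECASE),
    # Assumption: anything other than READ COMMITTED counts as non-default.
    "isolation level": re.compile(
        r"ISOLATION\s+LEVEL\s+"
        r"(SERIALIZABLE|REPEATABLE\s+READ|READ\s+UNCOMMITTED)",
        re.IGNORECASE,
    ),
}
SELECT_RE = re.compile(r"^\s*SELECT\b", re.IGNORECASE)
ORDER_BY_RE = re.compile(r"\bORDER\s+BY\b", re.IGNORECASE)


def flag_statement(sql: str) -> list[str]:
    """Return review questions for one statement; empty list means no flags."""
    flags = []
    for label, pattern in SCAR_PATTERNS.items():
        if pattern.search(sql):
            flags.append(f"{label}: who added this, and what bug did it fix?")
    # A SELECT with no ORDER BY anywhere in it may be consumed positionally.
    if SELECT_RE.search(sql) and not ORDER_BY_RE.search(sql):
        flags.append("no ORDER BY: does any consumer depend on row order?")
    return flags


statements = [
    "SELECT id, total FROM orders WITH (NOLOCK)",
    "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE",
    "SELECT id FROM orders ORDER BY id",
]
for stmt in statements:
    for question in flag_statement(stmt):
        print(f"{stmt!r} -> {question}")
```

Notice that the output is phrased as questions. That is deliberate: the script has no idea whether the NOLOCK hint matters, only that somebody should ask.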

None of those flags is an answer. Every one of them is a question, routed to the human most likely to know — ideally while that human still works there. That’s the whole game. A migration that interrogates its inferred knowledge on purpose, early, with a checklist, beats one that discovers it in production when the billing report comes out wrong and the only person who understood it is three jobs away and not returning LinkedIn messages.

So before you migrate anything, go find your magic numbers. Grep your WHERE clauses for bare integers and string literals. Make a list. Then walk it over to the longest-tenured person on the team and ask, one by one, “what does this mean, and what breaks if it’s wrong?”
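If plain grep feels too blunt, a few lines of Python get you a cleaner list. This is a sketch under heavy assumptions: it uses regexes rather than a SQL parser, it will be confused by nested subqueries and by comments, and magic_values is a name I made up for the example.

```python
import re

# Text after WHERE, up to the next clause keyword, a semicolon, or the end.
# A heuristic only: nested subqueries and comments will trip it up.
WHERE_RE = re.compile(
    r"\bWHERE\b(.*?)"
    r"(?=\bGROUP\s+BY\b|\bORDER\s+BY\b|\bHAVING\b|\bLIMIT\b|;|$)",
    re.IGNORECASE | re.DOTALL,
)
# Either a single-quoted string, or an integer that is not part of an
# identifier or a decimal number.
LITERAL_RE = re.compile(r"'(?:[^']|'')*'|(?<![\w.])\d+(?![\w.])")


def magic_values(sql: str) -> list[str]:
    """List the literals found inside WHERE clauses, in order of appearance."""
    found = []
    for clause in WHERE_RE.findall(sql):
        found.extend(LITERAL_RE.findall(clause))
    return found


query = (
    "SELECT * FROM accounts "
    "WHERE status = 3 AND account_type IN ('A','C','F') "
    "ORDER BY id"
)
print(magic_values(query))  # prints: ['3', "'A'", "'C'", "'F'"]
```

Run that over every query you can extract and you have the list to walk over to that longest-tenured person.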

Write down the answers this time. Put them somewhere a machine can read. That’s not migration busywork — that’s you doing the one part of this job that no tool, no model, and no amount of compute will ever do for you. The code remembers what it does. Only you remember why.
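What “somewhere a machine can read” looks like is up to you. One possible shape, sketched in Python and serialized to JSON: the field names, the file location, the date, and the breaks_if_wrong text are all invented for illustration, and only the meaning of F comes from the example earlier in this post.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MagicValueNote:
    """One answered question about one literal. All fields are hypothetical."""
    location: str         # where the literal lives (file and line, or query name)
    expression: str       # the literal in context
    meaning: str          # what the person said it means
    breaks_if_wrong: str  # the consequence they described
    source: str           # who answered
    confirmed_on: str     # ISO date of the conversation


notes = [
    MagicValueNote(
        location="billing/monthly_report.sql:42",  # made-up path
        expression="account_type IN ('A','C','F')",
        meaning="F = former employee, comp account, do not bill",
        breaks_if_wrong="comp accounts would receive invoices",  # illustrative
        source="Dave",
        confirmed_on="2024-01-15",  # placeholder date
    ),
]

# JSON keeps the answers greppable and easy to check into the repo.
serialized = json.dumps([asdict(note) for note in notes], indent=2)
print(serialized)
```

The format matters much less than the habit. A file like this, checked in next to the SQL it describes, is the sign over the hole that the next person actually gets to read.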


Swordfish is an open-source (Apache-2.0) assessment harness for migrating Oracle, MySQL, SQL Server, Sybase, and DB2 to PostgreSQL — it shows you what’s in your codebase, what needs to change, and hands scoped tasks to the copilot you already use. Source: github.com/EnterpriseDB/swordfish-migrations