Security for a Tool That Reads Your Entire Codebase
A migration tool needs to read all of your proprietary source. That should make you suspicious — and it should shape how the tool is built.
Think about what a database migration assessment tool has to see to do its job. All of your application code. Your schema. Your stored procedures, which encode your actual business logic. The queries that touch your customer data. To find what needs to change, it needs access to essentially everything you’d least like to leak.
So the first question you should ask any such tool isn’t “how accurate is it?” It’s “where does my code go, and who else can touch it?” Because a tool with that much access, built carelessly, is a bigger risk than the migration it’s helping with. I think a lot of the industry has this backwards — racing on capability while hand-waving the trust model. For a tool that ingests your crown jewels, the trust model is the product. Here’s how we think about it.
Default to not phoning home
The single most important property: Swordfish is air-gap-safe by default. Out of the box it sends nothing to any external service. The rule engine is local. The embeddings run on a local ONNX model — no API, no key, no network. There’s no telemetry, no analytics, no license check phoning home, no “anonymized usage data.” If you choose to use LLM features, you point it at a model — and you can point it at one running on your own hardware, so even the AI tier stays inside your network.
That “by default” matters. Plenty of tools are technically self-hostable but ship pointed at someone’s cloud, so the safe configuration is something you have to discover and assemble. We inverted it: the private, offline configuration is the default, and reaching out is the thing you opt into deliberately.
Bind to loopback, not the world
Here’s a mistake that’s easy to make and embarrassing to ship: we caught our own backend binding to 0.0.0.0 (every network interface) by default, with no authentication. For a local dev tool that’s convenient; for anything a user actually deploys, it meant the assessment API (which can read every finding and every uploaded source file) was reachable by anyone on the network by default.
We fixed it to bind 127.0.0.1 (loopback, this-machine-only) by default, with a loud startup warning if you deliberately expose it. The principle: a tool holding your source code should be reachable by you, not by the network, unless you explicitly and knowingly decide otherwise. Exposure should be a choice you make, not a default you inherit.
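To make the loopback-by-default idea concrete, here is a minimal sketch in Python. It is an illustration of the pattern, not Swordfish's actual code: the `BIND_HOST` variable name, the function names, and the warning text are all assumptions made up for this example.

```python
import ipaddress
import logging

logger = logging.getLogger("bind_check")

DEFAULT_HOST = "127.0.0.1"  # loopback: reachable from this machine only


def is_loopback(host: str) -> bool:
    """Return True if host refers to this machine only."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname): treat it as exposed.
        return False


def resolve_bind_host(env: dict) -> str:
    """Pick the bind address: loopback unless the operator overrides it.

    BIND_HOST is a hypothetical setting name used only in this sketch.
    """
    host = env.get("BIND_HOST") or DEFAULT_HOST
    if not is_loopback(host):
        # Exposure is allowed, but never silent.
        logger.warning(
            "Binding to %s exposes the API beyond this machine. "
            "Anyone who can reach this address can reach the API.",
            host,
        )
    return host
```

Calling `resolve_bind_host(os.environ)` at startup would give you 127.0.0.1 unless someone set the override on purpose, and in that case the warning lands in the logs where it is hard to miss.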
Gate the dangerous operations
There are destructive admin operations — reset a project, clear findings, kill the worker. Those sit behind an optional admin token: set ADMIN_TOKEN and they require it; leave it unset and they’re allowed only on a loopback bind and refused on a network bind. So the dangerous endpoints can’t be both network-reachable and unauthenticated at the same time. You have to actively configure your way into an exposed-and-protected state; you can’t accidentally land in exposed-and-open.
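The rule above is small enough to write as a single decision function. This is a hedged sketch rather than the real implementation: the function name and parameters are hypothetical, and it assumes an empty ADMIN_TOKEN counts as unset.

```python
import hmac
from typing import Optional


def admin_allowed(
    configured_token: Optional[str],
    presented_token: Optional[str],
    bound_to_loopback: bool,
) -> bool:
    """Decide whether a destructive admin operation may proceed.

    configured_token: the value of ADMIN_TOKEN, or None if unset.
    presented_token:  the token sent with the request, if any.
    bound_to_loopback: True when the server listens on loopback only.
    """
    if configured_token:
        # Token configured: it is required on every bind address.
        if presented_token is None:
            return False
        # Constant-time comparison avoids leaking the token via timing.
        return hmac.compare_digest(
            configured_token.encode(), presented_token.encode()
        )
    # No token configured: allowed on loopback, refused on a network bind.
    return bound_to_loopback
```

The one combination that can never return True is "no token configured, network bind", which is exactly the exposed-and-open state the design is meant to rule out.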
Don’t let “read my repo” become “run arbitrary commands”
The tool can clone a git repo you point it at. That’s a feature, and it’s also an attack surface, because a URL handed to git can do more than clone if you’re not careful — non-standard schemes and option-injection can turn “clone this” into “run this.” So repo URLs are scheme-allowlisted (https, http, ssh only), option-injection is rejected, and the argument list to git is terminated so a crafted URL can’t smuggle in flags. The subprocess machinery generally runs with explicit argument lists, never through a shell, with the auto-approve “yolo” flags deliberately stripped. The tool does what you asked and nothing the input tried to sneak past it.
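Here is roughly what those three defenses look like together, as a Python sketch. It illustrates the technique under stated assumptions and is not lifted from Swordfish: the function names are hypothetical, and the exact checks in the real code may differ.

```python
import subprocess
from typing import List
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https", "http", "ssh"}


def validate_repo_url(url: str) -> str:
    """Return url unchanged if it looks safe to hand to git, else raise."""
    if not url or url.startswith("-"):
        # A leading dash would be parsed by git as an option.
        raise ValueError("repo URL must not start with '-'")
    if any(ch.isspace() or ord(ch) < 0x20 for ch in url):
        raise ValueError("repo URL must not contain whitespace or control characters")
    parts = urlsplit(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        # Rejects file://, ext::, and anything else off the allowlist.
        raise ValueError("repo URL scheme is not allowed")
    if not parts.hostname or parts.hostname.startswith("-"):
        # A host like "-oProxyCommand=..." can be read as an ssh option.
        raise ValueError("repo URL host is missing or looks like an option")
    return url


def build_clone_command(url: str, dest: str) -> List[str]:
    """Build an argument list for git clone; never a shell string."""
    if dest.startswith("-"):
        raise ValueError("destination must not start with '-'")
    # "--" ends option parsing, so nothing after it can be taken as a flag.
    return ["git", "clone", "--", validate_repo_url(url), dest]


def clone_repo(url: str, dest: str) -> None:
    """Run the clone with an explicit argument list and no shell."""
    subprocess.run(build_clone_command(url, dest), check=True)
```

One consequence of a strict scheme allowlist in this sketch: scp-style addresses such as `git@host:org/repo.git` have no scheme and get rejected, so they would need to be written as `ssh://` URLs instead.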
Why a migration tool shouldn’t be a SaaS that keeps your code
I’ll put the opinion plainly: I’m deeply skeptical of the model where you upload your entire proprietary codebase to someone’s cloud so their service can assess it and, in many cases, retain it. Even setting aside training-data questions, you’ve now got your most sensitive source sitting in a third party’s storage, in scope for their breaches, their subpoenas, their employees, their retention policy. For a regulated shop it’s a non-starter; for everyone else it’s a risk you took on to save some setup time.
Self-hosted, offline-by-default, local-model-capable isn’t a limitation we apologize for — it’s the correct posture for software that has to read everything. The most secure place for your proprietary code during a migration is the same place it already lives: inside your network, under your control. A tool worth trusting with that access is one built to keep it there.
So when you evaluate anything that wants to read your whole codebase (migration tooling, AI code assistants, “modernization platforms”), make the trust model the first question, not the last. Where does the code go? What’s the default bind? Does it phone home? Can it run fully offline? The answers tell you whether the vendor treated your source as something to protect or something to collect. We wrote ours down in a SECURITY.md; ask for theirs.
That closes the lessons series. One post left: taking a single app all the way through, end to end.
Swordfish is an open-source (Apache-2.0) assessment harness for migrating Oracle, MySQL, SQL Server, Sybase, and DB2 to PostgreSQL — it shows you what’s in your codebase, what needs to change, and hands scoped tasks to the copilot you already use. Source: github.com/EnterpriseDB/swordfish-migrations