Why We Run Multiple LLMs (and Make Them Argue)

Ask a language model whether a chunk of PL/SQL will behave the same in PostgreSQL, and it’ll tell you. Clearly. Authoritatively. With a tidy little explanation. And a meaningful fraction of the time, it’ll be wrong — not “obviously broken” wrong, but “plausible, well-formatted, ships past review” wrong, which is the expensive kind.

I’ve spent the last few years elbow-deep in AI infrastructure, and the single most uncomfortable thing I’ve internalized is this: these models are optimized for fluency, not for honesty about uncertainty. They don’t have a little gauge that goes “eh, I’m only 60% on this one.” They generate likely next tokens, and the most probable phrasing of a guess looks exactly like the most probable phrasing of a fact. So if your migration tool asks one model one question and prints the answer, you’ve built a very expensive way to launder a guess into something that looks like ground truth.

That’s the problem we set out to not have.

What “make them argue” actually means

The fix isn’t a bigger model. The fix is to stop trusting any single model’s say-so on the findings that matter. In Swordfish, when an LLM-derived finding is consequential, we get a second opinion, sometimes a third, and we pay attention to where they disagree.

Three mechanisms do the work:

Multi-model validation. A finding flagged by one pass gets re-examined, and you can route that re-examination through a different model than the one that raised it. When two independent models, ideally from different families, both land on “yes, this NVL is sitting on top of the empty-string trap,” that agreement is worth a lot more than one model saying it twice. When they split, that split is information: it tells you which findings need a human’s eyes rather than a rubber stamp.

Confidence ratings that mean something. Instead of a single model’s self-reported “high confidence” (which, again, it’ll happily give you for a hallucination), confidence reflects corroboration: how many independent looks agreed, and how strongly. A finding that three models concur on should not wear the same badge as one that a single model emitted and two others couldn’t reproduce. They don’t.

Consolidation, so agreement isn’t drowned in noise. Run multiple models over thousands of code sites and you get the same underlying problem described five different ways: “possible NULL-handling difference,” “COALESCE semantics mismatch,” “empty string vs NULL issue.” A naive system shows you five findings. Ours detects the phrasing drift and groups them under one concern_key, so you see one real concern with its corroboration attached, not five rows that make your dashboard look like a crime scene and bury the signal.
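
To make the three mechanisms concrete, here is a minimal Python sketch of how they could fit together. It is not Swordfish’s actual code. The Verdict type, the CONCERN_ALIASES table, the model and file names, and the confidence labels are all assumptions made up for illustration; only the concern_key idea comes from the description above. Real phrasing-drift detection would need more than an exact-phrase lookup.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical phrase-to-key table (an assumption for this sketch).
CONCERN_ALIASES = {
    "possible null-handling difference": "empty_string_vs_null",
    "coalesce semantics mismatch": "empty_string_vs_null",
    "empty string vs null issue": "empty_string_vs_null",
}


@dataclass(frozen=True)
class Verdict:
    model: str    # which model produced this look
    site: str     # code location the verdict is about
    label: str    # the model's own wording for the concern
    agrees: bool  # True if this model says the concern is real


def concern_key(label: str) -> str:
    """Map a model's wording to a shared key; unknown wording keeps its own key."""
    normalized = " ".join(label.lower().split())
    return CONCERN_ALIASES.get(normalized, normalized)


def consolidate(verdicts: list[Verdict]) -> dict:
    """Group verdicts by (site, concern key) and rate each group by corroboration."""
    groups = defaultdict(dict)
    for v in verdicts:
        # One vote per model per concern: a model repeating itself adds nothing.
        groups[(v.site, concern_key(v.label))][v.model] = v.agrees

    report = {}
    for key, votes in groups.items():
        agree = sorted(m for m, ok in votes.items() if ok)
        dissent = sorted(m for m, ok in votes.items() if not ok)
        if len(agree) >= 2 and not dissent:
            confidence = "corroborated"
        elif agree and dissent:
            confidence = "disputed"      # the split is what a human should see
        elif agree:
            confidence = "unverified"    # one model, no second look yet
        else:
            confidence = "rejected"
        report[key] = {"confidence": confidence, "agree": agree, "dissent": dissent}
    return report


verdicts = [
    Verdict("model-a", "billing.pkb:212", "possible NULL-handling difference", True),
    Verdict("model-b", "billing.pkb:212", "COALESCE semantics mismatch", True),
    Verdict("model-c", "billing.pkb:212", "empty string vs NULL issue", True),
    Verdict("model-a", "orders.pkb:40", "implicit date conversion", True),
    Verdict("model-b", "orders.pkb:40", "implicit date conversion", False),
]
report = consolidate(verdicts)
for (site, key), entry in report.items():
    print(site, key, entry["confidence"], entry["agree"], entry["dissent"])
```

Three differently worded verdicts on the same site collapse into one row with three names behind it, and the disputed row keeps the dissenting model attached so a reviewer can see who disagreed.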

Why not just use one really good model?

Because the failure mode I care about isn’t “the model isn’t smart enough.” It’s “the model is confidently, fluently wrong and nothing tells me which findings those are.” A bigger model is more fluent, which can make that worse, not better — a more articulate wrong answer is a more convincing wrong answer.

Diversity beats raw capability for this specific job. Different model families have different training, different blind spots, different failure modes. When I want to know whether to trust a finding, “three different model families independently agree” is a far stronger signal than “the biggest model is really sure.” It’s the same reason you get a second medical opinion from a different doctor, not the same doctor saying it louder.

There’s a cost argument too, and it’s the one that makes this practical rather than precious. We don’t run the multi-model gauntlet on everything — that’d be wildly expensive. Remember the funnel from the heuristics post: deterministic rules handle the known patterns, and the LLM only touches the long tail. The multi-model treatment is reserved for an even smaller slice: the consequential, uncertain findings where a second opinion changes what a human does next. Spend the expensive, redundant compute exactly where being wrong is costly, and nowhere else.
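
The funnel can be sketched as a routing decision. This is an illustration under assumptions, not the project’s implementation: the Site fields and Route names are hypothetical, and how a finding gets marked consequential or uncertain is left as two placeholder flags, because that logic isn’t described here.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    RULE = "deterministic rule"
    SINGLE_LLM = "single LLM pass"
    MULTI_MODEL = "multi-model validation"


@dataclass(frozen=True)
class Site:
    matched_rule: bool   # a deterministic rule already covers this pattern
    consequential: bool  # placeholder flag: being wrong here is costly
    uncertain: bool      # placeholder flag: the first pass did not settle it


def route(site: Site) -> Route:
    """Send each code site to the cheapest stage that can handle it."""
    if site.matched_rule:
        return Route.RULE
    if site.consequential and site.uncertain:
        return Route.MULTI_MODEL
    return Route.SINGLE_LLM


sites = [
    Site(matched_rule=True, consequential=True, uncertain=False),
    Site(matched_rule=False, consequential=False, uncertain=True),
    Site(matched_rule=False, consequential=True, uncertain=True),
]
tally = Counter(route(s) for s in sites)
for stage, count in tally.items():
    print(f"{stage.value}: {count}")
```

The ordering is the point: the rule check comes first, so a known pattern never reaches a model, and only sites that are both consequential and uncertain pay for redundant looks.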

The honest part

I’m not going to tell you this makes the LLM findings correct. It doesn’t. It makes them honestly uncertain, which is a different and more useful thing. Multi-model validation, confidence scoring, and consolidation reduce the noise and surface the disagreement — they turn “an AI said so” into “two of three models agree, here’s the one that didn’t, here’s the code, you decide.” That’s not magic. That’s a system designed around the assumption that any individual model output might be garbage, and built to catch it before you act on it.

The principle generalizes well beyond migrations, and if you’re building anything that makes decisions on LLM output, steal it: never let a single model’s confidence be your signal. Cross-check the things that matter, treat disagreement as data, and collapse the redundant noise so the real disagreement is visible. The model that’s sure of itself is not the one you should be sure of. The three that agree after looking independently? Those you can start to trust.

Next in the series: the behavioral traps those models are arguing about — the “compiles fine, runs wrong” findings that are the entire reason any of this paranoia is justified.

Swordfish is an open-source (Apache-2.0) assessment harness for migrating Oracle, MySQL, SQL Server, Sybase, and DB2 to PostgreSQL — it shows you what’s in your codebase, what needs to change, and hands scoped tasks to the copilot you already use. Source: github.com/EnterpriseDB/swordfish-migrations