General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

Health tech companies spent years building specialized AI tools for medicine. GPT just walked in and beat them at their own game. If that doesn’t make the entire clinical AI industry sweat, nothing will.

A new study published in Nature Medicine found that general-purpose large language models — think GPT-4 and its successors — consistently outperform AI tools that were purpose-built for clinical use on standard medical benchmarks. Not by a little. By enough to matter. Enough to make hospital procurement teams ask some very uncomfortable questions about where their budgets just went.

The Specialists Are Losing to the Generalists

Here’s the setup: for years, health tech companies sold the idea that medical AI needed to be trained specifically on clinical data to be trustworthy. Protected health information, curated datasets, HIPAA-compliant pipelines — the whole architecture of specialized clinical AI was built on the premise that general models were too broad, too blunt, too uncontrolled for life-or-death decisions.

That premise just took a serious hit.

GPT-class models trained on the general internet — including medical literature, yes, but also Reddit threads and recipe blogs and god knows what else — are now performing better on medical reasoning tasks than the tools healthcare systems paid millions to deploy. That’s not a quirk. That’s a structural problem for an entire product category.

What the Benchmarks Actually Measured

The benchmarks in question tested clinical reasoning, diagnostic accuracy, and treatment recommendation logic. These aren’t soft metrics. They’re the kinds of tasks where wrong answers hurt people. The general-purpose models didn’t just pass — they led. Consistently.

Part of this is scale. GPT-class models are trained on more data, with more compute, by teams with more resources than most health tech startups could dream of matching. The specialized tools were built with good intentions and domain expertise. But intentions don’t close a compute gap.

Part of it is also that medical reasoning isn’t as alien as clinical AI vendors implied. It’s still language. It’s still logic. It’s still pattern recognition over text. A model that gets really, really good at all of those things doesn’t suddenly go stupid the moment you show it an EHR.

What This Means for Health Tech’s Billion-Dollar Bet

Hospitals and health systems have been spending heavily on clinical AI point solutions. Sepsis prediction tools. Radiology assistants. Prior authorization bots. Discharge summary generators. Many of these tools were built on older, smaller, more constrained models — and sold at premium prices justified by the claim that specialization equals reliability.

If a general-purpose model now outperforms those specialized tools on the benchmarks those tools were literally designed to ace, the business case starts to unravel fast. Why pay a clinical AI vendor’s licensing fees when you could build on GPT through an API at a fraction of the cost?

This mirrors something we’ve seen across the creator economy too. Platform lock-in felt safe until the underlying tools got commoditized. When the rails get cheap, the gatekeepers lose their grip. We wrote about how creator platforms are rewriting their revenue models in real time as AI changes what it costs to produce content — the same pressure is now hitting clinical AI from the other direction.

The Hot Take

Most clinical AI startups were never really selling AI. They were selling compliance theater. The pitch was never “our model is smarter.” It was “our model is auditable, regulated, and won’t get your legal team fired.” That’s a real value — but it’s a consulting value, not a technology value. Now that the underlying tech has blown past them, the compliance wrapper is the only thing keeping the lights on. That’s a very fragile business to be in.

The Regulatory Question Nobody Wants to Answer

Here’s where it gets politically messy. General-purpose LLMs aren’t cleared as medical devices. They don’t go through FDA review for clinical use. The specialized tools, however imperfect, at least exist inside a regulatory framework. If hospitals start quietly swapping them out for GPT wrappers because the benchmarks are better, they’re making a risk trade that nobody has officially signed off on.

And right now, the political environment around AI regulation is chaotic at best. The White House AI deal that would override state-level AI laws is still live and unresolved. The FDA’s approach to AI as a medical device is still evolving. Meanwhile, health systems are making real decisions in a regulatory vacuum.

There’s also a quieter concern worth raising. Patients using certain medical interventions sometimes change their behavior in unexpected ways: people on GLP-1 drugs such as Ozempic, for instance, started moving less, not more. Deploying more powerful AI in clinical settings could trigger its own behavioral shifts in clinicians, in patients, and in the culture of diagnosis itself. The benchmark score doesn’t capture any of that.

General-purpose LLMs winning on medical benchmarks isn’t the end of the story — it’s the beginning of a much harder conversation about what “better” actually means in healthcare, who gets to define it, and whether the institutions that built their business models on specialized AI can adapt fast enough to survive what just hit them.

Charles
