In 3-game healthcare series, underdog large language models sweep specialized AI powerhouses

Pitting three multipurpose LLMs against two healthcare-specific AI tools, researchers have found that consumer-level AI can soundly beat its purpose-built counterparts in healthcare scenarios.

To be sure, the one-size-fits-all models were among the most advanced in their respective makers’ frontier AI quivers—OpenAI’s GPT-5.2, Google’s Gemini 3.1 Pro and Anthropic’s Claude Opus 4.6. 

Still, their rivals in the academic showdown, the OpenEvidence platform and Wolters Kluwer’s UpToDate Expert AI, are purpose-built to aid clinical decision-making with evidence grounded in research.

The project was conducted at New York University and is described in an open-access study published this month by Nature Medicine.

Specialized tools got swept 

Graduate student Krithik Vishwanath, neurosurgeon Eric Oermann, MD, and colleagues put the models through their paces in three stages:

  • Posing 500 MedQA questions to test medical knowledge; 
     
  • Using 500 HealthBench items to measure alignment with clinicians; and 
     
  • Deploying a novel benchmark, Real Clinical Queries (RCQ), which the research team built from 100 de-identified queries that physicians submitted to a general-purpose language model in a live clinical environment.

For the RCQ benchmark, 12 U.S. clinicians performed randomized, blinded review of model outputs, producing 1,800 model-question annotations, the study report explains.

The evaluation’s key finding: The frontier LLMs “outperformed clinical AI tools in all three evaluations.”

At the same time, the clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ, the researchers report.

“These findings highlight the need for independent, real-world evaluation of AI tools before they enter clinical settings,” the study authors comment.

Scale, alignment, cross-domain reasoning  

In their discussion, Vishwanath and colleagues note that clinical AI tools often “carry institutional legitimacy and are likely safe for routine use.”

However, they deduce from their results, specialized clinical AI tools are not superior to frontier models on knowledge, communication or clinical alignment. 

“The superior performance of frontier models in our study suggests that scale, alignment and cross-domain reasoning may outweigh domain-specific tuning as determinants of medical competency for particular tasks,” they write, underscoring that this finding has implications for procurement, reimbursement and regulatory oversight. 

“The path forward may ultimately lie with hospital-specific LLMs that leverage institutional data to mitigate external harm, along with careful use of frontier models for less-sensitive tasks.”

More: 

‘As generative LLMs become integrated into healthcare at the enterprise, individual clinician and consumer levels, there is an increasing need for rigorous, independent evaluation on real-world tasks.’

Read the rest.

 


Dave Pearson

Dave P. has worked in journalism, marketing and public relations for more than 30 years, frequently concentrating on hospitals, healthcare technology and Catholic communications. He has also specialized in fundraising communications, ghostwriting for CEOs of local, national and global charities, nonprofits and foundations.
