This is a bit of a rant. Apologies.
I’m tired of seeing papers showing that commercial frontier models (ChatGPT, Claude, Llama, and so on) hallucinate and that they’re biased.
For example, Faithfulness Hallucination Detection in Healthcare AI (2024):
Faithfulness hallucinations, where AI-generated contents diverge from input contexts, pose significant risks in high-stakes environments like healthcare. In clinical settings, the reliability of systems is crucial, as any deviations can lead to misdiagnoses and inappropriate treatments… The findings highlight the necessity for robust hallucination detection methods to ensure reliability of AI applications in healthcare.
But no-one is proposing that high-stakes clinical reasoning or medical diagnosis will be carried out by ChatGPT or Llama (the study above used Llama 3); remember, Llama is a general-purpose open-source model developed by Meta, trained on publicly available web data! This isn’t going to be the foundation for anything truly consequential in the world, like supporting healthcare systems.
In addition, for every paper emphasising that language models make mistakes, we have others showing that they do relatively well. For example: Eriksen, A., et al. (2023). Use of GPT-4 to Diagnose Complex Clinical Cases. New England Journal of Medicine.
GPT-4 correctly diagnosed 57% of cases, outperforming 99.98% of simulated human readers generated from online answers. We highlight the potential for AI to be a powerful supportive tool for diagnosis; however, further improvements, validation, and addressing of ethical considerations are needed before clinical implementation.
But even this study was out of date the moment the PDF hit the NEJM servers. Soon after it was published, OpenAI released GPT-4o, which addressed some of the challenges highlighted by the authors.
Companies like Google DeepMind are building medical AI from the ground up, using the available literature, clinical practice guidelines, and clinical expert input. And these platforms are far more likely to be the ones integrated into future healthcare products than a simple wrapper around the GPT or Claude API. Granted, the fine-tuned medical models aren’t available for most researchers to use, but we know they exist, and we know they’re very good.
Just look at the progression of Google’s Med-PaLM and AMIE models:
Singhal, K., et al. (2022). Large Language Models Encode Clinical Knowledge (No. arXiv:2212.13138). arXiv. http://arxiv.org/abs/2212.13138
Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset, including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians.
Singhal, K., et al. (2023). Towards Expert-Level Medical Question Answering with Large Language Models (No. arXiv:2305.09617). arXiv. http://arxiv.org/abs/2305.09617
The MedLM models are based on Med-PaLM 2, the second iteration of Google’s large-scale medical language models. Med-PaLM 2 has improved by 18% over its predecessor this year, achieving 85% accuracy, which Google equates to the level of a medical specialist… Med-PaLM 2 was tested against 14 criteria, including scientific factuality, accuracy, medical consensus, reasoning, bias, and harm, evaluated by clinicians and non-clinicians from diverse backgrounds and countries.
McDuff, D., et al. (2023). Towards Accurate Differential Diagnosis with Large Language Models (No. arXiv:2312.00164). arXiv. http://arxiv.org/abs/2312.00164
Our study suggests that our LLM for DDx has potential to improve clinicians’ diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients’ access to specialist-level expertise.
Tu, T., et al. (2024). Towards Conversational Diagnostic AI (No. arXiv:2401.05654). arXiv. http://arxiv.org/abs/2401.05654
Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large Language Model (LLM) based AI system optimized for diagnostic dialogue… AMIE demonstrated greater diagnostic accuracy and superior performance on 28 of 32 axes according to specialist physicians and 24 of 26 axes according to patient actors.
To be clear, these fine-tuned medical models still have significant gaps in competence and capabilities. They still make mistakes and they’re not perfect. But let’s not kid ourselves that human doctors are perfect either. We need to stop comparing AI to the best possible human and start comparing it to the best available human.
The NHS isn’t suddenly going to start using ChatGPT or Llama for medical diagnosis and clinical reasoning. But they might look at future variants of AMIE, developed in specific medical contexts and with a demonstrable track record of improving over time.
Stop doing research showing that ChatGPT, Llama, Claude, and all the other models hallucinate, and that we therefore shouldn’t trust them to make high-stakes decisions. We know they hallucinate. We know they’re biased. It says so on the tin.

We’re in this weird twilight zone, where some researchers are publishing papers showing that ChatGPT and Llama still hallucinate (we know this), and other researchers are publishing papers arguing that, because they hallucinate, we shouldn’t trust them (we also know this). Meanwhile, Google is just getting on with building medical AI systems like AMIE, which are very good.
And no-one seems to be paying attention.
End of rant. I know that publishing is nuanced, and that even studies showing things we already know can have value. However, I feel like I’m seeing more and more papers that have little to offer besides “don’t trust LLMs” and “ethics!”