Ayers, J. W., et al. (2023). Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine.
In this cross-sectional study of patient questions posted to a public online forum, chatbot responses were longer than physician responses, and the study’s health care professional evaluators preferred the chatbot-generated responses over the physician responses roughly 4 to 1. Chatbot responses were also rated significantly higher for both quality and empathy, even when compared with the longest physician-authored responses.
Patients aren’t going to care whether their health care needs are met by a physician or a language model. All they’re going to care about is that those needs are met cheaply and competently.
The authors note that clinicians’ roles encompass far more than simply answering questions, so the results of this study should be considered in that context: answering health-related questions better than humans doesn’t mean that chatbots can replace doctors.
Some points to note:
- The researchers don’t control the underlying language model, which undergoes frequent, non-transparent changes in the background.
- Back-end changes to the LLM will probably improve the quality of the language, but not necessarily the accuracy of the responses, because GPT isn’t trained specifically on medical questions.
- Submitting the same question twice to the same model won’t generate the same response, so the comparison between GPT and physician responses is brittle; the next regeneration of the ChatGPT response could have been worse than the clinician’s response (although this seems unlikely). The sketch after this list shows how much of this variability can, and can’t, be controlled.
- Articles like this represent a snapshot in time, because the model is constantly changing. It’s not like using SPSS to run a statistical analysis on a dataset, where the software is stable enough that you’ll get the same output every time.
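
To make the reproducibility point concrete, here’s a minimal sketch of what controlling these variables would look like when querying the model through the API rather than the ChatGPT web interface (which the study used, and which exposes none of these controls). This assumes the OpenAI Python client; the model snapshot, seed, and question are illustrative, not what Ayers et al. used.

```python
# Minimal sketch: pinning a dated model snapshot and fixing sampling
# parameters to make LLM-based comparisons more reproducible.
# Assumes the OpenAI Python client (pip install openai); the model name,
# seed, and question below are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "I swallowed a toothpick by accident. Should I be worried?"

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",  # dated snapshot, not the moving "gpt-3.5-turbo" alias
        temperature=0,               # near-greedy decoding reduces run-to-run variation
        seed=42,                     # best-effort determinism; not guaranteed by the API
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Even with a pinned snapshot, temperature=0, and a fixed seed, repeated
# calls are not guaranteed to be identical -- only far more stable than
# the ChatGPT web interface, which exposes none of these knobs.
print(ask(QUESTION))
print(ask(QUESTION))
```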

It’s worth noting that this research was done with GPT-3.5, so the result is most likely weaker than what we would see from the current (GPT-4) version of the model. In addition, the responses were obtained through ChatGPT, which exposes only a subset of the full GPT capability.
We’re also seeing the emergence of language models fine-tuned on medical data (for example, Google’s Med-PaLM), and it’s safe to assume that models like these will significantly outperform vanilla ChatGPT.
So, to recap: this study found that ChatGPT responses were rated higher than physician responses on both quality and empathy, without using a language model fine-tuned on medical data, or even the most recent version of GPT.
One limitation not mentioned in the article is that it’s fairly obvious which responses were generated by ChatGPT and which by physicians. Even though the authors blinded the evaluators to the source of each response, I find it hard to believe the evaluators couldn’t tell which was which. Have a look at the two responses below and see if you can tell which one was written by a human.
