Singhal, K., et al. (2023). Towards Expert-Level Medical Question Answering with Large Language Models (arXiv:2305.09617). arXiv.
From the abstract: We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form “adversarial” questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
See also this brief overview from The Verge.
Here’s another one: Beam, K., et al. (2023, July). Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination. JAMA Pediatrics.