Michael Rowe

Trying to get better at getting better

Matthias Bastian (2024-10-30). OpenAI Releases SimpleQA Benchmark to Test AI Model Factual Accuracy.

Many academics seem to be quite gleeful about the results of the study reported at the link above, so I wanted to take a moment to clarify a few things.

First of all, the context of the study:

The SimpleQA test includes 4,326 questions across science, politics, and art. Each question was designed to have only one clearly correct answer, verified by two independent reviewers. The low percentage of correct answers must be understood in the specific context of SimpleQA’s methodology: The researchers only included questions where at least one of the GPT-4s used to generate most of the data gave an incorrect answer…

In other words, this is a test designed specifically to evaluate a model’s ability to answer difficult questions that have a single correct answer, and on which GPT-4 had already failed. The whole point of the test is to explore the problem space where we know models don’t do well. The low numbers are an expected outcome, not a surprising result.

This also means that the reported percentages reflect the performance of the models on particularly difficult questions, not their general ability to give correct answers to factual questions.
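To make that scoring concrete, here’s a minimal sketch of how a SimpleQA-style benchmark might tally results. This is my own illustration, not OpenAI’s pipeline: the exact-match grade() function is a toy stand-in (OpenAI grades answers with a language model), and the metric names here are assumptions for the example.

```python
# Illustrative SimpleQA-style scoring. The exact-match grader below is a
# toy stand-in; the real benchmark uses an LLM grader to classify answers
# as correct, incorrect, or not attempted.
from collections import Counter

def grade(predicted: str, gold: str) -> str:
    """Classify one answer as correct / incorrect / not_attempted."""
    if not predicted.strip():  # model declined to answer
        return "not_attempted"
    if predicted.strip().lower() == gold.strip().lower():
        return "correct"
    return "incorrect"

def score(pairs):
    """pairs: iterable of (predicted, gold). Returns headline metrics."""
    counts = Counter(grade(p, g) for p, g in pairs)
    total = sum(counts.values())
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "accuracy": counts["correct"] / total if total else 0.0,
        "accuracy_given_attempted": (
            counts["correct"] / attempted if attempted else 0.0
        ),
        "not_attempted_rate": counts["not_attempted"] / total if total else 0.0,
    }

print(score([("Paris", "Paris"), ("Lyon", "Paris"), ("", "Paris")]))
# accuracy ~0.33, accuracy_given_attempted 0.5, not_attempted_rate ~0.33
```

The reason to separate “incorrect” from “not attempted” is that a model which declines to answer hard questions looks very different from one that confidently guesses wrong, and that distinction matters for how we read the headline numbers.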

Note: I see that Perplexity isn’t included in the study. I’d love to know how it would have fared in the test, because Perplexity is the tool I reach for when I want to be more confident in the accuracy of an answer.

So while I think this is an interesting study that highlights the challenge of LLMs when it comes to answering hard questions correctly, I don’t think it’s the big deal that some people are making it out to be.

Yes, models still fail tests of factual accuracy. We know this, and so shouldn’t expect them to answer all of our questions correctly, as if they were what is sometimes called an Oracle AI.

Which is why I encourage people to use language models as thinking partners, where the aim is to converge on negotiated solutions to problems. Being inaccurate isn’t always an obstacle to being useful; most human conversations make progress without perfect factual accuracy.

In other words, use language models as thinking partners, not oracles.


Note:

The study also shows that AI language models significantly overestimate their own capabilities when answering questions. When the researchers asked the models to rate their confidence in their answers, the models consistently reported confidence levels well above their actual accuracy.
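As a rough illustration of that kind of calibration check, the sketch below bins answers by the model’s stated confidence and compares stated confidence to measured accuracy within each bin. The data and bin width are made up for the example; an overconfident model is one whose stated confidence sits well above its actual accuracy across bins.

```python
# Illustrative calibration check: bin answers by the model's stated
# confidence (0-100%), then compare stated confidence to actual accuracy.
from collections import defaultdict

def calibration_table(records, bin_width=25):
    """records: iterable of (stated_confidence_pct, was_correct)."""
    bins = defaultdict(list)
    for conf, correct in records:
        # Clamp a stated 100% into the top bin.
        bins[min(conf // bin_width, 100 // bin_width - 1)].append(correct)
    for b in sorted(bins):
        hits = bins[b]
        lo, hi = b * bin_width, (b + 1) * bin_width
        print(f"stated {lo:3d}-{hi:3d}%: actual accuracy "
              f"{sum(hits) / len(hits):.0%} over {len(hits)} answers")

# Hypothetical, overconfident model: stated confidence far above accuracy.
calibration_table([(90, False), (95, True), (80, False), (60, False), (99, True)])
```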

This sounds a bit like humans (see the Dunning-Kruger effect, where the least competent are often the most confident in their abilities).

