
OpenAI’s o3 model scores lower than initially promised


Photo: Koshiro K/Shutterstock.com

A discrepancy was found between the first-party and third-party benchmark of OpenAI’s o3 AI model, raising questions about the company’s model testing processes and transparency.

The AI company had previously claimed that the model could answer more than one-fourth of the questions on FrontierMath, a notoriously difficult set of math problems. When the model was unveiled in December, that score far outpaced the competition: the next-best model answered only 2% of the problems correctly.

Mark Chen, the chief research officer at OpenAI, said, “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.” That figure was most likely an upper bound achieved by a version of the o3 model with more computing power than the one OpenAI released publicly on April 16.

Epoch AI, the research institute behind FrontierMath, released independent benchmark results showing the o3 model answering only approximately 10% of the problems correctly. Epoch AI noted that differences in the testing setup could explain the lower-than-promised score; an updated version of FrontierMath was also used for the evaluations.

The third-party benchmark results for OpenAI’s models compared to other AI models. | Source: Epoch AI

Wenda Zhou, a member of OpenAI’s technical staff, said that the production version of the o3 model is optimised for real-world use cases and speed rather than benchmark performance, unlike the demo version showcased in December 2024, and that these differences can produce performance disparities. Zhou said, “[W]e’ve done [optimisations] to make the [model] more cost efficient [and] more useful in general.”

Despite the public release scoring lower than the company initially implied, OpenAI’s o3-mini-high and o4-mini models performed better than o3 on FrontierMath, and the company plans to introduce more powerful versions of o3, including o3-pro, in the next few weeks.

Recently, Elon Musk’s xAI was also accused of publishing misleading benchmark results for Grok 3, its most recent AI model. These incidents are a reminder that AI benchmarks should not be taken at face value, especially when the source is a company with a commercial interest in promoting its own models.

In the News: Russian hosting provider abused for malware delivery

Arun Maity

Arun Maity is a journalist from Kolkata who graduated from the Asian College of Journalism. He has an avid interest in music, videogames and anime. When he's not working, you can find him practicing and recording his drum covers, watching anime or playing games. You can contact him here: arunmaity23@proton.me
