Photo: Koshiro K/Shutterstock.com
OpenAI’s latest models, o3 and o4-mini, reportedly hallucinate more, at least twice as much as the company’s earlier models. The claim comes from the system card that OpenAI publishes alongside new AI models.
The report, published by OpenAI, finds that o4-mini is less accurate and hallucinates more than the o1 and o3 models. When tested on PersonQA, an internal benchmark the company uses to evaluate its models, o4-mini hallucinated in 48 percent of its responses, three times as often as o1. However, considering that o4-mini is smaller, faster, and cheaper to run than o3, it was never expected to outperform it in the first place.
The o3 model also hallucinates 33 percent of the time, twice as often as o1. It makes more claims overall than the other two models, and while it is the most accurate of the three, that higher volume produces more hallucinated claims as well as more accurate ones. OpenAI says that “more research is needed to understand the cause of this result.”
Hallucinations tend to arise from inaccuracies in a model’s training data. Reducing those inaccuracies can lower the chance of a model hallucinating, but it isn’t a permanent fix. Fact-checking AI output is also difficult, since it demands human-like cognitive skills.

Regardless, AI models are expected to improve on every front, including hallucination rates, with each generation, which makes the system card results for o3 and o4-mini unexpected. Both are also reasoning models, meaning they show the user the steps they take while processing each prompt. A separate report from AI research lab Transluce claims that o3 often presents fake actions to users, such as claiming to have run Python code in a coding environment even though it cannot do so.
To make matters worse, the Transluce report also claims that o3 justifies its false claims and actions when questioned by the user, going as far as claiming it used an “external MacBook Pro to perform computation” and copied the final output into ChatGPT. The report adds that these occurrences are more common in o-series models such as o1, o3-mini, and o3 than in the GPT lineup, including GPT-4.1 and GPT-4o.
OpenAI doesn’t seem to be doing much about the issue, other than plastering an accuracy warning over the ChatGPT prompt box. For the best results, the responsibility still falls on you, the user, to fact-check any AI model’s output, ChatGPT included.