Hallucination Rates Spike in OpenAI’s o3 & o4-Mini Models

OpenAI’s newly released AI models, o3 and o4-mini, are hallucinating at significantly higher rates than previous versions, despite being designed for advanced reasoning tasks.

Internal tests show o3 fabricates information on 33% of questions in PersonQA, OpenAI’s benchmark of factual questions about people, while o4-mini gets it wrong nearly half the time. The company says it doesn’t yet know why.

The models, launched as part of OpenAI’s next-generation reasoning suite, were designed to deliver smarter, context-aware results. However, they have raised a new reliability concern for researchers, developers, and enterprise users alike.

As hallucinations surge, experts warn these models may be too unpredictable for real-world deployment, particularly in industries where accuracy is non-negotiable.

The Paradox at the Heart of OpenAI’s Reasoning Models

The launch of o3 and o4-mini was supposed to mark a leap forward in artificial intelligence. Positioned as successors to OpenAI’s earlier reasoning models (like o1 and o3-mini), these systems are designed to handle more complex tasks with better contextual understanding, especially in areas like coding, math, and problem decomposition.

But while they may have improved at executing certain logic-driven functions, their overall factual reliability has taken a hit.

On PersonQA, one of OpenAI’s in-house benchmarks that tests how well a model can recall facts about individuals, o3 hallucinated on 33% of questions. That’s more than double the hallucination rate of its predecessors: o1 scored 16% and o3-mini 14.8%.

The o4-mini model did even worse, generating incorrect information in 48% of PersonQA cases.
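
To make the size of the jump concrete, the short sketch below computes how many times higher each new model's PersonQA hallucination rate is than each predecessor's. It uses only the four percentages quoted above; the `rates` dictionary and the `ratio` helper are illustrative names, not part of any OpenAI tooling.

```python
# PersonQA hallucination rates as quoted in this article (percent of questions).
rates = {"o1": 16.0, "o3-mini": 14.8, "o3": 33.0, "o4-mini": 48.0}


def ratio(newer: str, older: str) -> float:
    """Return how many times higher the newer model's rate is than the older one's."""
    return rates[newer] / rates[older]


for newer in ("o3", "o4-mini"):
    for older in ("o1", "o3-mini"):
        print(f"{newer} vs {older}: {ratio(newer, older):.2f}x")
```

By this arithmetic, o3 hallucinates roughly 2.1 to 2.2 times as often as its predecessors on PersonQA, and o4-mini roughly 3.0 to 3.2 times as often.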

This is not just a statistical anomaly. It’s a systemic regression in one of the most fundamental areas of AI functionality: truthfulness.

Models That Fabricate Their Own Process

It’s not only that o3 and o4-mini are getting facts wrong. They’re also inventing entire processes.

Transluce, a nonprofit AI research lab, conducted independent tests on o3 and observed the model creating fictional narratives about how it arrived at its answers. 

In one example, o3 claimed that it had run a piece of code on a 2021 MacBook Pro “outside of ChatGPT” and then copied the result into its answer. 

While o3 does have access to some tools, executing code outside its sandboxed environment simply isn’t possible.

This kind of behavior, fabricating not just the answer but the method, adds another layer of concern. 

When users can’t even trust how a model says it reached a conclusion, the reliability of any AI-generated output comes into question.

Sarah Schwettmann, co-founder of Transluce, summarized the risk plainly: “O3’s hallucination rate may make it less useful than it otherwise would be.”

OpenAI: “More Research Is Needed”

OpenAI, for its part, isn’t denying the issue. In the technical report accompanying the model launches, the company openly noted that “more research is needed” to understand why hallucinations have increased despite progress in reasoning.

One working theory centers on how these models are trained. The o-series models undergo a form of reinforcement learning that may inadvertently amplify hallucination-prone behaviors, especially as they’re optimized to generate more detailed and confident answers.

Neil Chowdhury, a former OpenAI employee and now a researcher at Transluce, believes the reinforcement learning approach might be backfiring.

“Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,” he explained in an email to TechCrunch.

In essence, the same processes that make the models better at “thinking” may also make them more inclined to confidently fabricate.

The Real-World Risks of Confidently Wrong AI

For consumers casually asking ChatGPT for fun facts or movie trivia, a few hallucinations may seem harmless. But for businesses and professional users, the stakes are much higher.

Kian Katanforoosh, CEO of upskilling platform Workera and a Stanford adjunct professor, has been testing o3 in real-world coding environments. While he praises the model’s advanced capabilities, he notes a recurring problem: “It tends to hallucinate broken website links,” providing references that don’t exist or lead to dead ends.

That might seem minor in software development, but in other contexts—like generating legal contracts or patient information—hallucinations can result in serious harm, compliance violations, or costly errors.

Enterprises are increasingly interested in AI-powered automation, but persistent hallucination issues can be a dealbreaker. Reliability and factual consistency are not just technical challenges—they’re business imperatives.

Could Search Integration Be the Solution?

One possible way to curb hallucinations is to give models access to real-time web search. OpenAI’s GPT-4o, which includes a web browsing capability, achieves 90% accuracy on another benchmark test called SimpleQA. 

Unlike reasoning models that must “recall” facts from training data, search-enabled models can consult live information and check their answers against it in real time.

But there’s a caveat: incorporating web search opens the door to privacy concerns. Queries may be routed through third-party services, exposing sensitive data to external providers. It also introduces new engineering complexity in balancing retrieval with response fluency.

Still, many experts view this hybrid approach—known as Retrieval-Augmented Generation (RAG)—as one of the most promising paths forward in reducing hallucinations without sacrificing model sophistication.
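
The idea behind retrieval-augmented generation can be shown with a toy sketch: look up supporting text first, answer only from what was found, and decline when nothing relevant turns up. Everything here is an assumption made for illustration. `VERIFIED_DOCS` is a hypothetical two-entry corpus, the keyword-overlap scoring stands in for a real search engine or vector index, and no language model is actually called.

```python
import re
from typing import Optional, Tuple

# Hypothetical "verified" corpus; a real system would query a search API or database.
VERIFIED_DOCS = {
    "personqa": "PersonQA is an in-house OpenAI benchmark that tests recall of facts about people.",
    "simpleqa": "SimpleQA is a benchmark of short fact-seeking questions.",
}

# Common words ignored when matching, so they cannot create spurious hits.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "that", "what", "does", "about", "who"}


def tokens(text: str) -> set:
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, min_overlap: int = 2) -> Optional[Tuple[str, str]]:
    """Return (doc_id, text) for the best keyword match, or None if the match is too weak."""
    q = tokens(question)
    best_id, best_score = None, 0
    for doc_id, text in VERIFIED_DOCS.items():
        score = len(q & tokens(text))
        if score > best_score:
            best_id, best_score = doc_id, score
    if best_id is None or best_score < min_overlap:
        return None
    return best_id, VERIFIED_DOCS[best_id]


def answer(question: str) -> str:
    """Answer only from retrieved text; decline instead of guessing."""
    hit = retrieve(question)
    if hit is None:
        return "No supporting source found; declining to answer."
    doc_id, text = hit
    # A real pipeline would pass `text` to the model as context; this sketch just quotes it.
    return f"{text} [source: {doc_id}]"


print(answer("What does the PersonQA benchmark test?"))
print(answer("Who won the 1998 World Cup?"))
```

The first question is answered with a citation to the matching document, while the second is declined because nothing in the corpus supports it. That refusal path is the point of the design: the system has somewhere to go other than a confident guess.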

A Broader Challenge for the AI Industry

The hallucination dilemma arises at a crucial juncture in AI advancement.

In the last year, the industry has predominantly redirected its attention from expanding model sizes to enhancing reasoning—a strategy designed to improve performance while using fewer computational resources.

Reasoning models like o3 and o4-mini reflect this shift, promising better task execution, even with smaller model sizes. But as the latest results show, this shift may come with unforeseen trade-offs.

If increasing a model’s ability to reason simultaneously decreases its ability to stay tethered to facts, the industry will face a hard choice: pursue logic at the cost of truth, or rethink how we build and train AI altogether.

As OpenAI spokesperson Niko Felix told TechCrunch, “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”

But for now, the problem remains unsolved and may be getting worse.

What Users Should Do Now

While model developers work on long-term fixes, users can take immediate steps to minimize risks:

  1. Always fact-check AI outputs, especially when using models in legal, medical, or technical environments.
  2. Use models with web access when real-time accuracy is more important than prompt privacy.
  3. Implement layered review systems, such as human-in-the-loop oversight, for critical outputs.
  4. Favor retrieval-augmented tools that combine generative AI with verified databases.
  5. Monitor model updates closely, as patch releases may improve accuracy and reduce hallucinations over time.
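
Steps 1 and 3 above can be combined into a simple triage rule: send an output to a human reviewer if it belongs to a high-stakes domain or contains links nobody has verified. The sketch below is one possible shape for that rule, not an established tool. `triage`, `ReviewDecision`, and the `link_ok` callback are hypothetical names, and the default `link_ok` deliberately treats every link as unverified until a real checker (for example, one that issues an HTTP request) is plugged in.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

# Domains named in step 1 of the list above.
HIGH_STAKES = {"legal", "medical", "technical"}
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


@dataclass
class ReviewDecision:
    needs_human: bool
    reasons: List[str] = field(default_factory=list)


def triage(
    output: str,
    domain: str,
    link_ok: Callable[[str], bool] = lambda url: False,
) -> ReviewDecision:
    """Flag a model output for human review and record why it was flagged."""
    reasons = []
    if domain in HIGH_STAKES:
        reasons.append(f"high-stakes domain: {domain}")
    for raw in URL_PATTERN.findall(output):
        url = raw.rstrip(".,;:")  # drop trailing sentence punctuation
        if not link_ok(url):
            reasons.append(f"unverified link: {url}")
    return ReviewDecision(needs_human=bool(reasons), reasons=reasons)


decision = triage("See https://example.com/missing for details.", "marketing")
print(decision.needs_human, decision.reasons)
```

Here the output is flagged even though the domain is low-stakes, because the link it cites has not been checked. Supplying a real `link_ok` function would let verified links pass while still routing legal, medical, and technical outputs to a reviewer.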

Key Takeaways

  • New models aren’t always more accurate – despite better reasoning, hallucinations are increasing.
  • Fabricated processes are a growing issue – in independent tests, o3 invented steps it claimed to have taken to reach an answer.
  • Businesses must tread carefully – factual errors make these models risky for critical use cases.
  • Search and RAG may help – integrating live data sources can reduce reliance on flawed memory.
  • Transparency is essential – users must know what a model can and can’t do to use it responsibly.
Dileep Thekkethil

Dileep Thekkethil is the Director of Marketing at Stan Ventures, where he applies over 15 years of SEO and digital marketing expertise to drive growth and authority. A former journalist with six years of experience, he combines strategic storytelling with technical know-how to help brands navigate the shift toward AI-driven search and generative engines. Dileep is a strong advocate for Google’s EEAT standards, regularly sharing real-world use cases and scenarios to demystify complex marketing trends. He is an avid gardener of tropical fruits, a motor enthusiast, and a dedicated caretaker of his pair of cockatiels.
