By Michael Babalola
June 28, 2025
In a move that’s shaking the foundations of the AI world, Apple has released a research paper that questions the very core of what we think AI can do. Titled “The Illusion of Thinking,” the paper challenges the assumption that today’s most powerful AI models — including OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude — are capable of real reasoning.
According to the report, these AI systems aren’t truly “thinking.” Instead, they’re mimicking reasoning by memorizing and matching patterns — and Apple’s experiments back this up in a way that’s hard to ignore.
Breaking Down the Bombshell
To test the limits of large language models (LLMs), Apple researchers designed a series of puzzles divided into three levels of difficulty:
- Easy
- Medium
- Hard
These weren’t your average math questions or trivia prompts. Apple chose carefully controlled logic puzzles — including the Tower of Hanoi, River Crossing, Blocks World, and Checker Jumping — which let the researchers scale complexity up systematically while limiting contamination from training data.
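To get a feel for how sharply these puzzles scale, consider the Tower of Hanoi: the minimum solution length for n disks is 2^n - 1, so each extra disk roughly doubles the number of moves a model must get right in sequence. The sketch below is illustrative only; the disk counts assigned to each difficulty tier are hypothetical, not the paper's exact settings.

```python
# Illustrative sketch: how Tower of Hanoi difficulty grows with disk count.
# The minimum number of moves for n disks is 2**n - 1, so each added disk
# roughly doubles the length of a correct solution. The disk counts per
# difficulty tier below are hypothetical, not the paper's exact settings.

def min_hanoi_moves(num_disks: int) -> int:
    """Minimum moves needed to solve Tower of Hanoi with num_disks disks."""
    return 2 ** num_disks - 1

for label, disks in [("easy", 3), ("medium", 7), ("hard", 12)]:
    print(f"{label:>6}: {disks:>2} disks -> {min_hanoi_moves(disks):>4} moves minimum")

# Output:
#   easy:  3 disks ->    7 moves minimum
# medium:  7 disks ->  127 moves minimum
#   hard: 12 disks -> 4095 moves minimum
```

In other words, "hard" doesn't mean conceptually deeper; it means many more steps where a single slip ruins the answer, which is exactly the regime where the models collapsed.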
The results were revealing:
- At the easy level, models like ChatGPT and Gemini performed well, solving the puzzles accurately.
- At the medium level, performance started to decline, with frequent mistakes and breakdowns in logic.
- But at the hard level — when the puzzle required true multi-step reasoning — performance collapsed entirely.
One might assume that these models could recover if given step-by-step instructions or the solution itself. But here’s where it gets alarming.
Even With the Right Answer, AI Failed
Apple gave some models the correct algorithm to solve complex problems like the Tower of Hanoi — and yet they still failed to execute the steps properly. One model that successfully completed a 100-step Tower of Hanoi challenge later failed a 4-step River Crossing puzzle.
How is that possible?
Apple researchers suggest that the more complex puzzle was likely included in the model’s training data — while the simpler, unfamiliar River Crossing puzzle wasn’t. In other words, the model wasn’t solving the problem — it was recalling a memorized solution.
This inconsistency reveals a deeper issue: AI models often appear intelligent only when the problem looks like something they’ve seen before.
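For context, the "correct algorithm" for the Tower of Hanoi is the textbook recursive procedure. Here it is in Python purely for illustration; the paper supplied the algorithm to the models as prompt text, and this sketch is not Apple's exact formulation.

```python
# Textbook recursive Tower of Hanoi solution, shown here for illustration.
# Apple provided the algorithm to models as prompt text; this sketch is not
# their exact formulation.

def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Append the full move sequence for n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # move n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # restack n-1 disks on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves[:3])  # 7 [('A', 'C'), ('A', 'B'), ('C', 'B')]
```

Executing this reliably for a large disk count demands nothing clever, just faithful bookkeeping across hundreds of moves, and that is precisely where the models broke down even with the recipe in hand.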
The “Illusion” in Full View
The research outlines three key performance regimes:
- Low Complexity – Standard LLMs outperform even reasoning-enhanced models, because extra reasoning isn’t needed.
- Medium Complexity – Some reasoning models show improvement by applying step-by-step logic.
- High Complexity – All models fail — accuracy drops to near zero, even when computational resources are sufficient.
In some cases, models appeared to “give up” — reducing their reasoning effort even though they had enough token budget left to continue. This behavior wasn’t just inefficient. It was disturbing.
The paper concludes that these models do not engage in human-like reasoning. Instead, they produce fluent, confident outputs through pattern completion — an impressive trick, but not the same as true understanding.
Implications: Rethinking the AI Narrative
Apple’s findings have sparked fierce debate across the AI community, especially because of their timing — just ahead of Apple’s WWDC 2025, where the company unveiled a comparatively cautious set of updates to its “Apple Intelligence” suite. Some see the paper as a subtle critique of AI hype from rivals like Google and OpenAI.
Others see it as a necessary reality check.
“If your AI assistant can solve a 100-step problem it memorized, but fails a 4-step problem it hasn’t seen, that’s not intelligence. That’s smoke and mirrors.”
— AI Researcher, commenting on Hacker News
A New Path Forward?
While Apple’s paper stops short of offering definitive solutions, it signals the need for:
- New benchmarks that avoid contamination from training data.
- Hybrid models that combine symbolic reasoning with language models.
- Less anthropomorphizing of AI systems — and more transparency.
Bottom Line
Apple’s “The Illusion of Thinking” doesn’t just raise academic questions. It exposes a fundamental misunderstanding in how we interpret AI capabilities today. For all their fluency and coherence, today’s models are not reasoning — they’re replaying.
As the AI industry races ahead with smarter assistants, autonomous agents, and AGI speculation, Apple’s warning is clear:
The biggest lie in AI might be that it’s thinking at all.
