Sarah Connor’s warning might have been right, but it is still too early to tell.

Last year, it seemed like traditional large language models (LLMs) were about to hit a wall, until "reasoning models" like OpenAI's o1 and DeepSeek's R1 arrived. These models surprised the world by breaking challenging problems into clear, logical steps. Many believed these large reasoning models (LRMs) were the start of machines that could think and reason their way to new discoveries, even about things they'd never seen before.
However, a new study from Apple, "The Illusion of Thinking," challenges that view. The researchers put reasoning and standard ("non-thinking") models through a series of controlled puzzles, dialing the complexity from simple to mind-bending.

The results? Eye-opening.

On the simplest puzzles, non-reasoning models actually performed better: they were less prone to "overthinking." On moderately challenging puzzles, reasoning models shone, using detailed step-by-step thinking to pull ahead. But at the highest complexity levels, even the best reasoning models buckled. They didn't just get answers wrong; they effectively gave up, spending fewer "thinking tokens" (the AI equivalent of mental effort) as the puzzles got harder, even with token budget to spare. Adding more inference compute didn't help.
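For a sense of scale: one of the puzzle environments in the Apple paper is the classic Tower of Hanoi, where the shortest correct solution grows exponentially with the number of disks. The Python sketch below is purely illustrative (the function name and the range of disk counts are my own choices, not the paper's setup); it just shows how quickly the length of a correct solution trace explodes as complexity is dialed up.

```python
# Illustrative sketch only: how puzzle "complexity" can explode.
# The Tower of Hanoi requires a minimum of 2**n - 1 moves for n disks,
# so each added disk roughly doubles the length of a correct solution.

def min_hanoi_moves(num_disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with num_disks disks."""
    return 2 ** num_disks - 1

for n in range(1, 11):
    print(f"{n:>2} disks -> {min_hanoi_moves(n):>4} moves minimum")
```

Even a modest bump from 5 to 10 disks takes the shortest solution from 31 moves to 1,023, which is the kind of scaling that separates the "moderate" regime, where reasoning models shine, from the regime where they collapse.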

What does this me