A recently published Apple Machine Learning Research study sheds new light on AI reasoning models like OpenAI’s o1 and Anthropic’s “thinking” Claude variants, suggesting that the models might not actually be reasoning at all.
According to MacRumors, all tested reasoning models—including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet—experienced what the researchers describe as a “complete accuracy collapse” beyond certain complexity thresholds, dropping to a zero percent success rate. The study also found that the models reduced their reasoning effort as problems became more complex.
Even when researchers supplied the models with complete solution algorithms, the models still failed at the exact same points. The researchers say this indicates the problem lies in executing basic logical steps, not in problem-solving strategy. The models were also inconsistent, sometimes succeeding on puzzles that required more than 100 steps while failing on others that required only 11.
Apple researchers designed controllable puzzle environments that let them analyze both the final answers and the models’ internal reasoning traces while avoiding data contamination, and track performance across a range of complexity levels.
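To illustrate the idea (this is a minimal sketch, not Apple’s actual test harness), Tower of Hanoi, one of the puzzle families the paper used, makes a natural controllable environment: the number of disks acts as a complexity dial, and a simple simulator can check every move a model proposes, not just its final answer.

```python
# Minimal sketch of a "controllable puzzle environment": Tower of Hanoi,
# where the disk count n is the complexity knob and a simulator verifies
# each intermediate move a model proposes -- not just the final state.

def solve_hanoi(n, src=0, aux=1, dst=2):
    """Ground-truth move list for n disks; its length grows as 2^n - 1."""
    if n == 0:
        return []
    return (solve_hanoi(n - 1, src, dst, aux)
            + [(src, dst)]
            + solve_hanoi(n - 1, aux, src, dst))

def verify_moves(n, moves):
    """Replay a proposed move sequence. Returns the index of the first
    illegal move, or -1 if the sequence legally solves the puzzle."""
    pegs = [list(range(n, 0, -1)), [], []]  # disk n at the bottom of peg 0
    for i, (src, dst) in enumerate(moves):
        # Illegal: moving from an empty peg, or onto a smaller disk.
        if not pegs[src] or (pegs[dst] and pegs[dst][-1] < pegs[src][-1]):
            return i
        pegs[dst].append(pegs[src].pop())
    return -1 if pegs[2] == list(range(n, 0, -1)) else len(moves)

# Sweep complexity: the same puzzle family, just with more disks.
for n in range(3, 11):
    moves = solve_hanoi(n)  # stand-in for a model's proposed answer
    assert verify_moves(n, moves) == -1
    print(f"n={n}: {len(moves)} moves required")
```

Because every intermediate move is checked, a setup like this can pinpoint exactly where in a solution a model breaks down, which is how the kind of step-level failure described above can be observed.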
As MacRumors notes, the researchers conclude that current large reasoning models (LRMs) rely on sophisticated pattern matching rather than genuine reasoning. The findings also suggest that large language models (LLMs) don’t scale their reasoning effort the way humans do; instead, they overthink easy problems and think too little on hard ones.
This paper was published just days before Apple’s WWDC event, where the company made a big deal about new AI capabilities coming to its products.
Source: MacRumors
