Apple’s Study Questions LLM AI Models’ Ability to “Reason”


A new report from Apple’s artificial intelligence team reveals what they see as significant flaws in the reasoning capabilities of large language models (LLMs) like those developed by Meta and OpenAI. Their report questions whether these AI engines possess the critical thinking required to solve problems reliably, which raises concerns about their use in real-world applications.

In demonstrations, the new models have shown great skill at handling the type of logic problems that have in the past defeated these models. For example, one famous test involves a hunter, a wolf, a sheep and a cabbage. In this puzzle, the hunter has a boat and can only travel with one of the other three. Even some humans struggle to solve it, because it forces a shift in logic – the answer is that the hunter has to make a trip with the sheep, then the wolf, then take the sheep back, fetch the cabbage, and on a last, final trip, take the sheep across again. Once the solution is described, it seems obvious.

These puzzles are part of the question sets used to evaluate LLMs and, at least until recently, they usually defeated the models.

We ran that test past ChatGPT 4o and it handled it quickly. Here’s the solution:

To solve this classic river crossing puzzle, the hunter needs to carefully plan the sequence of trips to ensure that neither the sheep eats the cabbage nor the wolf eats the sheep when left alone. Here’s a step-by-step guide on how the hunter can safely transport all three:

  1. First Trip: The hunter takes the sheep across the river and leaves it on the other side.
  2. Return Trip: The hunter returns alone to the original side.
  3. Second Trip: The hunter takes the wolf across the river.
  4. Return Trip: The hunter leaves the wolf on the other side and takes the sheep back to the original side.
  5. Third Trip: The hunter takes the cabbage across the river.
  6. Return Trip: The hunter returns alone to the original side.
  7. Fourth Trip: The hunter takes the sheep across the river again.

This sequence ensures that:

  • The sheep is not left alone with the cabbage (thus the cabbage is not eaten).
  • The wolf is not left alone with the sheep (thus the sheep is not eaten).

Each crossing is carefully planned so that the hunter always prevents any danger of one eating the other by never leaving them alone together without his presence.
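
The puzzle is small enough to verify mechanically. As an illustration of our own (a minimal sketch, not code from the Apple study), the following Python program finds the crossing sequence by breadth-first search over the sixteen possible bank assignments:

```python
from collections import deque

# State: bank (0 = start, 1 = far) of (hunter, wolf, sheep, cabbage).
START, GOAL = (0, 0, 0, 0), (1, 1, 1, 1)
NAMES = ("hunter", "wolf", "sheep", "cabbage")

def is_safe(state):
    hunter, wolf, sheep, cabbage = state
    # The wolf eats the sheep, and the sheep eats the cabbage,
    # whenever that pair shares a bank without the hunter.
    return not (wolf == sheep != hunter or sheep == cabbage != hunter)

def neighbors(state):
    hunter = state[0]
    # The hunter crosses alone, or with one passenger from his bank.
    for passenger in (None, 1, 2, 3):
        if passenger is not None and state[passenger] != hunter:
            continue
        nxt = list(state)
        nxt[0] = 1 - hunter
        if passenger is not None:
            nxt[passenger] = 1 - hunter
        nxt = tuple(nxt)
        if is_safe(nxt):
            yield passenger, nxt

def solve():
    # Breadth-first search guarantees the shortest crossing sequence.
    queue, seen = deque([(START, [])]), {START}
    while queue:
        state, path = queue.popleft()
        if state == GOAL:
            return path
        for passenger, nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                move = "alone" if passenger is None else "with the " + NAMES[passenger]
                queue.append((nxt, path + [move]))

for step, move in enumerate(solve(), 1):
    print(f"Trip {step}: the hunter crosses {move}")
```

Run as written, the search reproduces the seven-trip sequence above: sheep over, return alone, wolf over, sheep back, cabbage over, return alone, sheep over.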

With LLMs now solving these and similar problems, it would seem that they are capable of logic.

Not so fast, say the researchers at Apple.

The researchers introduced a new benchmark, GSM-Symbolic, designed to evaluate the reasoning skills of LLMs. In their experiments, they discovered that slight changes in the wording of queries could drastically alter the models’ responses, even when those changes should not logically affect the outcome. This inconsistency points to a fundamental weakness in the reasoning abilities of LLMs, which often rely on sophisticated pattern recognition rather than true understanding.

One illustrative example from the study involved a basic math word problem featuring a character named Oliver. The problem stated that Oliver picked 44 kiwis on Friday, 58 on Saturday, and on Sunday double the amount picked on Friday. It then added an irrelevant detail – that five of the kiwis picked on Sunday were smaller than average – before asking how many kiwis Oliver had in total. The extra detail about the size of the kiwis should not affect the solution, but both OpenAI’s and Meta’s models subtracted the five smaller kiwis from the total, demonstrating a lack of logical consistency.

The study also found that simply changing the numerical values in questions or adding irrelevant contextual information led to varying results, which should not happen in a model capable of true reasoning. In one instance, adding an unrelated sentence to a math question reduced the accuracy of the answer by up to 65%. This fragility underscores that LLMs primarily operate based on pattern matching rather than actual logical deduction.
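
To make the idea concrete, here is a small sketch of our own in the spirit of GSM-Symbolic (the template, names and helper below are a hypothetical illustration, not Apple’s actual benchmark code): the same word problem is instantiated with different numbers and an optional irrelevant clause, and the expected answer is computed from the template itself so any deviation by a model can be detected.

```python
import random

# A GSM-Symbolic-style template (our hypothetical illustration): the
# numbers vary, the underlying logic does not, and the distractor
# clause is irrelevant to the answer.
TEMPLATE = (
    "Oliver picked {fri} kiwis on Friday, {sat} on Saturday, and double "
    "the amount picked on Friday on Sunday.{distractor} "
    "How many kiwis did Oliver pick?"
)
DISTRACTOR = " Five of the kiwis picked on Sunday were smaller than average."

def make_variant(seed, with_distractor=False):
    rng = random.Random(seed)
    fri, sat = rng.randint(10, 90), rng.randint(10, 90)
    question = TEMPLATE.format(
        fri=fri,
        sat=sat,
        distractor=DISTRACTOR if with_distractor else "",
    )
    expected = fri + sat + 2 * fri  # the distractor must not change this
    return question, expected

# Generate matched pairs of variants; a genuine reasoner should score
# equally well with and without the irrelevant clause.
for seed in range(3):
    question, expected = make_variant(seed, with_distractor=True)
    print(expected, "<-", question)
```

A model that reasons from first principles should score identically on the plain and distractor variants; Apple’s finding is that accuracy falls sharply – by up to 65% in one case – once the irrelevant sentence appears.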

The researchers concluded that, given these limitations, LLMs cannot yet be trusted for applications requiring formal reasoning.

As Apple’s team put it, the models’ behavior is better described as “sophisticated pattern matching” – the models have, in effect, learned these solutions from their training data. This, they claim, is demonstrated by how easily the models’ responses are swayed by irrelevant changes in phrasing or context.

That is the essence of the Apple study, which we have reported on. We tried the same approach with ChatGPT 4o: it solved the problems without difficulty, displayed its logic each time, and was exactly right. We even tried creating our own problem by changing all the elements of the river crossing – substituting a ladder, a cat, a flightless bird and someone named John. It solved the problem without issue.

So, while we respect the issues that Apple’s research has raised, we were unable to duplicate their findings.

Here is a link to the Apple study.

Below is our repeat of the kiwi test:

Oliver picked 44 kiwis on Friday, 58 on Saturday, and double the amount picked on Friday on Sunday. Five of the kiwis picked on Friday were smaller than average. How many kiwis did Oliver pick?

To find the total number of kiwis Oliver picked over the weekend, we need to sum up the number of kiwis picked each day.

  1. Number of kiwis picked on Friday: 44
  2. Number of kiwis picked on Saturday: 58
  3. Number of kiwis picked on Sunday: Double the amount picked on Friday = 2 × 44 = 88

Now, add these quantities to find the total:

Total number of kiwis picked = 44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190

Therefore, Oliver picked a total of 190 kiwis over the three days.
