The Hidden Flaw in LLM Reasoning (And How I'm Trying to Fix It)
Why Can't AI Think Like Humans Yet, Even When We Teach Them to Show Their Work?
For the last few days, I've been working with smaller LLMs such as Llama and Phi-4, trying to get them to reason through problems. One common way to do this is called "Chain of Thought Supervised Fine-tuning" (CoT-SFT). Think of it like showing the AI many examples of how humans think through a problem step by step, and then letting it practice (fine-tuning). We provide the model with a dataset full of these thought processes, hoping it will learn to mimic that way of thinking.
This method does improve things, especially for these smaller models. However, it's not a magic solution. While the models can get good at looking like they're thinking, it turns out they often just fall back on their initial "gut feeling" or internal biases. To improve this, we need to go deeper and change how these models decide what to say next, so that we enhance their actual reasoning ability rather than just its appearance.
The Problem: First Impressions Aren't Always Right
Imagine you ask the AI a multiple-choice question. Even if it "reads" and "understands" the question, if its first instinct is that answer "A" is correct, it will tend to try to justify that answer, even if it's wrong. It will produce a chain of reasoning that looks logical on the surface, but it's often a flawed "patch" to support its initial guess rather than a genuine exploration of all the options.
This is because even if a well-trained model has the correct answer somewhere in its vast network of parameters (think of them as connections in its "brain"), it might not be able to access it easily. Instead, it latches onto the first thing that comes to mind.
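You can see this "first instinct" for yourself by looking at the probability the model assigns to each answer letter as the very next token, before it has written a single word of reasoning. The sketch below is not my exact setup, and the checkpoint name is just a placeholder for whichever small model you want to poke at:

```python
# A minimal sketch (not my exact setup): peek at a model's "first instinct" on a
# multiple-choice question by reading the probability it assigns to each option
# letter as the very next token. The model name below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Question: Which planet is closest to the Sun?\n"
    "A) Venus  B) Mercury  C) Mars  D) Earth\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token only
probs = torch.softmax(logits, dim=-1)

# Probability mass the model puts on each option letter *before* any reasoning.
for option in ["A", "B", "C", "D"]:
    token_id = tokenizer.encode(" " + option, add_special_tokens=False)[0]
    print(option, f"{probs[token_id].item():.3f}")
```

Whatever letter dominates here is the "gut feeling" the chain of thought will then tend to rationalize.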
The Solution: Making the AI Consider Multiple Perspectives (and Increase Think Time)
In the video below, I demonstrate my solution using a modified version of Llama, called Llama 3.2 MedIT 3B o1.
This model has some key improvements:
Multiple "Thinkers": I've changed the Llama architecture so that it can essentially have multiple "thinker" modules working on the problem simultaneously. This increases the "Test-Time Compute" (TTC), which basically means the model does more work while it's answering.
Better Decision-Making: These "thinkers" don't just work independently. They compare their internal "beliefs" about what the right answer might be, and the model then chooses the next word based on which "thinker" has the strongest, most confident reasoning (there's a simplified sketch of this decision rule right after this list).
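To make that concrete, here is a simplified sketch of the decision rule only, not my actual modified Llama architecture. Assume K "thinkers" each produce their own next-token distribution; confidence is measured by the entropy of each distribution, and the most confident thinker picks the token:

```python
# Simplified sketch of the "most confident thinker wins" rule, not the real
# modified Llama architecture. Each thinker contributes its own next-token
# logits; lower entropy = more confident.
import torch

def select_next_token(thinker_logits: torch.Tensor) -> int:
    """thinker_logits: tensor of shape (K, vocab_size), one row per thinker."""
    probs = torch.softmax(thinker_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # shape (K,)
    best_thinker = int(entropy.argmin())      # lowest entropy = most confident
    return int(probs[best_thinker].argmax())  # that thinker's top token

# Toy usage: 3 thinkers over a 5-token vocabulary.
logits = torch.tensor([
    [1.0, 1.1, 0.9, 1.0, 1.0],   # unsure
    [0.2, 4.0, 0.1, 0.0, 0.3],   # confident about token 1
    [2.0, 1.5, 1.8, 1.9, 1.7],   # mildly prefers token 0
])
print(select_next_token(logits))  # -> 1
```

In the real model the thinkers share most of their weights and compare beliefs at every decoding step, but the core idea is the same: don't let a single lukewarm guess steer the whole answer.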
An Example: Counting the 'R's
To illustrate this, I ask the model to count the number of "R"s in the word "strawberry." As a tricky twist, I also prompt it to count the "R"s in "rover." This challenges the model to be precise and not jump to conclusions.
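If you want to try a similar prompt yourself, something like the following works with any small instruct model on Hugging Face (the repo id below is a stand-in, not the exact checkpoint from the video):

```python
# Hedged sketch for reproducing the counting prompt; the model id is a placeholder.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")

prompt = 'How many "R"s are in the word "strawberry"? And how many in "rover"?'
result = generator(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```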
In short: We're trying to teach the AI to not just go with its first guess, but to consider multiple lines of reasoning and then choose the best one. This is a step towards making AI models that can truly reason and solve problems, not just imitate the appearance of thinking.
For the Geeks:
CoT-SFT: Chain of Thought Supervised Fine-tuning - the process of training a model on datasets that show step-by-step reasoning.
TTC: Test-Time Compute - the computational resources used when the model is being tested (as opposed to during training).
Internal Beliefs: This refers to the model's internal representation of information and its confidence in different possibilities.
We are focusing on models with fewer than 14 billion parameters. (Larger models may behave differently, but I haven't been able to test them due to limited computing resources.)
The model repeats and patches its first thought; it tends to pick the next token that, on average, confirms its internal belief.
Feel free to ask if you'd like more details on any of this!
Enjoy 🤗
PS: Gemini 2.0 Advanced helped me rewrite my article to be more understandable for a broader audience. I'm a poor writer, so please don't judge me!