A New Way to Guide LLM Reasoning
There has been a lot of speculation about how the o1 (and now o3) family of models came to be. From the information OpenAI has made available, it seems like they are running some type of self-improvement loop, since inference cost seems to scale closely with performance. That is vital for our quest for AGI because it follows the most important lesson from previous successful models, like AlphaGo: to reach superhuman performance, the model should be capable of generating its own signals.
In this article, I propose an algorithm that could potentially achieve capabilities similar to those of OpenAI's recent models. That is an ambitious claim, but the approach is straightforward enough that its viability can be tested empirically: it builds on established principles and can be evaluated systematically.
Integral to this idea is this work from Google. The TL;DR is that if you sample a reasonably well trained LLM enough times, chances are high that you will eventually come up with the correct answer. This is the first insight: the LLM can be used to perform something akin to a Monte Carlo Tree Search.
Now, saying that it is possible to reach the correct answer is one thing; actually reaching it under real-world constraints is another. What we need is to bias the model toward the reasoning steps most likely to lead to the correct conclusion. How do we do that? Since an LLM can act as an unbiased estimator of the value of the current state (by counting how many times it reaches the correct conclusion when instantiated at that point), we just need to generate multiple variations of each step and keep the one that gives us the highest likelihood of getting the correct answer.
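To make that value estimate concrete, here is a minimal sketch. The helpers `sample_completion` (one LLM rollout) and `is_correct` (an answer checker against the verified solution) are hypothetical stand-ins supplied by the caller, not a real API:

```python
def estimate_state_value(question, partial_reasoning, sample_completion, is_correct, n_rollouts=100):
    """Estimate the value of a reasoning state as the fraction of rollouts
    from that state that end in the verified correct answer.

    sample_completion(question, partial_reasoning) -> str : one LLM rollout
    is_correct(completion) -> bool : checks against the known answer
    Both are caller-supplied assumptions, not a concrete API.
    """
    hits = sum(
        is_correct(sample_completion(question, partial_reasoning))
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts
```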
Let's visualize that by looking at this question from the GPQA dataset:
Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they can be clearly resolved?
Now, I can't even grasp what this question is about or how to approach it; and, honestly, neither can Claude, it seems, since it gets it wrong as well. The correct answer is 10^-4 eV, whatever that means. This is what Claude outputs if I ask it to generate reasoning steps before answering the question:
<reasoning>
To distinguish between two energy levels, we need to consider the uncertainty principle for energy and time: ΔE * Δt ≥ ħ/2 where ħ = 6.58 × 10^-16 eV·s
For each state, the energy uncertainty ΔE can be found using its lifetime Δt:
For E1: Δt = 10^-9 s, so ΔE1 ≥ (6.58 × 10^-16)/(2 × 10^-9) = 3.29 × 10^-7 eV
For E2: Δt = 10^-8 s, so ΔE2 ≥ (6.58 × 10^-16)/(2 × 10^-8) = 3.29 × 10^-8 eV
To clearly resolve the two levels, their energy difference |E2 - E1| must be larger than the sum of their individual uncertainties: |E2 - E1| > ΔE1 + ΔE2 = 3.29 × 10^-7 + 3.29 × 10^-8 = 3.62 × 10^-7 eV ≈ 3.6 × 10^-7 eV
Therefore, any energy difference larger than 3.6 × 10^-7 eV would allow us to clearly resolve these two energy levels.
</reasoning>
Based on this analysis, the energy difference between the two levels needs to be greater than 3.6 × 10^-7 eV. This means the energy difference should be at least in the order of 10^-6 eV or larger to ensure clear resolution between the two energy levels.
If given options, we should choose the smallest value that is still larger than 3.6 × 10^-7 eV to ensure clear resolution of the energy levels.
Here comes the beauty of this method. What if, instead of letting it output all the reasoning steps at once, we ask it to generate 10 ideas for the first step, instantiate the LLM 100 times following each of these 10 ideas, and then count how many of these “simulations” ended up getting the correct answer per idea?
Here is a list of possible first steps that Claude quickly generated:
To distinguish between two energy levels, we need to consider the uncertainty principle for energy and time: ΔE * Δt ≥ ħ/2 where ħ = 6.58 × 10^-16 eV·s
To clearly distinguish between two energy levels, the energy difference (ΔE) must be greater than the sum of their individual uncertainties (ΔE1 and ΔE2).
To distinguish two energy levels clearly, their energy difference (ΔE) must be greater than the uncertainty in energy (δE) of each level.
Which of these is a better first step? Well, at the risk of repeating myself, we can just let the LLM decide that for us. Say we start it from each step 100 times and get 15% correct answers from the first branch, 0% from the second, and 5% from the third. We then keep the branch with the highest score, sample the next step multiple times again, and keep going.
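Putting it together, the whole search is just a greedy loop over that value estimate. A minimal sketch, where `propose_steps` and `estimate_value` are hypothetical helpers wrapping the LLM (the latter is the rollout-counting estimator sketched earlier):

```python
def guided_search(question, propose_steps, estimate_value, n_candidates=10, max_steps=8):
    """Greedily build a reasoning chain one step at a time.

    propose_steps(question, steps_so_far, n) -> list[str] : n candidate next steps
    estimate_value(question, steps_so_far) -> float : fraction of rollouts that
        reach the correct answer from this state (e.g. 0.15 for 15/100)
    Both helpers are assumptions, not a concrete API.
    """
    steps = []
    for _ in range(max_steps):
        candidates = propose_steps(question, steps, n_candidates)
        scored = [(estimate_value(question, steps + [c]), c) for c in candidates]
        best_score, best_step = max(scored, key=lambda s: s[0])
        steps.append(best_step)
        if best_score >= 0.99:  # nearly every rollout already succeeds, so stop early
            break
    return steps
```

Note the cost implied by these numbers: with 10 candidates per step, 100 rollouts per candidate and, say, eight steps, that is on the order of 8,000 LLM calls for a single question, which is why the expense caveat in the conclusion matters.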
Now, what can we do with that?
Vanilla use case
If you are an AI practitioner and just want to use this method to improve one particular task (with a dataset of verifiable answers) that current LLMs are not getting quite right, you can generate synthetic data for each step and fine-tune the model using DPO (Direct Preference Optimization: it takes a pair of outputs in which one is preferred and maximizes the probability of generating something like the preferred one while minimizing that of the other). Using our previous example:
To distinguish between two energy levels, we need to consider the uncertainty principle for energy and time: ΔE * Δt ≥ ħ/2 where ħ = 6.58 × 10^-16 eV·s
>
To clearly distinguish between two energy levels, the energy difference (ΔE) must be greater than the sum of their individual uncertainties (ΔE1 and ΔE2).
Since the first step got 15% of answers right and the second got 0%, we use this as a training sample.
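As a sketch of how those pairs could be assembled from the scored candidates (the `prompt`/`chosen`/`rejected` field names are a common convention for preference-tuning libraries, not a fixed standard):

```python
def build_preference_pairs(question, scored_candidates, min_gap=0.10):
    """Turn scored candidate steps for one position into DPO-style pairs.

    scored_candidates: list of (score, step_text), where score is the fraction
    of rollouts that reached the correct answer from that step.
    Only keeps pairs whose scores are clearly separated (min_gap).
    """
    ranked = sorted(scored_candidates, key=lambda sc: sc[0], reverse=True)
    best_score, best_step = ranked[0]
    pairs = []
    for score, step in ranked[1:]:
        if best_score - score >= min_gap:
            pairs.append({"prompt": question, "chosen": best_step, "rejected": step})
    return pairs

# The example above: 15% vs 0% yields one (chosen, rejected) training sample.
pairs = build_preference_pairs(
    "Two quantum states with energies E1 and E2 ...",
    [(0.15, "To distinguish between two energy levels, consider ΔE · Δt ≥ ħ/2 ..."),
     (0.00, "The energy difference must be greater than the sum of the uncertainties ...")],
)
```

These pairs can then be fed to any off-the-shelf preference-optimization trainer; the exact field names may need adapting to your library of choice.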
Professional use case
The key thing here is that we have a powerful metric that tells us how good our current state is. Now, imagine using it not only to identify the best candidates for a given step, but also to train a model that offers good feedback during reasoning.
We could, in theory, train another model both to predict the value of the current state and to maximize the delta between the current and the next state by offering text guidance to the base model. If for step #1 we are getting 15% right, it is the task of this feedback model to offer hints that push the next step toward a higher probability of reaching the final correct answer.
With this feedback model in hand, we can not only fine-tune a better reasoning model, but also check, at inference time, which branches are more promising, since we will have access to this estimated state value.
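Here is a minimal sketch of the value-prediction half of that feedback model, assuming you have already logged (partial reasoning, rollout success rate) pairs from the search above; the hint-generation half is omitted. The setup (a regression head on a small pretrained encoder via Hugging Face `transformers`) is one plausible choice, not necessarily what anyone is actually doing in production:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# (state text, observed rollout success rate) pairs collected during the search.
# These two entries are illustrative placeholders.
train_data = [
    ("Q: ...\nStep 1: consider ΔE · Δt ≥ ħ/2 ...", 0.15),
    ("Q: ...\nStep 1: compare ΔE to the sum of the uncertainties ...", 0.00),
]

model_name = "distilbert-base-uncased"  # any small encoder works as a starting point
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)  # regression head

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for text, value in train_data:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        pred = model(**inputs).logits.squeeze()  # predicted state value
        loss = torch.nn.functional.mse_loss(pred, torch.tensor(value))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# At inference time, rank candidate branches by predicted value instead of
# paying for hundreds of rollouts per branch.
model.eval()
with torch.no_grad():
    scores = [
        model(**tokenizer(text, return_tensors="pt", truncation=True)).logits.item()
        for text, _ in train_data
    ]
```

In practice you would train on thousands of logged states, but the shape of the idea is the same: the expensive rollout statistic becomes a cheap learned score.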
This is what I believe OpenAI may be doing, since rumour has it that o1 and o3 run parallel branches for the same question. Is that the case? No idea. They must also be doing more sophisticated things to ensure diversity and an adequate length of reasoning branches.
Conclusion
It seems to me that this method should achieve the two most important things: scaling and self-improvement. It is expensive, though, since it requires multiple API calls for each step and a good base model with some hope of getting the correct answer. But it should be enough to inject good reasoning steps for whatever task you have at hand without the need to build an expensive dataset.