Comment by highfrequency
1 day ago
Approach is analogous to Grok 4 Heavy: use multiple "reasoning" agents in parallel and then compare answers before coming back with a single response, taking ~30 minutes. Great results, though it would be more fair for the benchmark comparisons to be against Grok 4 Heavy rather than Grok 4 (the fast, single-agent model).
Yeah, the general “discovery” is that using the same reasoning compute effort, but spreading it over multiple different agents, generally leads to better results.
It sidesteps the “longer thinking leads to worse results” problem by pursuing multiple paths of thinking in parallel, with each path just not thinking as long.
> Yeah, the general “discovery” is that using the same reasoning compute effort, but spreading it over multiple different agents, generally leads to better results.
Isn’t the compute effort N times as expensive, where N is the number of agents? Unless you meant in terms of time (and even then, I guess it’d be the slowest of the N agents).
Not exactly N times, no. In a traditional transformer arch, token 1 is cheaper to generate than token 1,000, which is cheaper than token 10,000, and so on, since each new token attends to everything before it. So running 10 chains of 1,000 tokens concurrently would be cheaper than 10,000 tokens in one session.
You also run into context issues and quality degradation the longer you go.
(This is assuming Gemini uses a traditional arch, and not something special regarding attention.)
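A back-of-the-envelope sketch of that cost argument, assuming standard quadratic self-attention and ignoring the (linear) feed-forward cost, so the real gap is smaller than shown:

```python
# Back-of-the-envelope: relative attention cost of decoding a chain of length L.
# Generating token t has to attend to the t tokens already in context, so the
# per-token cost grows with position and a chain's total is roughly L^2 / 2.

def attention_cost(chain_length: int) -> int:
    # Sum of per-token attention work, in arbitrary units.
    return sum(t for t in range(1, chain_length + 1))

one_long_chain = attention_cost(10_000)        # ~50,005,000 units
ten_short_chains = 10 * attention_cost(1_000)  # ~5,005,000 units, ~10x less

print(f"1 x 10k tokens : {one_long_chain:,}")
print(f"10 x 1k tokens : {ten_short_chains:,}")
```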
The idea is that instead of assigning 10,000 thinking tokens to one chain of thought, assigning 1,000 thinking tokens each to 10 chains of thought and composing those independent outputs into a single output yields better results.
The fact that it can be done in parallel is just a bonus.
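A minimal sketch of that budget split; sample_chain and compose_answers are hypothetical stand-ins for whatever model/API actually does the work:

```python
# Minimal sketch: split one thinking budget across parallel chains, then
# compose the independent results into a single answer.
from concurrent.futures import ThreadPoolExecutor

TOTAL_BUDGET = 10_000                    # thinking tokens
NUM_CHAINS = 10
PER_CHAIN = TOTAL_BUDGET // NUM_CHAINS   # 1,000 tokens per chain

def sample_chain(question: str, max_thinking_tokens: int) -> str:
    """Hypothetical: one independent chain of thought, capped at the given budget."""
    raise NotImplementedError

def compose_answers(question: str, candidates: list[str]) -> str:
    """Hypothetical: a final pass that compares/combines the candidate answers."""
    raise NotImplementedError

def parallel_think(question: str) -> str:
    # The chains are independent, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=NUM_CHAINS) as pool:
        candidates = list(pool.map(lambda _: sample_chain(question, PER_CHAIN),
                                   range(NUM_CHAINS)))
    return compose_answers(question, candidates)
```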
What makes you sure of that? From the article,
> Deep Think pushes the frontier of thinking capabilities by using parallel thinking techniques. This approach lets Gemini generate many ideas at once and consider them simultaneously, even revising or combining different ideas over time, before arriving at the best answer.
This doesn't exclude the possibility of using multiple agents in parallel, but to me it doesn't necessarily mean that this is what's happening, either.
What could “parallel thinking techniques” entail if not “using multiple agents in parallel”?
How can it not be exactly what’s happening?
Grok 4 Heavy's benchmarks used tools, which trivializes a lot of problems.
Dumb (?) question, but how is Google's approach here different from Mixture of Experts, where instead of training different experts to have different model weights you just count on temperature to provide diversity of thought? How much benefit is there in getting the diversity of thought from different runs of the same model versus running a consortium of different model weights and architectures? Is there a paper contrasting results, given fixed computation, between spending that compute on multiple runs of the same model vs different models?
MoE is just a way to add more parameters/capacity to a model without making it less efficient to run, since it's done in a way that not all parameters are used for each token passing through the model. The name MoE is a bit misleading, since the "experts" are just alternate paths through part of the model, not having any distinct expertise in the way the name might suggest.
Just running the model multiple times on the same input and selecting the best response (according to some judgement) seems a bit of a haphazard way of getting much diversity of response, if that is really all it is doing.
There are multiple alternate approaches to sampling different responses from the model that come to mind, such as:
1) "Tree of thoughts" - generate a partial response (e.g. one token, or one reasoning step), then generate branching continuations of each of those, etc, etc. Compute would go up exponentially according to number of chained steps, unless heavy pruning is done similar to how it is done for MCTS.
2) Separate response planning/brainstorming from response generation by first using a "tree of thoughts"-like process just to generate some shallow (e.g. depth < 3) alternate approaches, then use each of those approaches as additional context to generate one or more actual responses (to then evaluate and choose from). Hopefully this would result in some high-level variety of response without the cost of just generating a bunch of responses and hoping that they are usefully diverse.
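A rough sketch of approach 1 with beam-style pruning; propose_steps and score_state are hypothetical model calls, not any real API:

```python
# Rough sketch of a "tree of thoughts" search with beam-style pruning, so the
# branching doesn't blow up exponentially with depth.

def propose_steps(state: str, k: int) -> list[str]:
    """Hypothetical: sample k candidate continuations of a partial solution."""
    raise NotImplementedError

def score_state(state: str) -> float:
    """Hypothetical: model-graded estimate of how promising this path looks."""
    raise NotImplementedError

def tree_of_thoughts(question: str, depth: int = 3, branch: int = 4, beam: int = 5) -> str:
    frontier = [question]
    for _ in range(depth):
        # Expand every surviving partial solution by `branch` candidate steps...
        candidates = [state + "\n" + step
                      for state in frontier
                      for step in propose_steps(state, branch)]
        # ...then prune back to the `beam` most promising ones.
        frontier = sorted(candidates, key=score_state, reverse=True)[:beam]
    return max(frontier, key=score_state)
```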
Mixture of Experts isn't using multiple models with different specialties, it's more like a sparsity technique, where you massively increase the number of parameters and use only a subset of the weights in each forward pass.
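A toy illustration of that routing idea (PyTorch, arbitrary sizes, not how any production MoE is actually wired): there are many expert FFNs, but each token only passes through the top-k of them, so parameter count grows while per-token compute stays roughly flat.

```python
# Toy MoE layer: 8 expert FFNs exist, but each token is routed through only the
# top-2 of them, so parameters grow ~8x while per-token compute stays at ~2 FFNs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)        # (num_tokens, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)    # top_k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```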
Surprised no one has released an app yet that pits all the major models against each other for a final answer.
That this kind of approach works is good news for local LLM enthusiasts: it makes cloud LLMs using it more expensive, while a local LLM can do the same essentially for free up to a point. Because LLM inference is limited by memory bandwidth, not compute, you can run multiple queries in parallel on your graphics card at the same speed as a single one (until you become compute-bound, of course).
> Because LLM inference is limited by memory bandwidth, not compute, you can run multiple queries in parallel on your graphics card at the same speed as a single one
I don't think this is correct, especially given MoE. You can save some memory bandwidth by reusing model parameters, but that's about it. It's not giving you the same speed as a single query.
Wait, how does this work? If you load in one LLM of 40 GB, then loading in four more LLMs of 40 GB still takes up an extra 160 GB of memory, right?
It will typically be the same 40 GB model loaded in, but called with many different inputs simultaneously
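A minimal sketch of what that looks like with Hugging Face transformers (model name is a placeholder): one copy of the weights, several prompts batched into a single generate() call, so the extra cost is KV cache and activations rather than another 40 GB of parameters.

```python
# One copy of the weights, many prompts in a single batched generate() call.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-local-40gb-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"       # decoder-only models should be left-padded
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompts = [
    "Solve the problem by induction: ...",
    "Solve the problem geometrically: ...",
    "Solve the problem by exhaustive casework: ...",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
answers = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```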
Is o3-pro the same as these?
No, it doesn't take 30 minutes
I am surprised such a simple approach has taken so long to actually be used. My first image-description CLI attempt did basically that: use n to get several answers, then another pass to summarize them.
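Something along those lines, e.g. with the OpenAI client (model name is a placeholder; n requests several samples of the same prompt in one call):

```python
# Sample several answers to one prompt via `n`, then a second pass merges them.
from openai import OpenAI

client = OpenAI()
question = "Describe this image: ..."

draft = client.chat.completions.create(
    model="gpt-4o",          # placeholder
    messages=[{"role": "user", "content": question}],
    n=4,                     # four independent samples of the same prompt
    temperature=1.0,         # nonzero temperature so the samples actually differ
)
candidates = [choice.message.content for choice in draft.choices]

final = client.chat.completions.create(
    model="gpt-4o",          # placeholder
    messages=[{"role": "user", "content":
        "Here are several candidate answers to the same question. "
        "Merge them into one best answer:\n\n" + "\n---\n".join(candidates)}],
)
print(final.choices[0].message.content)
```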
People have played with (multi-)agentic frameworks for LLMs from the very beginning, but it seems like only now, with powerful reasoning models, is it really making a difference.
It's very resource intensive so maybe they had to wait until processes got more efficient? I can also imagine they would want to try and solve it in a... better way before doing this.
I built a similar thing around a year ago w/ AutoGen. The difference now is that models can really be steered towards "part" of the overall goal, and they actually follow that.
Before this, even the best "math" models were RL'd to death to only solve problems. If you wanted one to explore "method_a" of solving a problem, you'd be SoL. The model would start like "ok, the user wants me to explore method_a, so here's the solution: blablabla", then do whatever it wanted, unrelated to method_a.
Similar story for gathering multiple sources. Only recently can models actually pick the best thing out of many instances and work effectively at large context lengths. The previous tries with 1M context lengths were at best gimmicks, IMO. Gemini 2.5 seems like the first model that can actually do useful stuff after 100-200k tokens.
I agree, but I think it's hard to get a sufficient increase in performance to justify a 3-4x increase in cost.
It's an expensive approach, and depends on assessment being easy, which is often not the case.