Comment by tibbar
4 days ago
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent design, too. I'm genuinely looking forward to trying this out.
EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out, but it seems like xAI really has something here.
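Roughly, the parallel-then-compare idea could look something like the sketch below (call_llm is a hypothetical stand-in for whatever model API you'd use - this is a guess at the shape of the thing, not xAI's actual implementation):

    from concurrent.futures import ThreadPoolExecutor

    def call_llm(prompt: str) -> str:
        # Hypothetical placeholder - wire up your model API of choice here.
        raise NotImplementedError

    def solve_heavy(task: str, n_agents: int = 4) -> str:
        # Run several independent attempts at the same task in parallel.
        with ThreadPoolExecutor(max_workers=n_agents) as pool:
            candidates = list(pool.map(call_llm, [task] * n_agents))
        # One final call compares the candidates and keeps/merges the best.
        judge_prompt = (
            f"Task: {task}\n\nCandidate answers:\n"
            + "\n---\n".join(candidates)
            + "\n\nCompare the candidates and produce the single best answer."
        )
        return call_llm(judge_prompt)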
I can understand how/that this works, but it still feels like a 'hack' to me. It feels like the LLMs themselves are plateauing, while the applications get better by running the LLMs deeper, longer, and wider (and by adding 'non-AI' tooling/logic at the edges).
But maybe that's simply the solution, like the solution to original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.
This is exactly how human society scaled from the caveman era to today. We didn't need to make our brains bigger in order to get to the modern industrial age - increasingly sophisticated tool use and organization was all it took.
All that mattered was that human brains were just big enough to enable tool use and organization; past that threshold, extra brainpower stopped being the bottleneck. I believe LLMs are past this threshold as well (they haven't 100% matched the human brain, and maybe never will, but that doesn't matter).
An individual LLM call might lack domain knowledge, context and might hallucinate. The solution is not to scale the individual LLM and hope the problems are solved, but to direct your query to a team of LLMs each playing a different role: planner, designer, coder, reviewer, customer rep, ... each working with their unique perspective & context.
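A minimal sketch of what that could look like (the roles, prompts, and call_llm helper are all illustrative assumptions, not any particular product's API):

    def call_llm(system: str, prompt: str) -> str:
        # Hypothetical placeholder - wire up your model API of choice here.
        raise NotImplementedError

    ROLES = {
        "planner": "Break the request into concrete, ordered steps.",
        "coder": "Implement the plan. Output only code.",
        "reviewer": "Review the code against the request; list defects and fixes.",
    }

    def run_team(request: str) -> str:
        # Each call sees the same request from a different role's perspective.
        plan = call_llm(ROLES["planner"], request)
        code = call_llm(ROLES["coder"], f"Request: {request}\n\nPlan:\n{plan}")
        return call_llm(ROLES["reviewer"], f"Request: {request}\n\nCode:\n{code}")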
I get that feeling too - the underlying tech has plateaued, but now they're brute-force trading extra time and compute for better results. I don't know if that scales any better than linearly, at best. Are we going to end up with 10,000 AI monkeys on 10,000 AI typewriters and a team of a dozen monkeys deciding which one's work they like the most?
> the underlying tech has plateaued, but now they're brute force trading extra time and compute for better results
You could say the exact same thing about the original GPT. Brute forcing has gotten us pretty far.
Yes. It works pretty well.
grug think man-think also plateau, but get better with tool and more tribework
Pointy sticks and ASML's EUV machines were designed by roughly the same lumps of compute-fat :)
This is an interesting point. If this ends up working well after being optimized for scale, it could become the dominant architecture. If not, it could become another dead leaf node in the evolutionary tree of AI.
Isn't that kinda why we have collaboration and get in a room with colleagues to discuss ideas? i.e., thinking about different ideas, getting different perspectives, considering trade-offs in various approaches, etc. results in a better solution than just letting one person go off and try to solve it with their thoughts alone.
Not sure if that's a good parallel, but seems plausible.
Maybe this is the dawn of the multicore era for LLMs.
It's basically a mixture of experts, but instead of a learned operator picking the predicted best model, you use a 'max' operator across all experts.
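Roughly the difference, as a sketch (experts, gate, and score are placeholders):

    def pick_by_gate(experts, gate, x):
        # Classic MoE-style routing: a learned gate picks one expert up front.
        return experts[gate(x)](x)

    def pick_by_max(experts, score, x):
        # The approach described above: run every expert, keep the best output.
        outputs = [expert(x) for expert in experts]
        return max(outputs, key=score)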
You could argue that many aspects of human cognition are "hacks" too.
…like what? I thought the consensus was that humans exhibit truly general intelligence. If LLMs require access to very specific tools to solve certain classes of problems, then it’s not clear that they can evolve into a form of general intelligence.
They are, but I think the keyword is "generalization". Humans do very well when innovation is required, because innovation needs generalized models that can be used to make very specialized predictions and then meta-models that can predict how specialized models relate to each other and cross reference those predictions. We don't learn arithmetic by getting fed terabytes of text like "1+1=2". We only use text to communicate information, but learn the actual logic and concept behind arithmetic, and then we use that generalized model for arithmetic in our reasoning.
I struggle to imagine how much further a purely text-based system can be pushed - a system that basically knows that 1+1=2 not because it has built an internal model of arithmetic, but because it estimates that the sequence `1+1=` is mostly followed by `2`.
> Expensive and slow
Yes, but... in order to train your next SotA model you have to do this anyway and do rejection sampling to generate good synthetic data.
So if you can do it in prod for users paying $300/month, it's a pretty good deal.
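Something like this, presumably (a sketch only; sample and verify are hypothetical stand-ins for the model and whatever filter/grader you trust):

    def rejection_sample(prompt, sample, verify, k=16):
        # Generate k candidates, keep only those the verifier accepts;
        # the survivors become synthetic training pairs.
        kept = []
        for _ in range(k):
            candidate = sample(prompt)
            if verify(prompt, candidate):
                kept.append((prompt, candidate))
        return kept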
Very clever, thanks for mentioning this!
Like llm-consortium? But without the model diversity.
https://x.com/karpathy/status/1870692546969735361
https://github.com/irthomasthomas/llm-consortium
that's how o3 pro also works IMO
I can’t help but call out that o1-pro was great: it rarely took more than five minutes, and I was almost never dissatisfied with the results given the wait. I happily paid for o1-pro the entire time it was available. Now o3-pro is a relative disaster, often taking over 20 minutes just to refuse to follow directions, gaslight people about files being available for download that don’t exist, or give simplified answers after the long wait. It’s worse than useless when it actively wastes users’ time. I don’t see myself ever trusting OpenAI again after this “pro” subscription fiasco. Taking away a great model and forcing an objectively worse replacement on users is definitely going the wrong way, especially when everyone else is improving (Gemini 2.5, Claude Code with Opus, etc.). I can’t believe Meta would pay a premium to poach the OpenAI people responsible for this severe regression.
I have never had o3-pro take longer than 6-8 minutes. How are you getting it to think for 20 minutes?! My results using it have also been great, but I never used o1-pro so I don't have that as a reference point.
This is the speculation, but then it wouldn't have to take much longer to answer than o3.
Interesting. I'd guess this technique should probably work with any SOTA model in an agentic tool loop. Fun!
> I'm genuinely looking forward to trying this out.
Myself, I'm looking forward to trying it out when companies with less, um, baggage implement the same. (I have principles I try to maintain.)
I've suspected that technique could work for mitigating hallucinations, where other agents could call bullshit on a made-up source.
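Something along these lines, maybe (purely a sketch; answerer and critic are hypothetical callables):

    def answer_with_cross_check(question, answerer, critic):
        draft = answerer(question)
        # A second agent tries to falsify the draft's citations and claims.
        objections = critic(
            f"Question: {question}\n\nDraft answer:\n{draft}\n\n"
            "Flag any sources or facts that look fabricated or unverifiable."
        )
        # The original agent revises with the critic's objections in view.
        return answerer(
            f"{question}\n\nA reviewer raised these concerns:\n{objections}\n\n"
            "Revise your answer, dropping anything you cannot substantiate."
        )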
You are making the mistake of taking one of Elon's presentations at face value.
I mean, either they cheated on evals a la Llama 4, or they have a paradigm that's currently best in class in at least a few standard evals. Both alternatives are possible, I suppose.
So the progress is basically to brute force even more?
We went from "single prompt, single output", to reasoning (simple brute-forcing), and now to multiple parallel instances of reasoning (distributed brute-forcing)?
No wonder the prices are increasing and capacity is more limited.
Impressive. /s