Comment by LeafItAlone
21 hours ago
> It’s just bizarrely uncompetitive with o3-pro and Grok 4 Heavy.
In my experience Grok 4 and Grok 4 Heavy have been crap. Who cares how many requests you get when the responses are terrible? Worst LLM money I've spent this year, and I've spent a lot.
It's interesting how multi-dimensional LLM capabilities have proven to be.
OpenAI's reasoning models (o1-pro, o3, o3-pro) have been the strongest, in my experience, at harder problems, like finding race conditions in intricate concurrency code, yet they still lag behind even the initial Sonnet 3.5 release when it comes to writing basic, usable code.
The OpenAI models are kind of like CS grads who can solve complex math problems but can't write a decent React component without yadda-yadda-ing half of it, while the Anthropic models will crank out many files of decent, reasonably usable code while frequently missing subtleties and forgetting the bigger picture.
Those may have been the exact people creating training material for OpenAI…
It's just wildly inconsistent to me. Sometimes it'll produce a work of genius. Other times, total garbage.
Unfortunately, we are still in the prompt optimization stage: garbage in, garbage out.
I hear this repeated so many times that I feel like it's a narrative pushed by the sellers. A year ago you could ask for a glass of wine filled to the brim and you just wouldn't get it. It wasn't garbage in, garbage out; it was a sensible request in, garbage out.
The line where chatbots stop being sensible and start outputting garbage is moving, but more slowly than the average Joe would guess. You only notice it when you have an intuition about the answer before you see it, which takes a lot of experience across a range of complexity. Persistent newbies are the best spotters: they ask obvious basic questions while also asking for things beyond what geniuses could solve, and only by getting a garbage answer and going through the process of realizing it's actually garbage do they build a wider picture of AI than even most power users, who tend to have more balanced queries.
Maybe. That could be true.
But that doesn't happen the same way with other tools. I'll give the exact same prompt to all of the LLMs I have access to and compare the responses to find the best one. Grok is consistently the worst. So if it's garbage in, garbage out, why are the other models so much better at dealing with my garbage?
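For what it's worth, that kind of side-by-side comparison is easy to script. Here's a minimal sketch that sends one prompt to a couple of OpenAI-compatible chat endpoints and prints the replies next to each other; the base URLs, model names, and environment variable names are assumptions you'd swap for whichever providers you actually use.

    # Minimal sketch: send one prompt to several OpenAI-compatible chat
    # endpoints and print the replies side by side. Provider URLs, model
    # names, and env vars below are placeholders, not recommendations.
    import os
    import requests

    PROMPT = "Explain what a race condition is in two sentences."

    # Hypothetical provider list; adjust to whatever you actually use.
    PROVIDERS = [
        {"name": "openai", "url": "https://api.openai.com/v1/chat/completions",
         "model": "gpt-4o", "key": os.environ.get("OPENAI_API_KEY", "")},
        {"name": "xai", "url": "https://api.x.ai/v1/chat/completions",
         "model": "grok-4", "key": os.environ.get("XAI_API_KEY", "")},
    ]

    for p in PROVIDERS:
        resp = requests.post(
            p["url"],
            headers={"Authorization": f"Bearer {p['key']}"},
            json={"model": p["model"],
                  "messages": [{"role": "user", "content": PROMPT}]},
            timeout=120,
        )
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"]
        print(f"--- {p['name']} ---\n{answer}\n")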