Comment by observationist

2 days ago

Grok 4 Heavy wasn't considered in the comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating MMLU and scoring highest on many of the coding-specific benchmarks. Overall it's better than Claude 4.5, in my experience, not just on the benchmarks.

Benchmarks aren't everything, but if you're going to contrast performance against a selection of top models, then pick the top models? I've seen a handful of companies do this, including big labs, where they conveniently leave out significant competitors, and it comes across as insecure and petty.

Claude has better tooling and UX. xAI isn't nearly as focused on the app and the ecosystem of tools around it, so a lot of things end up more or less an afterthought, with nearly all the focus going toward AI development.

$300/month is a lot, and it's not as fast as other models, so it should be easy to sell GLM as almost as good as the very expensive, slow Grok Heavy.

GLM has a 128k context window, Grok 4 Heavy 256k, etc.

Nitpicking aside, the fact that they've got an open model that is just a smidge less capable than the multibillion-dollar state-of-the-art models is fantastic. We should hopefully see GLM 4.7 showing up on the private hosting platforms before long. We're still a year or two from consumer gear getting enough memory and power to handle the big models. Prosumer Mac rigs can get up there, quantized, but quantized performance is rickety at best, and at that point you're weighing the costs of self-hosting vs. private hosts vs. $200/$300 a month (+ continual upgrades).

Frontier labs only have a few years left where they can continue to charge a pile for the flagship heavyweight models; I don't think most people will be willing to pay $300 for a 5 or 10% boost over what they can run locally.

It seems like someone at xAI likes maxing benchmarks, but real-world usage shows it significantly behind frontier models.

I do appreciate their desire to be the most popular coding model on OpenRouter and to offer Grok 4 Fast for free. That's a notable step down from frontier models but fine for lots of bug fixing. I've put hundreds of millions of tokens through it.

In my experience, Grok 4 Expert performs far worse than the benchmarks suggest.

I've tried it with coding, writing, and instruction following. The only things it currently excels at are searching for things across the web and Twitter.

I would never use it for anything else. With coding, it always includes an error, and when it patches that, it introduces another one. When writing creative text that has to follow instructions, it hallucinates a lot.

Based on my experience, I suspect xAI of bench-maxing on Artificial Analysis, because there is no way Grok 4 Expert performs close to GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Pro.

Grok, in my experience, is extremely prone to hallucinations when not used for coding. It will readily claim to have access to internal Slack channels at companies, hallucinate scientific papers that do not exist, and so on, to back up its claims.

I don’t know if the hallucinations extend to code, but it makes me unwilling to consider using it.

  • Fair - it's gotten significantly better over the last 4 months or so, and hallucinations aren't nearly as bad as they once were. When I was using Heavy, it was excellent at ensuring grounding and factual statements, but it's not worth $100 more than ChatGPT Pro in capabilities or utility. In general, it's about the same as ChatGPT Pro - every so often I'll have to call the model out for making something up, but for the most part they're good at using search tools and making sure claims get grounding and confirmation.

    I do expect them to pull ahead, given the resources and the allocation of developers at xAI, so maybe at some point it'll be clearly worth paying $300 a month compared to the prices of other flagships. For now, private hosts and ChatGPT Pro are the best bang for your buck.

    • What are you doing with GPT Pro? I've compared it directly with Claude Max x20 and Google's premium offering. I just don't see myself ever leaving Claude Code as my daily driver. Codex is slow and opaque, albeit accurate. And Gemini is just super clumsy inside its CLI (and in OpenRouter), often confusing BASH commands and plans with actual output.

  • I had Grok write me a 150-line shell script, which it nearly one-shot, except that it made a one-character typo in some file-path-handling code that took me an hour to diagnose (a sketch of that class of bug is below). On one hand it's so close to being really, really good for coding, but on the other, with this sort of error (unlike other frontier models, which have easily diagnosable failure modes) it can be super frustrating. I'm hopeful we will see good things from Grok 5 in the coming months.
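
    To illustrate the class of bug (a hypothetical sketch; the actual script isn't shown here), shell parameter expansion is full of one-character traps in path handling:

        path="/data/in/report.txt"
        dir="${path%/*}"   # intended: strip the filename -> "/data/in"
        # Hypothetical one-character typo: '#' instead of '%' strips the
        # leading "/" instead, yielding "data/in/report.txt" - a now-relative
        # path that quietly resolves against $PWD, with no error to grep for.
        dir="${path#/*}"

    A wrong-operator slip like this is valid shell, so set -u won't catch it and linters generally won't either, which is how a one-character bug can eat an hour.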

Every time I use Grok I get bad results. Basically everything is 1000% perfect from its point of view; then you review the code and it's bollocks: methods that don't exist, or just one line of code, or a method created with a nice comment: //#TODO implement

" Grok 4 Heavy wasn't considered in comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating mmlu, scoring highest on many of the coding specific benchmarks. Overall better than Claude 4.5, in my experience, not just with the benchmarks."

I think these types of comments should just be forbidden from Hacker News.

It's all feelycraft and impossible to distinguish from motivated speech.