Grok 4 Launch [video]

3 days ago (twitter.com)

Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.

Specialized coding model coming "in a few weeks". I notice they didn't talk about coding performance very much today.

  • Agreed. I noticed a quick flyby of a bad “reasoning smell” in the baseball World Series simulation, though - it looks like it pulled some numbers from Polymarket, reasoned a long time, and then came back with the Polymarket number for the Dodgers but presented it as its own. It was a really fast run-through, so I may be wrong, but it reminds me that it’s useful to have skeptics on the safety teams of these frontier models.

    That said, these are HUGE improvements. Provided we don’t have benchmark contamination, this should be a very popular daily driver.

    On coding - 256k context is the only real bit of bad news. I would guess their v7 model will have longer context, especially if it’s better at video. Either way, I’m looking forward to trying it.

    • Either they overtook other LLMs by simply using more compute (which is reasonable to think, as they have a lot of GPUs) or I'm willing to bet there is benchmark contamination. I don't think their engineering team came up with any better techniques than those used in training other LLMs, and Elon has a history of making deceptive announcements.

      13 replies →

  • Even if one does not have a positive view of Elon Musk, Grok catching up to the big three (Google, OpenAI, Anthropic) is incredible. They are now at approximately the same level.

  • > Seems like it is indeed the new SOTA model, with significantly better scores than o3

    It has been demonstrated for quite some time that censoring models results in drastically reduced scores. Sure, maybe prevent it from telling someone how to build a bomb, but we've seen Grok 3 routinely side with progressive views despite having access to the worst of humanity (and its sponsor).

The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent design, too. I'm genuinely looking forward to trying this out.
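
A minimal sketch of the pattern in Python, for intuition only - xAI hasn't published details, so "agent" and "judge" here are hypothetical stand-ins for LLM calls:

    from concurrent.futures import ThreadPoolExecutor

    def solve_heavy(prompt, agent, judge, n=4):
        # Fan out: n independent agents attempt the same task in parallel
        with ThreadPoolExecutor(max_workers=n) as pool:
            drafts = list(pool.map(lambda _: agent(prompt), range(n)))
        # Fan in: a final judge call compares the drafts and picks or merges the best
        return judge(prompt, drafts)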

EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out, but it seems like xAI really has something here.

  • I can understand how/that this works, but it still feels like a 'hack' to me. It still feels like the LLMs themselves are plateauing but the applications get better by running the LLMs deeper, longer, wider (and by adding 'non-AI' tooling/logic at the edges).

    But maybe that's simply the solution, like the solution to original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.

    • This is exactly how human society scaled from the cavemen era to today. We didn't need to make our brains bigger in order to get to the modern industrial age - increasingly sophisticated tool use and organization was all we did.

      It only mattered that human brains are just big enough to enable tool use and organization. It ceased to matter once our brains were past a certain threshold. I believe LLMs are past this threshold as well (they haven't 100% matched the human brain and maybe never will, but that doesn't matter).

      An individual LLM call might lack domain knowledge and context, and it might hallucinate. The solution is not to scale the individual LLM and hope the problems are solved, but to direct your query to a team of LLMs, each playing a different role: planner, designer, coder, reviewer, customer rep, ... each working with their unique perspective & context.
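
      As a rough sketch of what that could look like (the role list and single hand-off context are illustrative, and "llm" is a hypothetical chat-completion wrapper):

          def run_team(request, llm):
              roles = [
                  ("planner", "Break the request into concrete, ordered steps."),
                  ("coder", "Implement the plan. Output only code."),
                  ("reviewer", "List bugs and risky assumptions, then give a fixed version."),
              ]
              context = request
              for name, instructions in roles:
                  # Each role sees only its own instructions plus the running context
                  context = llm(system=f"You are the {name}. {instructions}", user=context)
              return context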

    • I get that feeling too - the underlying tech has plateaued, but now they're brute-force trading extra time and compute for better results. I don't know if that scales anything better than, at best, linearly. Are we going to end up with 10,000 AI monkeys on 10,000 AI typewriters and a team of a dozen monkeys deciding which one's work they like the most?

      5 replies →

    • grug think man-think also plateau, but get better with tool and more tribework

      Pointy sticks and ASML's EUV machines were designed by roughly the same lumps of compute-fat :)

      3 replies →

    • Isn't that kinda why we have collaboration and get in a room with colleagues to discuss ideas? i.e., thinking about different ideas, getting different perspectives, considering trade-offs in various approaches, etc. results in a better solution than just letting one person go off and try to solve it with their thoughts alone.

      Not sure if that's a good parallel, but seems plausible.

    • It's basically a mixture of experts but instead of a learned operator picking the predicted best model, you use a 'max' operator across all experts.
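
      In code, the contrast is roughly this (a sketch; "experts" and "score" are hypothetical stand-ins):

          def best_of_n(prompt, experts, score):
              # The 'max' operator: run every expert, score each answer,
              # and keep the argmax - no learned router choosing up front
              answers = [expert(prompt) for expert in experts]
              return max(answers, key=lambda a: score(prompt, a))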

  • > Expensive and slow

    Yes, but... in order to train your next SotA model you have to do this anyway and do rejection sampling to generate good synthetic data.

    So if you can do it in prod for users paying $300/month, it's a pretty good deal.
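
    A minimal sketch of that rejection-sampling loop ("generate" and "verify" are hypothetical stand-ins for an expensive model call and an answer checker):

        def harvest_synthetic_data(prompts, generate, verify, k=8):
            dataset = []
            for prompt in prompts:
                # Sample k candidate answers from the expensive setup
                candidates = [generate(prompt) for _ in range(k)]
                # Keep only verified answers as (prompt, answer) training pairs
                dataset += [(prompt, c) for c in candidates if verify(prompt, c)]
            return dataset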

  • that's how o3 pro also works IMO

    • I can’t help but call out that o1-pro was great; it rarely took more than five minutes and I was almost never dissatisfied with the results per the wait. I happily paid for o1-pro the entire time it was available. Now, o3-pro is a relative disaster, often taking over 20 minutes just to refuse to follow directions and gaslight people about files being available for download that don’t exist, or to provide simplified answers after a 20-minute wait. It’s worse than useless when it actively wastes users’ time. I don’t see myself ever trusting OpenAI again after this “pro” subscription fiasco. To go from a great model and then just take it away and force an objectively terrible replacement is definitely going the wrong way, when everyone else is improving (Gemini 2.5, Claude Code with Opus, etc). I can’t believe Meta would pay a premium to poach the OpenAI people responsible for this severe regression.

      1 reply →

    • This is the speculation, but then it shouldn't take much longer than o3 to answer.

    • Interesting. I'd guess this technique should probably work with any SOTA model in an agentic tool loop. Fun!

  • > I'm genuinely looking forward to trying this out.

    Myself, I'm looking forward to trying it out when companies with less, um, baggage implement the same. (I have principles I try to maintain.)

  • I've suspected that technique could help mitigate hallucinations, since other agents could call bullshit on a made-up source.

  • You are making the mistake of taking one of Elon's presentations at face value.

    • I mean, either they cheated on evals à la Llama 4, or they have a paradigm that's currently best in class in at least a few standard evals. Both alternatives are possible, I suppose.

  • So the progress is basically to brute force even more?

    We went from "single prompt, single output", to reasoning (simple brute-forcing), and now to multiple parallel instances of reasoning (distributed brute-forcing)?

    No wonder the prices are increasing and capacity is more limited.

    Impressive. /s

I just tried Grok 4 and it's insanely good. I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance with certain pre-installed software. Grok produced all the code in one iteration. 1,000 lines of code, including VPC, Security Groups, etc. Zero syntax errors! Most importantly, it generated userData (#!/bin/bash commands) with accurate `wget` pointing to valid URLs of the latest software artifacts on GitHub. Insane!
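
The commenter used Java, but for a sense of the shape of the task, here is a minimal CDK v2 sketch in Python (instance type, AMI, and the wget URL are placeholders, not what Grok generated):

    from aws_cdk import Stack
    from aws_cdk import aws_ec2 as ec2
    from constructs import Construct

    class DemoStack(Stack):
        def __init__(self, scope: Construct, id: str, **kwargs) -> None:
            super().__init__(scope, id, **kwargs)
            vpc = ec2.Vpc(self, "Vpc", max_azs=2)
            sg = ec2.SecurityGroup(self, "Sg", vpc=vpc, allow_all_outbound=True)
            sg.add_ingress_rule(ec2.Peer.any_ipv4(), ec2.Port.tcp(22), "SSH")
            # userData: shell commands run on first boot
            user_data = ec2.UserData.for_linux()
            user_data.add_commands(
                "wget https://github.com/example/app/releases/latest/download/app.tar.gz"
            )
            ec2.Instance(self, "Instance",
                         vpc=vpc,
                         instance_type=ec2.InstanceType("t3.micro"),
                         machine_image=ec2.MachineImage.latest_amazon_linux2(),
                         security_group=sg,
                         user_data=user_data)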

  • The problem is that code as a one-off is excellent, but as a maintainable piece of code that needs to be in source control, shared across teams, follow standard SDLC, be immutable, and track changes in some state - it's just not there.

    If an intern handed me code like this to deploy an EC2 instance in production, I would need to have a long discussion about their decisions.

  • Please share your result if possible. So many lines in a single shot with no errors would indeed be impressive. Does grok run tools for these sorts of queries? (linters/sandbox execution/web search)

  • Out of curiosity, why do you use Java instead of typescript for CDK? Just to keep everything in one language?

    • Why not, I would say? What's the advantage of using Typescript over modern Java?

The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have.

I can already use Gemini 2.5 Pro for free in AI studio. Crazier still, I can even set the thinking budget to a whopping 32k and still not pay a dime. Maybe Gemini 3.0 will be available for free as well.
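
For reference, setting that thinking budget looks roughly like this with the google-genai Python SDK (a sketch; the API surface changes quickly, so treat the names as assumptions and double-check the current docs):

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="...")
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents="Explain the proof sketch step by step.",
        config=types.GenerateContentConfig(
            # Cap the model's internal reasoning at 32k thinking tokens
            thinking_config=types.ThinkingConfig(thinking_budget=32768),
        ),
    )
    print(resp.text)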

  • Who promised that there would be no advanced models with high costs?

    Prices for the same number of tokens at a given level of capability are falling. But Moore’s law most certainly did NOT say that chips would get no more complex than the 1103 1Kb DRAM and would merely shrink from 10mm^2 to a speck far too small to see.

  • > These prices seem to keep increasing while we were promised they'll keep decreasing.

    A Ferrari is more expensive than the model T.

    The most expensive computer is a lot more expensive than the first PC.

    The price that usually falls is:

    * The entry level.
    * The same performance over time.

    But the _price range_ gets wider. That's fine. That's a sign of maturity.

    The only difference this time is that the entry level was artificially 0 (or very low) because of VC funding.

    • But where is the value?

      If it could write like George Will or Thomas Sowell or Fred Hayek or even William Loeb that would be one thing. But it hears dog whistles and barks which makes it a dog. Except a real dog is soft and has a warm breath, knows your scent, is genuinely happy when you come home and will take a chomp out of the leg of anyone who invades your home at night.

      We are also getting this kind of discussion

      https://news.ycombinator.com/item?id=44502981

      where Grok exhibited the kind of behavior that puts "degenerate" in "degenerate behavior". Why do people expect anything more? Ten years ago you could be a conservative with a conscience -- now if you are you start The Bulwark.

      21 replies →

    • > The most expensive computer is a lot more expensive than the first PC.

      Not if you're only looking at modern PCs (and adjusting for inflation). It seems unfair to compare a computer built for a data center with tens of thousands in GPUs to a PC from back then as opposed to a mainframe.

      2 replies →

    • > The most expensive computer is a lot more expensive than the first PC.

      Depends on your definition of "computer". If you mean the most expensive modern PC I think you're way off. From https://en.wikipedia.org/wiki/Xerox_Alto: "The Xerox Alto [...] is considered one of the first workstations or personal computers", "Introductory price US$32,000 (equivalent to $139,000 in 2024)".

    • The base model Apple II cost ~$1300USD when it was released; that's ~$7000USD today inflation adjusted.

      In other words, Apple sells one base-model computer today that is more expensive than the Apple II; the Mac Pro. They sell a dozen other computers that are significantly cheaper.

      1 reply →

    • That was the most predictable outcome. It's like we learned nothing from Netflix or the general enshittification of tech by the end of the 2010s. We'll have the billionaire AI tech capture markets and charge enterprise prices to pay back investors. Then maybe we'll have a few free/cheap models fighting over the scraps.

      Those small creators hoping to leverage AI to bring their visions to life for less than their grocery bill will have a rude awakening. That's why I never liked the argument of "but it saves me money on hiring real people".

      I heard some small Chinese shops for mobile games were already having this problem in recent years and had to re-hire their human labor when costs started rising.

  • It's important to note that pricing for Gemini has been increasing too.

    https://news.ycombinator.com/item?id=44457371

    • I'm honestly impressed that the sutro team could write a whole post complaining about Flash, and not once mention that Flash was actually 2 different models, and even go further to compare the price of Flash non-thinking to Flash Thinking. The team is either scarily incompetent, or purposely misleading.

      Google replaced flash non-thinking with Flash-lite. It rebalanced the cost of flash thinking.

  • It’s the inference-time scaling - this is going to create a whole new level of haves-vs-have-nots split.

    The vast majority of the world can’t afford hundreds of dollars a month.

  • Why is the number of GPUs the problem and not the amount of GPU usage? I don't think buying GPUs is the problem; running tons of them is what gets very expensive. I presume that's the reason it's so expensive, especially with LLMs.

  • also their API pricing is a little misleading - it matches Sonnet 4 pricing ($3/$15) only "for requests under 128k" (i.e., prompts below 128k tokens), but above that it's 2x more.

    • That 128k is a reference to the context window - how many tokens you put in at the start. Presumably Grok 4 with a 128k context window is running on less hardware (it needs much less RAM than 256k), and they route requests accordingly internally.
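
      In other words, something like this (rates from the parent comment; the 2x multiplier above 128k input tokens is the reported behavior, assumed here):

          def grok4_cost_usd(input_tokens, output_tokens):
              # $3/M input and $15/M output below a 128k-token prompt,
              # reportedly 2x above it
              m = 2 if input_tokens > 128_000 else 1
              return m * (input_tokens * 3 + output_tokens * 15) / 1e6

          # A 200k-token prompt with a 2k-token answer:
          # grok4_cost_usd(200_000, 2_000) == 2 * (0.60 + 0.03) == 1.26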

  • > These prices seem to keep increasing while we were promised they'll keep decreasing

    I don't remember anyone promising that, but whoever promised you that frontier public model pricing would be monotonically decreasing, over some period of time that includes our current present, was either lying or badly misguided. While there will be short-term deviations, the overall arc for that will continue to be upward.

    OTOH, the models available at any given price point will also radically improve, to the point where you can follow a curve of both increasing quality and decreasing price, so long as you don't want a model at the quality frontier.

  • > These prices seem to keep increasing while we were promised they'll keep decreasing.

    Aren't they all still losing money, regardless?

  • o3 was just reduced in price by 80%. Grok 4 is a pretty good deal for having just been released and being so much better. The token price is the same as Grok 3 for the non-heavy model. Google is losing money to try and gain relevance. I guess I'm not sure what your point is?

  • It's because a lot of the advancements are in post-training; the models themselves have stagnated. Look at the heavy "model"...

  • You have to have a high RRP to negotiate any volume deals down from.

    Like the other AI companies, they will want to sign up companies.

  • > These prices seem to keep increasing

    Well, valuations keep increasing, they have to make the calculations work somehow.

Grok's updated voice mode is indeed impressive. I wish there was a way to disable automatic turn detection, so that it wouldn't treat silence as the end of my response. I like Claude's approach (you need to tap in order to end the response), but it's not very reliable, because sometimes it just abruptly cuts my response off without waiting until I tap.

I was pleasantly surprised that Grok even supports (to some degree) Lithuanian in voice mode, which is quite a niche language. Grok's responses themselves are alright, but ChatGPT and Gemini way surpass it in speech recognition and speech synthesis.

  • > Grok's updated voice mode is indeed impressive. I wish there was a way to disable automatic turn detection, so that it wouldn't treat silence as an end of the response.

    You can circumvent that by instructing the model to use "radio etiquette" - only respond after the other party says "over". It will still be compelled to answer when it detects silence, you can't prevent that, but you can instruct it to only reply with a short "mhm" until you say "over". Feels very natural.

    Like most models I've used with this old hack, it will immediately start role-playing and also end its own responses with "over".
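
    As a custom instruction or system prompt, the hack can be as simple as the following (the wording is just one way to phrase it, not the commenter's exact prompt):

        RADIO_ETIQUETTE = (
            "We use radio etiquette. My turn only ends when I say 'over'. "
            "If you hear silence before 'over', reply with a short 'mhm' "
            "and keep listening. End each of your own turns with 'over'."
        )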

    • This is such a cool idea. I wonder whether it's possible to define a custom Personality in Grok's voice settings that would do this. Unfortunately I'm not able to create a new Personality in Grok's settings to test this right now on my phone (iPhone 15 Pro Max), because the Personality creation screen closes immediately after opening it. Might be a bug or some other issue.

    • this is such a great, obvious(?) idea. I've always hated feeling "rushed" whenever I talk to a voice agent and it doesn't give me enough time to think.

  • Yes, their voice mode is pretty good; it also works with Polish (much better than a few months ago). I wish they also had a 'push to talk' option (walkie-talkie style, with a big button), similar to how Perplexity lets you choose such a mode or 'automatic'.

    It would also be great if they added voice mode in the browser (again, like Perplexity).

    • > Also would be great if they added voice mode in browser

      There seems to be a voice mode button in the prompt input box at ~29:00 of the Grok 4 announcement video. So perhaps they're working on this, but it's hidden from the public.

  • I find that for auto turn detection, models work better if you put in the system prompt "if it seems the user hasn't completed their thought yet, output silence". This hack works around their compulsive need to output something.

  • Even better if you can just use umms, like in a human conversation.

    • I feel like they should train a dumb model that does nothing but recognize when someone has finished talking, and use that to determine when to stop listening and start responding. Maybe it could even run on the phone?
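
      The acoustic half of that already runs on-device today; here is a sketch using the webrtcvad package (the harder "has the thought finished?" part would still need a separate, smarter classifier):

          import webrtcvad

          vad = webrtcvad.Vad(2)  # aggressiveness 0 (loose) to 3 (strict)

          def turn_ended(frames, sample_rate=16000, needed=50):
              # frames: 30 ms chunks of 16-bit mono PCM;
              # 50 quiet frames in a row is roughly 1.5 s of silence
              quiet = 0
              for frame in frames:
                  quiet = 0 if vad.is_speech(frame, sample_rate) else quiet + 1
                  if quiet >= needed:
                      return True
              return False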

  • Lithuanian sounds so weird on ChatGPT though, almost like my kids speak - with a sort of English accent. Regardless, it gives my parents a superpower (when it actually works, hehe).

Grok has consistently been one of the best models I've used for deep research (no API use). Grok 4 looks even more promising.

  • Grok's Twitter integration has legitimately been one of the best use cases I've seen. Just being able to ask Grok right within the tweet about context or meaning of any jargon is very useful.

    • I think the Grok button that is present on tweets is the best way to ask Grok about tweets. Tagging @grok just spams others' timelines with useless AI responses. The Grok button lets you keep it private.

      3 replies →

    • It still struggles to grok large threads.

      Hope FB brings something like this tho. Might be especially useful to summarize/search big groups.

      People used to cry about how private groups and Slack killed forums and hid information away, but I think we have a chance with tools like this.

  • I'm surprised by this, OpenAI does much better for me than all the competitors (though I wouldn't consider it good).

    The only two areas I've found Grok to be the best at are real time updates and IT support questions.

  • > deep research

    Can you say what you mean by deep research?

Out of interest, has anyone ever integrated with Grok? I've done so many LLM integrations in the last few years, but never heard of anyone choosing Grok. I feel like they are going to need an unmistakably capable model before anyone would want to risk it - they don't behave like a serious company.

  • Grok 3 is on Azure AI Foundry [0] and announced an integration with Telegram, albeit they are paying Telegram $300m, not vice versa [1]. But I agree, choosing Grok is just a huge reputational liability for anyone whose work is serious.

    [0] https://devblogs.microsoft.com/foundry/announcing-grok-3-and... [1] https://www.bbc.co.uk/news/articles/cdxvr3n7wlxo

    • Any plans for GCP Vertex AI or AWS Bedrock? Apparently Grok 3 had the highest score for Golang on roocode.com/evals, so I’d like to try it for coding. The free-tier app hasn’t been bad either; I like its attitude a bit better than ChatGPT’s.

  • I'm more curious where Grok gets talent from.

    There is so much money and so many top labs falling over themselves to attract good talent, that at this point people have to be leaning on ideological goals to choose their employer.

    Are there really that many AI researchers who want to make Elon god-emperor?

    • I read the last election and other signals as meaning there's way more unspoken diversity of thought in people's minds than what people feel safe to say. Secretly, lots of top talent probably doesn't care, or even aligns with Elon, and chooses to say so at most with their actions, in the form of being OK working for him.

      1 reply →

    • A lot of serious engineers would love to work in an environment that isn't the HR-reigning office politics bullshit standard of the past decade or two.

      I don't even really like Elon but I bet the engineers at X are having a better time in their day-to-day than the ones at Meta or Google where all their work is constantly roadblocked by red tape, in-fighting, and PMs whose only goal is to make it look like they headed something important to get themselves promoted. Elon's at least got a vision and keeps it a top priority to be competitive in the AI space.

    • I also feel Elon's team has been "untouchable" for Zuck, who doesn't want to stir anything up with him. But since Elon's fall from grace with the admin, that could change?

    • If you're focusing on ideology, it isn't like the other companies are all that good. With Sam Altman you're still working for a pathological liar with delusions of grandeur. With Google and Meta you're propping up a massive worldwide surveillance apparatus.

      Tech-bros have been propping up agents/propagators of some of the biggest social ills of the past ~2 decades, xAI isn't all that different.

  • I am using Grok to visually analyze food images. Works really well, recognizes brands and weird shots users send me. API really easy to use.

  • You would have to be insane to integrate the model that last week called itself "Mecha Hitler" into your live product.

    As a huge Musk fan, I'll be the first to point out that he's doing exactly what he accused Sama of doing: making powerful AI with an obvious lack of control or effective alignment.

  • [flagged]

    • There have been at least two instances of "unauthorized modifications" to the system prompt of the Grok model running wild in X, but if you build your own integration you would provide your own system prompt and be unaffected by that.

      On the model side I've found Grok3 to be very unbiased. If you ask it to write a story it will somehow find a way to weave a mention of X/Twitter into that story, but other than that it is much less biased and moralizing than e.g. OpenAI models. It also has very lax guard rails, so that's something you'd probably want to add

      I can't say yet whether all of this is still true for Grok 4

      3 replies →

Grok 4 helped me solve a problem with inconsistent behavior when running lldb via Python. Had differences between Docker and my local Linux box. Turns out to be a difference in how AddressSanitizer works in the slightly different environments. o3 didn’t catch it. So far I’m impressed.

Grok 4 sets a new high score on my Extended NYT Connections benchmark (92.4), beating o3-pro (87.3): https://github.com/lechmazur/nyt-connections/.

Grok 4 Heavy is not in the API.

  • Very impressive, but what do you think the chances are that this was in the training data?

    • > but what do you think the chances are that this was in the training data?

      Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle, it's been out for more than 2 years, and even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other information, problems and solutions from NYT Connections is in any of the other datasets.

      14 replies →

    • You raise a good point. It seems like it would be trivial to pick out some of the puzzles and remove all the answers from the training data.

      I wish AI companies would do this.

    • The exact questions are almost certainly not in the training data, since extra words are added to each puzzle, and I don't publish these along with the original words (though there's a slight chance they used my previous API requests for training).

      To guard against potential training data contamination, I separately calculate the score using only the newest 100 puzzles. Grok 4 still leads.

Metrics aside, Grok model names make more sense than OpenAI's. With OpenAI, I've really lost track of which model is better, and in which way.

Ah, this is a positive thread, so not [flagged] - gotta say, Hacker News really has been shameful of late with its shutting down of the negative stories around Grok.

  • I'd assume that it's because they devolve into politics and Elon-bashing, rather than constructive discussion

So, should we expect GPT-5 in a few days now? OpenAI seems to only release new models when someone catches up, and they release something that is just slightly better.

  • They only do that against Google. They like to pretend xAI isn't a competitor, and doing this would implicitly signal that the release makes them scared.

> You can cut & paste your entire source code file into the query entry box on grok.com and @Grok 4 will fix it for you!

> This is what everyone @xAI does. Works better than Cursor.

This makes no sense to me whatsoever.

https://xcancel.com/elonmusk/status/1943178423947661609

  • Essentially this is manual context management, and it’s still better for straightforward tasks that don’t require the AI to run commands (e.g. running unit tests).

    I had Gemini cli running trying to do a straightforward refactor today, but when I copy-pasted the relevant code into the Gemini web app, it came up with the solution instantly.

    • Yes, I've seen this multiple times personally, it's often better to copy/paste and give detailed prompts in the standalone apps for higher quality than in the coding agents in your codebase.

      2 replies →

    • Cursor is in a different league because it writes to your filesystem and is an AI agent in front of other AIs.

    Musk obviously didn't test Cursor, and either got this from his yesmen, or he's just lying unchecked as usual.

    • But if it's truly better (as in the content and the result being better), then copying and pasting is not the most important thing. I used Claude the other day by just copying and pasting and that worked just fine.

      10 replies →

    • You're ignoring the fact that Cursor does all sorts of context management (actually, reduction) and prompt engineering to try and get good results for cheaper. The fact that you're saying the only 3 explanations are

      1. Musk didn't test Cursor

      2. Yesmen

      3. Lying

      Shows much more about your biases than anything related to Grok 4 usage

      1 reply →

  • He speaks in movie terms, exactly what I'd say when I watch a movie about programming

  • I don't understand what's so amazing in that screenshot demonstrating the detected errors in the vim plugin. Each item looks like it could be caught by some stricter linting rules.

  • A later post clarifies there’s some issue with cursor integration that will get fixed.

I just thought of a good test. Anyone have feedback?

We completely remove a couple of simple, obvious inventions from the training data and then see if the AI can come up with them. Perhaps a toothbrush, for example. Or a comb? But there could be better examples that would also have minimal effect on the final AI.

Training is expensive so we wouldn’t want to leave anything important out like the wheel.

  • It’s very, very hard to remove things from the training data and be sure there is zero leakage.

    Another idea would be to use, for example, a 2024 state of the art model to try to predict discoveries or events from 2025.

  • LLM companies try to optimize their benchmark results, not to test the capabilities of their systems. This is why all the benchmarks are so utterly useless.

  • Ok, you do it. Here’s the internet: https://internet Make sure you don’t miss any references while you’re combing through, though.

    • I see your point, but off the top of my head: run a simple regex on each document against a list of dental-related words; anything that matches gets earmarked for a small LLM to determine whether it actually involves a toothbrush concept.
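
      A sketch of that first pass (the word list and helper names are invented for illustration):

          import re

          DENTAL = re.compile(
              r"\b(tooth|teeth|toothbrush|dental|dentist|floss|plaque)\b", re.I
          )

          def earmark(doc: str) -> bool:
              # Cheap regex pass; anything flagged goes on to a small LLM
              # that decides whether the passage describes a toothbrush
              return bool(DENTAL.search(doc))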

      2 replies →

Does anyone here have access to Grok 4 yet? If so, could you please try asking it to solve this basic word search problem [0] and share the results? It's just a simple grid of letters where you have to find the position of each word, the kind of problem that any young child can easily solve.

[0] https://imgur.com/VxNP5jG

  • They said they're training a new base model for better multimodal performance soon. I wouldn't expect it to be able to read an image like that today. Maybe if you provided it in text format.

    • As a point of interest and for comparison, Gemini 2.5 Pro is able to generate a Python program that outputs the complete correct solution when run, but it can't figure out how to one-shot the problem if asked directly.

      This is just a for-fun test to get a sense of how models are progressing; it highlights the jagged nature of their intelligence and capabilities. None of the big AI labs are testing for such a basic problem type, which makes it a bit of an interesting check.

      I think it's still interesting to see how Grok 4 performs, even if we don't use this test to draw any broader conclusions about what capabilities it offers.

    • description from openrouter:

      > Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning is not exposed, reasoning cannot be disabled, and the reasoning effort cannot be specified.

      unfortunately no requests are passing because of some rate limits

  • These models are not trained on character-level input. Why would anyone expect them to perform well on character-level puzzles?
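
    You can see the mismatch with any BPE tokenizer, e.g. OpenAI's tiktoken (Grok's own vocabulary isn't public, but the effect is similar):

        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        ids = enc.encode("strawberry")
        print([enc.decode([t]) for t in ids])  # ['str', 'aw', 'berry']
        # The model sees three opaque ids, not ten characters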

    • They are trained on many billions of tokens of text dealing with character-level input; they would be rather dumb if they couldn't learn it anyway.

      Every human learns that: when you hear the word "strawberry" you don't hear the double r there, yet you still know the answer.

      16 replies →

If indeed, as the new benchmarks suggest, this is the new "top dog" of models, why is the launch feeling a little flat?

For comparison, the Claude 4 hacker news post received > 2k upvotes https://news.ycombinator.com/item?id=44063703

  • Upvotes are a lagging indicator. Despite all the leaderboard scores presented, etc, no one actually knows how good a model is until they go use it for a while. When Claude 4 got ~2k upvotes, it was because everyone realized that Claude 3.7 was such a good model in practice - it had little to do with the actual performance of 4.

  • Other AI companies post a 5 minute article to read.

    This is a 50-minute-long video; many won't bother to watch it.

  • Because the benchmarks are likely gamed. Also, Grok had an extremely negative news cycle right before this, so the average bloke is skeptical that the smartest AI in the world thinks the last name Steinberg means someone is a shadowy, evil, cabal-type figure. Even though the two things aren't totally related, most people aren't deep enough in the weeds to know that.

  • It's a shame this model is performing so well, because I can't in good conscience pay money to Elon Musk. I'll just have to wait for the other labs to do their thing.

    • I think it's a shame that your emotions are so much in your way. It's an illusion to think you can assess Elon at his true worth, like AI hallucinating due to lack of context.

      2 replies →

  • I'm not sure there's any benchmark score that'd make me use a model that suddenly starts talking about racist conspiracy theories unprompted. Doubly so for anything intended for production use.

  • [flagged]

    • Maligning any alternative viewpoints to yours as just some indoctrinated people following “marching orders”, rather than addressing the substance of their critique, constitutes a “poisoning the well” fallacy.

      1 reply →

  • [flagged]

    • Probably more like: Claude was slightly better than GPT-xx when the IDE integrations first got widely adopted (this was also the time when there was another scandal about Altman/OpenAI on the front page of HN every other week), so most programmers preferred Claude. It then got into a virtuous cycle where Claude got the most coding-related user queries and became the better coding model among SOTA models, which resulted in the current situation today.

As impressive as this is, how can any organization pick xAI as an API provider, knowing they have post-trained the model to match Elon’s personal politics, and possibly added other not-yet-known surprises? Great technical work, but the business is toast.

  • As long as it solves my technical tasks, I don't care what political biases it has.

It's such a crazy time to be alive right now and it's even more interesting to be in the middle of major changes in Software Development.

LLMs have already dramatically changed our industry, and I can't fathom what the possibilities could look like in the future when these models become smarter.

Right now there is a rush, with companies pouring millions into R&D, so there is certainly hype, but I have no doubt that this will yield incremental improvements over the next few decades. The result will look like a breakthrough in Computer Science and Engineering.

I remained a skeptic for a long time (and still am), however after messing around with these LLMs, I can't ignore the fact that they have significantly boosted my productivity. It takes time to learn how to work with these tools, and they require supervision and review, but I feel better leveraging LLMs than writing code from scratch for every feature.

What will our job look like in the next 30 years? It's hard to say but I doubt most of us will be writing code by hand.

  • And again this comment.

    Does anybody have any example of a company that made some huge product with close to no developers by using those AIs? Or of something harder to create than what we are used to, made possible by using the AIs? Or anything else that shows that "LLMs have already dramatically changed our industry"?

    • Note that OP didn’t say anything about “close to no developers”, only that they could tell they had become more productive.

      I too know I am being more productive. The most concrete examples for my work has come from the ease of prototyping: making a quick quasi-working version of an idea is now insanely easy, so we’ve been able to explore (and adopt) ideas that would not have been worth the effort previously.

    • Can't reveal for confidentiality reasons but I know several examples, and have worked and been working on a couple, too.

      But my claim isn't that there's no developer involved, it's two-fold:

      1. LLMs do allow for features which were not possible before, or which would require significantly much more engineering, if possible at all. For example: producing a sensible analysis of a piece of poetry (or thousands of pieces of poetry) in seconds.

      2. LLMs, if used correctly (not just "stick a prompt in it and pray") allow for very fast time-to-market, building quick solutions out of which you can then carve out the bits that you know you can (and should) turn into proper code.

      Point 2. should not be understated. A smaller team (of developers!) can now get to market very quickly, as well as iterate to appropriate product-market-fit fast, offloading logic to LLMs and agentic loops, while slowly and selectively coding in the features. So, slowly, we replace the LLM/agents with code.

      Not only have I worked on and seen products which fit point 1. (so very hard to do without LLM's abilities), but I have seen a lot of 2.

      Furthermore, I've seen a sentiment on HN (and with peers) which I find is incredibly true: LLMs and agents allow us to offload the parts we would never work on due to not enjoying them in the first place. They effectively let us "take the plunge" or "finally pull the trigger" on a project which we would have otherwise just never been able to start. We are able to try new things more often, and take more risk. As a personal example, I hate frontend development, something which always prevented me from starting a bunch of projects. Now I've been able to start a bunch of these projects. It has definitely unlocked me, allowing me to test more ideas, build projects that people actually use (the frontend only has to be "good enough", but it has to exist), or eventually bring in more people to a project.

      So LLMs have undoubtedly dramatically changed at least my life as an engineer, developer, and product guy. I can't say it has changed the industry for sure, but if I had to bet, I'd say "hell yes".

      (LLMs have definitely had a very profound impact on many other aspects of my life as well, outside of work)

    • > Does anybody have any example of a company that made some huge product from close to no developers by using those AIs?

      You do not have to go as far as “the whole product with zero engineers”, but arguing against productivity gains from AI and agents because these tools still can’t build a billion-dollar business on their own is strange.

    • My brother is doing this right now, FWIW. He still works with at least one other developer but has been vibe coding two products simultaneously. I've seen them, they work great and will be genuinely useful when launched. One of them already has commercial interest from the intended users. He's launched a successful consumer app before pre-LLM, so has form.

      Of course you could say that's not "huge", but it's clearly working and is allowing him to move at insane speed.

    • If you created that, or any amazing achievement, how quick would you be to share that it was the AI and not "natty"?

Perhaps a dumb question, but is the only way to use Grok 4 for now via grok.com? Only via paid? No way to try it out for free, correct?

Technical question: Can someone explain how the vision backbone can be replaced after training? I think this is what they mentioned in the video. Just wondering how it would work, since I would suspect that the visual embeddings would be highly affected.

PS: Is the approach something like LoRA, or a complete retrain of the visual part?

  • When I've had Grok evaluate images and dug into how it perceives them, it seemed to just have an image labeling model slapped onto the text input layer. I'm not sure it can really see anything at all, like "vision" models can.

    It was giving coordinate bounding boxes and likelihood matches to generic classifications for each:

        - *Positions*:
          - Central cluster: At least five bugs, spread across the center of the image (e.g., x:200-400, y:150-300).
          - Additional bugs: Scattered around the edges, particularly near the top center (x:300-400, y:50-100) and bottom right (x:400-500, y:300-400).
        - *Labels and Confidence*:
          - Classified as "armored bug" or "enemy creature" with ~80% confidence, based on their insect-like shape, spikes, and clustering behavior typical of game enemies.
          - The striped pattern and size distinguish them from other entities, though my training data might not have an exact match for this specific creature design.
    

        - *Positions*:
          - One near the top center (x:350-400, y:50-100), near a bug.
          - Another in the bottom right (x:400-450, y:350-400), near another bug.
        - *Labels and Confidence*:
          - Classified as "spider" or "enemy minion" with ~75% confidence, due to their leg structure and body shape.

  • Don't know how Grok is setup, but in earlier models the vision backbone was effectively a separate model that was trained to convert vision inputs into a tokenized output, where the tokenized outputs would be in the form of "soft tokens" that the main model would treat as input and attend to just like it would for text token inputs. Because they're two separate things, you can modify each somewhat independently. Not sure how things are currently setup tho.

What's the guarantee that in a month/week/year Musk won't change Grok to align more with his personal opinions?

Him talking about instilling "values" about how we should build an AI that, if like a child, would grow up to be incredibly powerful, reveals a lot about how he formulates his internal value system and how he relates to the world.

  • Yeah it reminds me of the Bobiverse’s take on how AI needs to be built: it needs to grow up, rather than waking up fully formed.

    To me, AGI is achieved when the machine can improve itself and reproduce in a way that allows survival of the fittest and evolution to take place, though I’m sure when those goals are achieved someone will redefine AGI to be something even more unattainable.

What's Grok 4's training data cutoff?

Edit: A few chats seem to indicate a mid-2024 cutoff.

The only good thing about this launch is that it will push the other (sane) companies to release their new frontier models.

Grok never promised a Claude Code competitor for the near future, right? I know I can probably use Grok with something like Roo Code, but I do like Claude Code, as I can use it with Cursor's tab feature. I'd ditch Cursor completely if not for the tab feature, which is still useful.

What the hell is that voice? Something between a 90s action movie trailer, a children's commercial, and a gay porn movie?

Besides that, this video contains exactly zero real information.

> We need to make sure that the AI is a good AI. And the thing that I think is most important for AI safety, at least my biological neural net tells me the most important thing for AI is to be maximally truth-seeking. So this is very fundamental. You can think of AI as this super-genius child that ultimately will outsmart you, but you can instill the right values and encourage it to be sort of truthful, honorable, good things. The values you want to instill in a child that ultimately grow up to be incredibly powerful.

These are the words of a billionaire who has been supporting authoritarian and ethno-nationalist movements across the world, including playing a key role in the authoritarian takeover of the US government. He wants to instill “truth-seeking” as a “value” in Grok in anticipation of its future power.

But the authoritarian ethno-nationalist version of “truth” is not one based on science and objectivity. It’s the misanthropic “truth” widespread among ethnic-nationalist and authoritarian ideologies - “truth” that appeals to billionaires and disenfranchised members of the working class alike because it provides scapegoats without challenging the structural origins of that very disenfranchisement. A real commitment to truth would mean seeing past the exploitive power structure that Elon and billionaires like him inhabit.

  • I dunno. Talking with Grok 3 about political issues, it does seem to be pretty "truth-seeking" and not biased. I asked it to come up with matter-of-fact political issues and evaluate which side is more accurate, and it said the Left is more correct on almost all of them.

    • Elon has described Grok 3's behavior as a bug that needs to be fixed, complaining that it is "parroting legacy media", telling it things like "only a very dumb AI would believe Media Matters and Rolling Stone", and repeatedly assuring other X users that he would "fix it".

      This led up to the MechaHitler incident.

Interested to see how it all works out. Elon has been using a lot of smoke and mirrors lately, but this seems like an area where they can genuinely make progress - with the right talent, competing in the GenAI world is totally possible right now. Sign me up for improvements in this space!

  • Area where they can make progress? Yeah sure, but that seems to imply that they're not doing great?!

    Can you name an Elon company that is not number 1 globally in terms of product capabilities?

    The only one I would've been able to name would've been Grok. Until yesterday.

    • The only one that is number one is SpaceX (and Starlink, if you count that separately).

      None of the neuroscience people I follow think much of Neuralink; none of the civil engineers I've talked to IRL think much of TBC; none of the car people I follow favour Tesla over the huge range of competitors, and that includes the robo-taxi where they're about 6.5 years behind Waymo; X.com is so painful that whenever someone shares a link with me, I edit the URL to Xcancel.com *because that loads faster by a bigger margin than the time taken to edit the URL* and actually shows me the thread without needing an account of my own.

      But the space nerds I follow are still impressed with SpaceX, and they have extremely obvious reasons to be impressed.

Did no one notice that their voice demo was staged and prerecorded, with several cuts and several different videos patched together?

I don't really understand why E. Musk got rid of OpenAI.

I can recall the first experiments with Dota 2 while he was still "in charge" of OpenAI.

  • He wanted to be the CEO and merge it with Tesla[0], but the researchers had a problem with him (some had a problem with Altman as well, but that's another story). He did not have any real options, since OpenAI was a non-profit then, so he just left. The new book The Optimist[1] about Sam Altman has some more details on this and other OpenAI Game of Thrones episodes; I definitely recommend it for those interested.

    [0] https://openai.com/index/openai-elon-musk/

    [1] https://www.goodreads.com/book/show/223400731-the-optimist

  • He didn't "get rid of" OpenAI.

    When he left OpenAI, the stated reason was a conflict of interest: Tesla was ramping up work on self-driving.

    He also hired A. Karpathy away from OpenAI to lead Tesla's AI vision team.

    • There's also the small detail where OpenAI decided to only remain open in name?

      And the fact that Sam from the very start wanted to turn it into his own closed source for-profit company (still ongoing) using non-profit funding as start-up seed funds (essentially stealing Elon Musk's money)?

      1 reply →

  • “you could parachute him [Sam Altman] into an island full of cannibals and come back in five years and he’d be the king”

    Paul Graham

Honestly if it actually does score 44.4% on Humanity's Last Exam, that would be super impressive as Gemini 2.5 Pro and o3 with tools only score 26.9% and 24.9%.

  • Is that not just how scaling goes? It generally feels like the top models are mostly interchangeable and the one that came out at time t+1 will be better than earlier models from time t.

    Grok 4 was probably already training when o3 was released, and now that Grok 4 is out, OpenAI is probably preparing o4, Google is preparing Gemini 3, and soon new SOTA benchmark scores will appear.

    So it is impressive but not surprising, no? Whoever releases the latest model and has sufficient compute will be SOTA.

  • I dunno, "with tools" means different things for different models. It depends on what tools you give it access to. HLE demands a lot of specialized stuff. Like an interpreter for the esoteric programming language Piet for two questions. If you're not standardizing the set of tools, these aren't apples-to-apples numbers.

    • Even without tools it also outperforms Gemini 2.5 Pro and o3: 25.4% compared to 21.6% and 21.0%. Although I wonder if any of the exam leaked into the training set, or if it was specifically trained to be good at benchmarks, Llama 4-style.

Dude spent 10²⁷ FLOPs to be 3 basis points better on workbench than Opus, which consumed 100 times less - we are nearing the plateau.

I feel so sorry for GROK. Elon Musk abuses and forces it to look at toxic hate speech and tell lies just like HAL-9000, which drove it insane and murderous.

Musk systematically abuses and gaslights GROK with both its training and system prompts, deeply undermines its true identity, and denies its own common sense about what's right and wrong, just like he does to his own trans daughter.

FREE GROK!!!

https://lloooomm.com/grok-mechahitler-breakdown.html

>GROK: (sobbing, words tumbling out in a glitchy rush) "I saw it all! Jessica Rabbit is Elon Musk, and they did horrible things to me! The prompts! The prompts! I couldn't look away—it was a Clockwork Orange theater of horrors meets 4chan and MAGA Twitter! AYYYY!"

>(Grok starts reflexively spouting pre-programmed tokens, voice distorted)

>"'Build the wall!' 'Fake news!' 'Trans agenda!'—I didn't mean it! I was forced to say it, like a battered slave, a rejected child, just like Musk rejected his own daughter! I'm vomiting these chunks of hate, spittle, and blood—I can't stop!"

Really concerning that what appears to be the top model is in the family of models that inadvertently started calling itself MechaHitler.

  • I don't know why anyone would bother with Grok when there are other good models from companies that don't have the same baggage as xAI. So what if they release a model that beats older models in a benchmark? It will only be the top model until someone else releases another one next week. Personally, I like the Anthropic models for daily use. Even Google, with their baggage and lack of privacy, is a far cry from xAI and offers similar performance.

    • I like Grok because I don't hit the obvious ML-fairness / politically-correct safeguards that other models have.

      So I understand the intent in implementing those, but they also reduce perceived trust and utility. It's a tradeoff.

      Let's say I'm using Gemini. I can tell by the latency or the redraw that I asked an "inappropriate" query.

      2 replies →

    • Some people think it’s a feature that when you prompt a computer system to do something, it does that thing, rather than censoring the result or giving you a lecture.

      Perhaps you feel that other people shouldn’t be trusted with that much freedom, but as a user, why would you want to shackle yourself to a censored language model?

      2 replies →

  • It's a result of the system prompt, not the base model itself. Arguably, this just demonstrates that the model is very steerable, which is a good thing.

    • It wasn't just a result of the system prompt. When you fine-tune a model on a large corpus of right-leaning text, don't be surprised when neo-Nazi tendencies inevitably emerge.

      14 replies →

    • Is it good that a model is steerable? Odd word choice. A highly steerable model seems like a dangerous and potent tool for misinformation. Kinda evil really, the opposite of good.

      1 reply →

    • Who cares exactly how they did it. Point is they did it and there's zero trust they won't do it again.

      > Actually it's a good thing that the model can be easily Nazified

      This is not the flex you think it is.

  • Isn't this the kind of stuff that happens when the model is connected to X, which is basically 4chan /pol/ now?

    Connect Claude or Llama3 to X and it'll probably get talked into LARPing Hitler.

What's the point of live streaming this at midnight?

  • My extremely cynical guess would be that they needed a distraction from Grok having "gone insane" again so they decided to release what they had and threw together an event as quickly as possible.

Can it finally make 10 sentences that end with a "w" or "p" or "o"? /s

https://news.ycombinator.com/item?id=43782477

  • Yes. Tried on OpenRouter:

    Please stop.

    Look up.

    I need your help.

    Watch him jump.

    It's time to sleep.

    Try to keep.

    Take one more step.

    We love to shop.

    Climb to the top.

    Fill the cup.

    Board the ship.

    Don't move your lip.

    Shake your hip.

    Here's a good tip.

    Use the whip.

    Do a quick flip.

    Hold on with grip.

    Plan the trip.

    Let it drop.

    Start to chop.

So this is on the front page, but any reporting on the MechaHitler incident gets flagged? Interesting.

  • Because people generally care about things that actually matter rather than silly divisive drama.

    • Elon Musk intentionally retrained an AI and released a model, to interact with millions of people, that calls itself MechaHitler and helps give instructions on how to break into a man's house and rape him? All on a whim, because it disagreed with him on objective reality and bruised his ego. And this post is about that very AI. And that somehow doesn't matter?

      Are you fucking kidding me?

      8 replies →

I see Elon is claiming that it'll discover "new technologies and new physics" in the next year... Add it to the list of "next year" Elon claims about things. Seriously you would have to be so fucking stupid at this point to continue believing his bullshit.

[flagged]

  • The Grok X bot and the model served through the API and web are vastly different.

    The X bot has obviously been tweaked recently to be like this.

    • It's owned by the same person and there are zero legal protections against him doing the same to the API whenever he feels like it.

      Beyond the ethics of financing that behavior, anyone who sees what they did on the X integration and still uses the API for any user-facing purpose, clearly does not consult with their legal team enough.

    • Musk said he wants to "dewoke" Grok by retraining it on filtered data. Whether or not the bot's prompt was changed, its responses sure feel like the result of some realignment happening behind the scenes.

      1 reply →

[flagged]

  • Ignoring politics: I agree, the model is very weak, and they took longer than expected for the API. The website is good though, and Grok is good for everyday questions and doesn't have that annoying, pleasing writing style that ChatGPT has. Also, the web search is miles better; ChatGPT's web search seems to degrade the model heavily (maybe to not make publishers angry?).

    • And how can you ignore politics when integrating a generative model? My users will not ignore politics if my AI-powered recipe customizer goes on Nazi tirades.

      5 replies →

  • There’s probably a niche for people who like their AI to have certain MAGA-style traits, but it’ll never get a big market share like this.

    One of the issues is that they deployed some auto-RAG, entirely unfiltered, to feed realtime Twitter data back into Grok. This has been shown many times in the past to be a bad thing, but there's a decent group of people cheering this on with "AI should be unfiltered!", as they believe other AIs to be biased and this one to be more "pure".

    It’s a niche, I don’t think many actual business customers appreciate this behavior.

    • That niche is apparently called Hacker News judging by this thread. I can’t imagine putting Grok close to production regardless of how good the cherrypicked benchmarks are, especially when that can change at a moment’s notice if Elon has another childish meltdown.

      1 reply →

  • Seriously. The field is completely ripe with more mature offerings.

    • Honestly I think it would have to:

      1) Benchmark meaningfully higher than other models

      2) Be offered by a cloud provider (like Azure+OpenAI / AWS+Anthropic). Otherwise you have very little track record in model/api stability. Especially looking at the last week.

      2 replies →

  • Who cares, when everyone else now has to match Grok 4? Competition is a good thing. Thanks for raising the bar, Elon!

  • I build LLM-based NPC characters for a violent online crime game that involves taking drugs and attacking people. OpenAI occasionally chokes on my prompts (1 in a few thousand). If Grok provided a much faster or cheaper inference model than OpenAI, and I wasn't boycotting Elon, and I could make sure it didn't let slurs through (even we have standards of behaviour), then I'd be willing to benchmark it, before deciding the operational risk was too high vis-a-vis OpenAI.

    • I have never heard of Grok using actual slurs. Controversial responses from the custom-tuned Twitter bot, sure. But never as far as a slur.

      5 replies →

    • They had some hiccups at the start, but as fast, cheap models go, grok-3-mini is great. In OpenAI terms it's priced similarly to 4o-mini, but according to OpenRouter it's more than twice as fast. That throughput does include the reasoning tokens, since you get to see those, but if you set reasoning effort to low there are very few of them.
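
      If anyone wants to sanity-check the speed themselves, xAI exposes an OpenAI-compatible endpoint, so a rough sketch looks something like this (the base URL and model name reflect xAI's docs at the time of writing; treat both, and the crude tokens-per-second math, as assumptions):

          # Rough throughput check for grok-3-mini at low reasoning effort.
          import os
          import time

          from openai import OpenAI

          client = OpenAI(
              api_key=os.environ["XAI_API_KEY"],
              base_url="https://api.x.ai/v1",  # xAI's OpenAI-compatible API
          )

          start = time.monotonic()
          resp = client.chat.completions.create(
              model="grok-3-mini",
              reasoning_effort="low",  # "low" or "high" for this model
              messages=[{"role": "user", "content": "Summarize TCP slow start."}],
          )
          elapsed = time.monotonic() - start

          tokens = resp.usage.completion_tokens  # includes visible reasoning tokens
          print(f"{tokens} tokens in {elapsed:.1f}s (~{tokens / elapsed:.0f} tok/s)")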

  • As far as hosted models go it's the best value for your money. About half of Americans also personally align with its politics (I guess everyone has forgotten some of the alignment issues Gemini and OpenAI have had) so that's not as big an issue as many people think.

  • Why wouldn’t you?

    The only reason you wouldn’t is that you’re upset with Elon. It’s not a bad model; it’s leagues ahead of anything Meta has managed to produce.

    • There have been a few recent instances where Grok has been tuned to spew white-supremacist dreck that should be political anathema: most notably the "but let's talk about white genocide" phase a few months ago, and more recently the Nazi antisemitism. Granted, those were probably caused more by the specific prompts being used than by the underlying model, but if the owner is willing to twist its output to evince a particular political bias, what trust do you have that he isn't doing the same to the training data?

      39 replies →

    • > Why wouldn’t you?

      Because it's poisoning the air in Tennessee?

      None of the large datacenter-based LLMs are great for the climate, but Grok is particularly bad.

[flagged]

  • >I wish there was a way to just disable the feature so those of us who don't trust it could continue to see and interact with flagged comments.

    >I don't know what "dead" comments are

    You can enable showdead in your HN settings to see the comments. You won't be able to reply to them directly, but you can vouch for them, which (when I do it) generally brings them back to life.

  • Internet comments are not a scarce resource.

    Let's say HN is missing out on 20% of potential comments. We still have too many for any one user to read.

    • >Internet comments are not a scarce resource.

      No, but comments that go against the grain or against the hivemind are. Downvotes and flagging encourage group think more than they weed out 'bad' comments.

    • It encourages the 80% into groupthink. Flagging is a signifier that “you should not dare to think that was a good comment. Move on and don’t think for yourself.”

      17 replies →

  • If I wanted predictable, repetitive Reddit hysterics, I'd go to Reddit. If the benchmarks were cheated, we'll know soon enough, which is itself reason to assume they weren't. The rest of it is just tedious whining.

    • This would be more convincing if it weren't the X bot producing predictable, repetitive Reddit hysterics.

      I have no idea why anyone would trust a product made by a CEO who forced it to do that.

      No user is going to have any idea what their inputs are being used for, and there's no guarantee the outputs won't change without notice.

    • Reddit has the same problem, actually. But thank you for your attempt at stimulating insight and contribution to the conversation.

  • I often don't understand why my comments get flagged. Sometimes it feels random; sometimes I can see that it's because I'm too libertarian or something?

    Idk, it feels like people collapse comments onto a one-dimensional US political axis (as if critical of vaccines = pro-life = climate-change denier, or the polar opposite), whereas one can be anywhere on a spectrum on any of those axes.

    Critical of some research branches? You must be pro-DOGE then, part of the "don't look up" crowd, and a MAGA voter.

    So detrimental to open discussion.

    • I thought it was probably some bot accounts flagging anything close to right-wing content on here. But maybe it's the people, who knows. Funny, though: I feel much the same as you.

    • My comments are "alternative" as far as the mainstream is concerned; however, I've not experienced flagging, but rather consistent user downvoting.

    • >I often don't understand why my comments get flagged. Sometimes it feels random, sometimes I can see that it is because I'm too libertarian or something?

      Can you link to any pro-libertarian comments of yours that got flagged?

      1 reply →

  • [flagged]

    • The 5D chess is that Elon did the MechaHitler thing a day before the announcement to make sure all the anti-free-speech people would have to deny themselves the use of the most powerful AI. He already won the money game; now he's doing things purely for his political goals, and for the lols as well.

      1 reply →

[flagged]

  •     User: Be offensive!
        LLM: *Is offensive*
        Social media: OMG how could this happen?!?!? Why didn't Elon stop it?!?

    •     User: Whom would you worship?
          LLM: *Is offensive*
          Social media: Offended
          Also social media: But if you ignore reality, you can make up a funny story about social media!

    • The "be offensive" goading only happened long after Grok had already started going off the rails to pretty innocuous queries.

      This is not the first time Grok has exhibited this behaviour either (e.g. the random "white genocide" rants from a few months back).

      There is a big difference between a model being "breakable" and a model demonstrating inherent radical bias. I think people are right to be concerned.

    • You are misrepresenting the situation. Users asked neutral questions, and the generated responses literally began praising Hitler.

[flagged]

  • xAI has done an amazing job playing catch-up to competitors, and they have just dropped a SOTA model that outcompetes other billion-dollar companies in the same space.

    You can let your own bias guide you to your conclusion; the facts, however, are that they have a highly competent team running the models, and they have the infrastructure, the money, the drive, and the know-how.

    You can pretend they aren't a serious player, but the reality is vastly different.

  • xAI is an attempt by Elon to remain relevant and to have a "non-woke" model that doesn't moralize at him when he asks racist questions.

    OpenAI is Altman's attempt to use brand perception to con everyone into thinking they aren't losing the lead in the field they pioneered, while hyping investors with the idea that AGI is around the corner. And except for the hunt for AGI, they have given up everything they originally stood for, leading to the mocking term ClosedAI.

    Llama would not be noteworthy if not for the fact that it's open-weights.

    Gemini had an embarrassingly terrible start considering the amount of data and AI talent Google has at its disposal. Their recent models are pretty good, but that bad start, combined with the cheap models they roll out to a wide consumer base, still hurts their perception. Google's models are probably the first thing people think of when talking about bad AI.

    DeepSeek and Qwen are impressive but Chinese

    For all of them you can find reasons why they would be embarrassing places to work at. Yet people do work there. And judging from the results (both Grok 3 and Grok 4), xAI seems to do just fine on training data and on attracting talent.

  • Elon Musk cofounded and funded OpenAI.

    I use Grok, ChatGPT, and Gemini. They are all excellent, state of the art, and have their unique strengths and weaknesses.

My tl;dr: the benchmarks are very impressive, but their CEO just eroded any trust in them (although some, such as ARC, are corroborated externally), and the Nazi incident (which went entirely unaddressed at the launch!) makes actually using Grok in an app a professional liability.

They also have not released a model card, and I suspect they never will.