It was accidentally pushed a little early, and has since been taken down.
Here's the archived pdf: https://web.archive.org/web/20251118111103/https://storage.g...
Interesting to see on page 2 the reference to ML Pathways [1]. It looks like a multi-layer mixture of experts. Is this common?
[1] https://blog.google/technology/ai/introducing-pathways-next-...
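Pathways, as described in that post, is about routing computation through a sparse network, which is related to (but broader than) mixture-of-experts. For intuition, here's a toy sketch of a sparse MoE layer under the standard formulation: a gating network scores the experts, only the top-k run, and their outputs are combined weighted by the renormalized gate scores. All sizes and names below are illustrative, not from the report:

```python
import math
import random

random.seed(0)

DIM = 4          # feature dimension (toy size)
NUM_EXPERTS = 8  # experts in the layer
TOP_K = 2        # experts activated per token (sparse routing)

def linear(x, w):
    """Plain matrix-vector product: w is a list of weight rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Random weights: one gating matrix, plus one weight matrix per expert.
gate_w = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]

def moe_layer(x):
    # Gate scores decide which experts handle this input.
    scores = softmax(linear(x, gate_w))
    # Only the k highest-scoring experts run; the rest stay idle,
    # which is what keeps compute sublinear in total parameter count.
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i])[-TOP_K:]
    norm = sum(scores[i] for i in top)
    out = [0.0] * DIM
    for i in top:
        y = linear(x, experts[i])
        out = [o + (scores[i] / norm) * yi for o, yi in zip(out, y)]
    return out

y = moe_layer([1.0, -0.5, 0.25, 2.0])
print(len(y))  # → 4: output has the same dimension as the input
```

The point of the sketch is just the routing: parameter count grows with the number of experts, but per-token compute grows only with TOP_K.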
It says it's been trained from scratch. I wonder if it will have the same indescribable magic that makes me spend an hour every day with 2.5. I really love the results I can get with 2.5 Pro. Google eventually limiting AI Studio will be a sad day.
Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.
Buy a Pixel and you get it basically unlimited for free for a year ;)
They scored a 31.1% on ARC AGI 2 which puts them in first place.
Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.
My impression is that Grok is very rarely used in practice outside of a niche of die-hard users, partly because of very different tuning to other models, and partly the related public reputation around it.
https://firstpagesage.com/reports/top-generative-ai-chatbots... suggests 0.6% of chat use cases, well below the other big names, and I suspect those stats for chat are higher than other scenarios like business usage. Given all that, I can see how Gemini might not be focused on competing with them.
Well, there are three kinds of usage for Grok:
- Using Grok inside X/Twitter: most people interact with Grok this way.
- Using Grok on its website: this is really annoying, as you get delayed by Cloudflare every time you access the site. Since Grok does not provide a serious advantage over other services, why bother?
- You can also use the app, but it is not as convenient as other services.
It is understandable that Grok is not popular.
I don’t know anyone who uses Grok, but in my peer group everyone uses 1-2 paid services like Gemini or Claude or ChatGPT. They’re probably not as “extremely online” as I am, so I can’t generalize this thought, but anecdotally my impression has been that Grok is just very “right wing influencer” coded.
Grok seems extremely prone to hallucination in my experience. It also constantly asserts certainty on fuzzy topics.
About ARC 2:
I would want to hear more detail about prompts, frameworks, thinking time, etc., but they don't matter too much. The main caveat would be that this is probably on the public test set, so could be in pretraining, and there could even be some ARC-focussed post-training - I think we don't know yet and might never know.
But for any reasonable setup, if no egregious cheating, that is an amazing score on ARC 2.
Update: it is available at https://aistudio.google.com now!
As for the veracity of the link itself: https://storage.googleapis.com/deepmind-media/* has been used by DeepMind itself (e.g. the "View tech report" link on https://deepmind.google/models/gemini/), so it is a genuine leak.
Gone now; the Wayback Machine still has it: https://web.archive.org/web/20251118111103/https://storage.g...
Good benchmark stats, except for coding, where it looks similar to other SOTA models.
The benchmarks suggest a resounding win for Gemini 3 Pro as the top model.
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
Well, don't complain when you use Gmail and your emails are used to train Gemini.
It says "pursuant to user controls, where appropriate". We can now sleep peacefully with the knowledge that Google will give us the tools to disable this where it's not inappropriate.
So that's why Google is getting sued for Gemini being enabled by default in Gmail and analyzing emails and our data; completely going against whatever privacy policy they came up with. [0]
I don't expect them to follow their own privacy policies.
[0] https://www.yahoo.com/news/articles/google-sued-over-gemini-...