← Back to context

Comment by bluegatty

6 hours ago

Yes, this is very true and it speaks strongly to this wayward notion of 'models' - it depends so much on the tuning, the harness, the tools.

I think it speaks to the broader notion of AGI as well.

Claude is definitively trained on the process of coding not just the code, that much is clear.

Codex has the same limitation but not quite as bad.

This may be a result of Anthropic using 'user cues' with respect to what are good completions and not, and feeding that into the tuning, among other things.

Anthropic is winning coding and related tasks because they're focused on that, Google is probably oriented towards a more general solution, and so, it's stuck in 'jack of all trades master of none' mode.

Google are stuck because they have to compete with OpenAI. If they don’t, they face an existential threat to their advertising business.

But then they leave the door open for Anthropic on coding, enterprise and agentic workflows. Sensibly, that’s what they seem to be doing.

That said Gemini is noticeably worse than ChatGPT (it’s quite erratic) and Anthropic’s work on coding / reasoning seems to be filtering back to its chatbot.

So right now it feels like Anthropic is doing great, OpenAI is slowing but has significant mindshare, and Google are in there competing but their game plan seems a bit of a mess.

  • Google might be a mess now, but they have time. OpenAI and Anthropic are on barrowed time, Google has a built in money printer. They just need to outlast the others.

    • Plus they started making AI processors 11 years ago and invented the math behind “GPTs” 9 years ago. Gemini is way cheaper to run for them than it does for everyone else.

      I think Gemini is really built for their biggest market — Google Search. You ask questions and get answers.

      I’m sure they’ll figure out agentic flows. Google is always a mess when it comes to product. Don’t forget the Google chat sagas where it seems as if different parts of the company were making the same product.

      2 replies →

    • They have much much less time than one would think. Their ads business is about to go into freefall, this will cause the whole company to spiral.

      1 reply →

  • In my experience Gemini 3.0 pro is noticeably better than chatgpt 5.2 for non-coding tasks. The latter gives me blatantly wrong information all the time, the former very rarely.

    • Strange that you say that because the general consensus (and my experience) seems to be the opposite, as well as the AA-Omniscience Hallucination Rate Benchmark which puts 3.0 Pro among the higher hallucinating models. 3.1 seems to be a noticeable improvement though.

      1 reply →

  • Yup, you got it. It's a weird situation for sure.

    You know what's also weird: Gem3 'Pro' is pretty dumb.

    OAI has 'thinking levels' which work pretty well, it's nice to have the 'super duper' button - but also - they have the 'Pro' product which is another model altogether and thinks for 20 min. It's different than 'Research'.

    OAI Pro (+ maybe Spark) is the only reason I have OAI sub. Neither Anthropic nor Google seem to want to try to compete.

    I feel for the head of Google AI, they're probably pulled in major different directions all the time ...

    • If you want that level of research I suggest you ask the model to draft a markdown plan with "[ ]" gates for todo items, and plan it in as many steps as needed. Then ask another LLM to review the plan, judge it. In the end use the plan as the execution state tracker, the model solves one by one the checkboxes.

      Using this method I could recreate "deep research" mode on a private collection of documents in a few minutes. A markdown file can be like a script or playbook, just use checkboxes for progress. This works for models that have file storage and edit tools, which is most, starting with any coding agent.

      1 reply →

  • Google is scoring one own goal after another by making people working with their own data wonder how much of that data is sent off to be used to train their AI on. Without proof to the contrary I'm going to go with 'everything'.

    They should have made all of this opt-in instead of force-feeding it to their audience, which they wrongly believe to be captive.

> Claude is definitively trained on the process of coding not just the code

This definitely feels like it.

It's hard to really judge, but Gemini feels like it might actually write better code, but the _process_ is so bad that it doesn't matter. At first I thought it was bad integration by the GitHub Copilot, but I see it elsewhere now.

  • I don’t think Gemini writes better code, not 3.0 at least.

    Maybe with good prompt engineering it does? admittedly I never tried to tell it to not hard code stuff and it just was really messy generally. Whereas Claude somehow can maintain perfect clarity to its code and neatness and readability out of the box.

    Claude’s code really is much easier to understand and immediately orient around. It’s great. It’s how I would write it for myself. Gemini while it may work is just a total mess I don’t want to have in my codebase at all and hate to let it generate my files even if it sometimes finds solutions to problems Claude doesn’t, what’s the use of it if it is unreadable and hard to maintain.

Tell me more about Codex. I'm trying to understand it better.

I have a pretty crude mental model for this stuff but Opus feels more like a guy to me, while Codex feels like a machine.

I think that's partly the personality and tone, but I think it goes deeper than that.

(Or maybe the language and tone shapes the behavior, because of how LLMs work? It sounds ridiculous but I told Claude to believe in itself and suddenly it was able to solve problems it wouldn't even attempt before...)

  • > Opus feels more like a guy to me, while Codex feels like a machine

    I use one to code and the other to review. Every few days I switch who does what. I like that they are different it makes me feel like I'm getting different perspectives.

  • Your intuition is exactly correct - it's not just 'tone' it's 'deeper than that'.

    Codex is a 'poor communicator' - which matters surprisingly a lot in these things. It's overly verbose, it often misses the point - but - it is slightly stronger in some areas.

    Also - Codex now has 'Spark' which is on Cerebras, it's wildly fast - and this absolutely changes 'workflow' fundamentally.

    With 'wait-thinking' - you an have 3-5 AIs going, because it takes time to process but with Cerebras-backed models ... maybe 1 or 2.

    Basically - you're the 'slowpoke' doing the thinking now. The 'human is the limiting factor'. It's a weird feeling!

    Codex has a more adept 'rollover' on it's context window it sort of magically does context - this is hard to compare to Claude because you don't see the rollover points as well. With Claude, it's problematic ... and helpful to 'reset' some things after a compact, but with Codex ... you just keep surfing and 'forget about the rollover'.

    This is all very qualitative, you just have to try it. Spark is only on the Pro ($200/mo) version, but it's worth it for any professional use. Just try it.

    In my workflow - Claude Code is my 'primary worker' - I keep Codex for secondary tasks, second opinions - it's excellent for 'absorbing a whole project fast and trying to resolve an issue'.

    Finally - there is a 'secret' way to use Gemini. You can use gemeni cli, and then in 'models/' there is a way to pick custom models. In order to make Gem3 Pr avail, there is some other thing you have to switch (just ask the AI), and then you can get at Gem3 Pro.

    You will very quickly find what the poster here is talking about: it's a great model, but it's a 'Wild Stallion' on the harness. It's worth trying though. Also note it's much faster than Claude as well.

    • Spark is fun and cool, but it isn't some revolution. It's a different workflow, but not suitable for everything that you're use GPT5.2 for with thinking set to high, for example, it's way more dumb and makes more mistakes, while 5.2 will carefully thread through a large codebase and spend 40 minutes just to validate the change actually didn't break anything, as long as you provide prompts for it.

      Spark on the other hand is a bit faster at reaching a point when it says "Done!", even when there is lots more it could do. The context size is also very limiting, you need to really divide and conquer your tasks, otherwise it'll gather files and context, then start editing one file, trigger the automatic context compaction, then forget what it was doing and begin again, repeating tons of time and essentially making you wait 20 minutes for the change anyways.

      Personally I keep codex GPT5.2 as the everyday model, because most of the stuff I do I only want to do once, and I want it to 100% follow my prompt to the letter. I've played around a bunch with spark this week, and been fun as it's way faster, but also completely different way of working, more hands-on, and still not as good as even the gpt-codex models. Personally I wouldn't get ChatGPT Pro only for Spark (but I would get it for the Pro mode in ChatGPT, doesn't seem to get better than that).

      3 replies →

    • Agree with this except that spark is good or worth it. Absolutely not for $200, it's a step or two below opus 4.6 for actual reasoning.

> Claude is definitively trained on the process of coding not just the code, that much is clear.

Nuance like this is why I don’t trust quantitative benchmarks.