Comment by simonw

6 days ago

"they can't be aware of the latest changes in the frameworks I use, and so force me to use older features, sometimes less efficient"

That's mostly solved by the most recent ones that can run searches. I've had great results from o4-mini for this, since it can search for the latest updates - example here: https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

Or for a lot of libraries you can dump the ENTIRE latest version into the prompt - I do this a lot with the Google Gemini 2.5 models since those can handle up to 1m tokens of input.
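A minimal sketch of what "dumping the entire library into the prompt" can look like, assuming the library is checked out locally. The function name `build_prompt`, the file suffixes, and the 4-characters-per-token estimate are all illustrative assumptions, not any model's real tokenizer or a tool mentioned above.

```python
from pathlib import Path


def build_prompt(root, suffixes=(".py", ".md"), chars_per_token=4.0):
    """Concatenate matching files under `root` into one prompt string.

    Returns (prompt, estimated_tokens). The token figure is a rough
    heuristic (assumed ~4 characters per token), not a tokenizer count.
    """
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            # Label each file so the model can tell where one ends.
            label = path.relative_to(root).as_posix()
            parts.append(f"--- {label} ---\n{text}")
    prompt = "\n\n".join(parts)
    return prompt, int(len(prompt) / chars_per_token)
```

Checking the estimate against the model's advertised input limit before sending is a cheap way to find out whether the whole library will fit at all.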

"they fail at doing clean DRY practices" - tell them to DRY in your prompt.

"they bait me into inexisting apis, or hallucinate solutions or issues" - really not an issue if you're actually testing your code! I wrote about that one here: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/ - and if you're using one of the systems that runs your code for you (as promoted in tptacek's post) it will spot and fix these without you even needing to intervene.
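As a small illustration of why running the code catches this: here is a sketch in which a made-up model suggestion calls `json.parse`, which does not exist in Python's standard `json` module (the real call is `json.loads`). The function names are hypothetical; the point is only that the first execution fails loudly.

```python
import json


def parse_config_hallucinated(text):
    # Hypothetical model output: json.parse is not a real function
    # in Python's json module, so this raises AttributeError when run.
    return json.parse(text)


def parse_config(text):
    # The real standard-library call.
    return json.loads(text)


def runs_without_missing_api(func, *args):
    """Return False if calling func hits a nonexistent attribute."""
    try:
        func(*args)
    except AttributeError:
        return False
    return True
```

Any test that exercises `parse_config_hallucinated` fails on the first call, which is exactly the signal a code-running agent uses to correct itself.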

"they cannot properly pick the context and the files to read in a mid-size app" - try Claude Code. It has a whole mechanism dedicated to doing just that, I reverse-engineered it this morning: https://simonwillison.net/2025/Jun/2/claude-trace/

"they suggest to download some random packages, sometimes low quality ones, or unmaintained ones" - yes, they absolutely do that. You need to maintain editorial control over what dependencies you add.

Thanks for the links. You mentioned 2 models in your posts, how should I proceed? I can't possibly pay for 2 subscriptions. Do you have a suggestion for the better one to use?

  • If you're only going to pay one $20/month subscription I think OpenAI wins at the moment - their search tools are better and their voice chat interface is better too.

    I personally prefer the Claude models but they don't offer quite as rich a set of extra features.

    If you want to save money, consider getting API accounts with them and spending money that way. My combined API bill across OpenAI, Anthropic and Gemini rarely comes to more than about $10/month.

> Or for a lot of libraries you can dump the ENTIRE latest version into the prompt - I do this a lot with the Google Gemini 2.5 models since those can handle up to 1m tokens of input.

See, as someone who is actually receptive to the argument you are making, sometimes you tip your hand and say things that I know are not true. I work with Gemini 2.5 a lot, and while yeah, it theoretically has a large context window, it falls over pretty fast once you get past 2-3 pages of real-world context.

> "they fail at doing clean DRY practices" - tell them to DRY in your prompt.

Likewise here. Simply telling a model to be concise has some effect, to be sure, but it's not a panacea. I tell the latest models to do all sorts of obvious things, only to have them turn around and ignore me completely.

In short, you're exaggerating. I'm not sure why.

  • I stand by both things I said. I've found that dumping large volumes of code into the Gemini 2.5 models works extremely well. They also score very highly on the various needle in a haystack benchmarks.

    This wasn't true of the earlier Gemini large context models.

    And for DRY: sure, maybe it's not quite as easy as "do DRY". My longer answer is that these things are always a conversation: if it outputs code that you don't like, reply and tell it how to fix it.

    • Yeah, I'm aware of the benchmarks. Thomas (author of TFA) is also using Gemini 2.5, and his comments are much closer to what I experience:

      > For the last month or so, Gemini 2.5 has been my go-to (because it can hold 50-70kloc in its context window). Almost nothing it spits out for me merges without edits.

      I realize this isn't the same thing you're claiming, but it's been consistently true for me that the model hallucinates stuff in my own code, which shouldn't be possible, given the context window and the size of the code I'm giving to it.

      (I'm also using it for other, harder problems, unrelated to code, and I can tell you factually that the practical context window is much smaller than 2M tokens. Also, of course, a "token" is not a word -- it's more like 1/3 of a word.)
