Comment by nl

1 month ago

I think it's useful to be realistic about what you can do with a local model, especially something as small as the 9B the author is using. A 9B model is around the level of Sonnet 3.6 - it can do autocomplete and small functions but it loses track trying to understand large problems.

But they are interesting and fun to play with! I do a LOT of work on local agent harnesses etc, mostly for fun.

My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!

This is under heavy development and needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium-based browsers can be made to work with it).

It's different to most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and Filesystem API gives it optional sandbox access to a directory to read from.

It is self documenting - you can ask questions like "How is the system prompt used" in the live help pane and it has access to its own source code.

There's quite a lot there: press "Tour" to see it all.

Will be open source next week.

But I was doing a lot more than autocomplete and small functions with Sonnet 3.5.

  • I agree, earlier Sonnet wasn't that great, but Sonnet 3.5 is where things really came together. The difference was night-and-day. Sonnet 3.7, 4.0, 4.5, etc... didn't feel like as drastic a change to me.

    • I remember even after 3.7 was released I kept using 3.5 in Cursor because it just did exactly what I wanted

Not to be nitpicky, but many of the 4-12B models are somewhere between GPT-3.5 and GPT-4o-mini. It's hard to find a good comparison though, because the benchmarks people score models against change so often. For reference, Sonnet 3.6 came out about a year after GPT-3.5.

  • Don't worry about being nitpicky! I'm going to out-nitpick you....

    Actually....

    I write and publish my own benchmark for this stuff. It's an agentic SQL benchmark which isn't in the training data yet and I've found can separate frontier models from close-followers (the only models to get 100% are Opus 4.6 and GPT 5.5).

    The best small model I've found is a fine-tune of Qwen 3.5 9B which scores 18/25: https://sql-benchmark.nicklothian.com/?highlight=Jackrong_Qw...

    Haiku 4.5 scores 20/25, and Haiku is certainly better than Sonnet 3.6. GPT 3.5 scores 13/25.

    • Neat! It seems like Qwen 9b took the same amount of time as gemma4-e4b too, which is interesting. I haven't been able to get Qwen to stop thinking so much

[flagged]

  • I think knowledge is power.

    I think that the more people who try local models (especially the larger ones) the better.

    I sometimes get the impression that many people claiming that local models are as good as frontier models work in "token poor" environments. If you can't build large-scale programs using at least Opus 4.5+ then it's difficult to compare. They compare something like Qwen 27B with Sonnet and see that it is nearly as good, but miss that the frontier models are a lot better.

    That knowledge is power, too.

    I personally can help make local models more accessible. I can't make Opus cheaper.

    • > I sometimes get the impression that many people claiming that local models are as good as frontier models work in "token poor" environments. If you can't build large-scale programs using at least Opus 4.5+ then it's difficult to compare.

      I sometimes get the impression that people posting comments on HN don't realize that LLMs do more than vibe coding.
