Comment by fragmede

5 days ago

What are some of the models people are using? (Rather than naming the ones they aren't.)

GLM 4.7 is new and promising. MiniMax 2.1 is good for agents. And of course there's the Qwen3 family; the VL versions are spectacular. NVIDIA Nemotron Nano 3 excels at long context, and the Unsloth variant has been extended to 1M tokens.

I thought the last one was a toy until I tried it with a full 1.2 MB Repomix project dump. It actually works quite well for general code comprehension across the whole codebase, CI scripts included.
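For anyone wanting to try the same thing, a minimal sketch (Repomix's -o flag is in its README; the output filename is just my choice):

    # Pack the whole repository into a single text file for the model's context
    npx repomix -o repo-dump.txt

Then you paste or attach repo-dump.txt as context for the model.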

GPT-OSS 120B is good too, although I've yet to try it out for coding specifically.

  • Since I'm just a pleb with a 5090, I run GPT-OSS 20B a lot; it fits comfortably in VRAM with max context size. I find it quite decent for a lot of things, especially after setting reasoning effort to high, disabling top-k and top-p, and setting min-p to something like 0.05.
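    In llama-server terms that's roughly the sketch below. The model filename and context size are my guesses (131072 is the model's advertised maximum), top-k 0 and top-p 1.0 disable those two samplers, and passing reasoning effort through --chat-template-kwargs may need a recent llama.cpp build:

        # top-k 0 and top-p 1.0 turn those samplers off; min-p stays on
        llama-server \
          -m gpt-oss-20b.gguf \
          -c 131072 \
          --top-k 0 \
          --top-p 1.0 \
          --min-p 0.05 \
          --chat-template-kwargs '{"reasoning_effort": "high"}'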

    For Qwen3-VL, I recently read that someone got significantly better results by using an F16 or even F32 version of the vision model part while keeping the text model part at Q4 or similar. In llama.cpp you can specify these separately[1]. Since the vision part is usually quite small by comparison, this isn't as rough as it sounds. I haven't had a chance to test it yet, though.
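    If I were to try it, I'd expect it to look something like this (untested; --mmproj is the flag from [1], and the GGUF filenames here are hypothetical):

        # Q4 quant for the text model, F16 for the vision projector
        llama-server \
          -m Qwen3-VL-8B-Instruct-Q4_K_M.gguf \
          --mmproj mmproj-Qwen3-VL-8B-Instruct-F16.gguf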

    [1]: https://github.com/ggml-org/llama.cpp/blob/master/tools/serv... (using --mmproj AFAIK)

  • Does GLM 4.7 run well on the Spark? I thought I read it didn't, but it wasn't clear.