Comment by b89kim

6 days ago

I’ve been benchmarking GGUF quants for Python tasks across several hardware configurations:

  - 4090: 27b-q4_k_m
  - A100: 27b-q6_k
  - 3×A100: 122b-a10b-q6_k_L

Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel like a significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition on MuJoCo code with the default presence penalty. The 27b-q4_k_m on a 4090 generates 30–35 tok/s at good quality.
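For anyone unfamiliar with the knob I'm blaming here: presence penalty subtracts a fixed amount from the logit of every token that has already appeared, which discourages (but doesn't prevent) repetition. A minimal sketch of the standard definition (the function name is mine, not from any particular runtime):

```python
def apply_presence_penalty(logits, generated_ids, penalty):
    """Subtract `penalty` from the logit of every token id that has
    already appeared in the generated sequence. Presence penalty is
    flat: it does not scale with how often a token appeared (that
    would be frequency penalty)."""
    adjusted = list(logits)
    for tok in set(generated_ids):
        adjusted[tok] -= penalty
    return adjusted

# Tokens 0 and 2 already appeared, so their logits drop by 1.5;
# token 2 is penalized once despite appearing twice.
logits = [2.0, 1.0, 3.0, 0.5]
print(apply_presence_penalty(logits, [0, 2, 2], 1.5))
```

Setting it too low lets a model loop on boilerplate; too high and it starts avoiding identifiers it legitimately needs to repeat, which is easy to hit in code generation.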

That's quite a specific task for local models like these, though (I mean MuJoCo), so it might be underrepresented in the training data or RL. I'm not sure whether we'll see a significant leap in this direction in the next 0.5–2 years, although it's still possible.

  • I’ve been testing these on other tasks: IK, Kalman filters, and UI/DB boilerplate. Qwen3.5 is multimodal and specialized for JS/webdev and agentic coding, so it's not surprising that MoE models have limitations in specific areas. I understand most LLMs have limited ability in mathematical/physical reasoning, and I don't think these tasks represent general performance; I'm just sharing personal experience for those curious.
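    To give a concrete sense of the kind of task I mean, this is roughly the smallest Kalman-filter prompt I use (my own reference sketch of a scalar filter for a constant-value model, not any model's output):

    ```python
    def kalman_1d(measurements, q=1e-3, r=0.1, x0=0.0, p0=1.0):
        """Scalar Kalman filter tracking a constant value.
        q: process noise variance, r: measurement noise variance,
        x0/p0: initial state estimate and its variance."""
        x, p = x0, p0
        estimates = []
        for z in measurements:
            # Predict: the state model is constant, so only the
            # uncertainty grows, by the process noise q.
            p = p + q
            # Update: blend prediction and measurement by the Kalman gain.
            k = p / (p + r)
            x = x + k * (z - x)
            p = (1.0 - k) * p
            estimates.append(x)
        return estimates

    # Noisy readings of a constant ~1.0 should converge toward 1.0.
    print(kalman_1d([0.9, 1.1, 1.0, 0.95, 1.05])[-1])
    ```

    Models usually recite the predict/update equations fine; where they stumble is exactly this application step, e.g. picking sane q/r or keeping the variance update consistent.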

    • For me, the main issue with all the recent "advancements" in LLMs is their lack of ability to generalize and extrapolate existing knowledge; they can be quite weak when it comes to complex associations. Many LLMs can demonstrate sufficient theoretical knowledge in maths and physics (by "sufficient" I mean at least postgraduate level), yet they often fail to apply that knowledge in fields closer to real life. At least, that's what I've seen in my experience: they're fine with theory, but once it comes to application, it all falls apart, even in their main "specializations" such as web development and other software-related tasks. And that's kind of disappointing, and it even makes me a bit sad. We have a powerful tool, but we can't use its true potential, either because we're using it the wrong way or because its architecture cannot support it.