Comment by anotherpaulg

12 days ago

Llama 4 Maverick scored 16% on the aider polyglot coding benchmark [0].

  73% Gemini 2.5 Pro (SOTA)
  60% Sonnet 3.7 (no thinking)
  55% DeepSeek V3 0324
  22% Qwen Max
  16% Qwen2.5-Coder-32B-Instruct
  16% Llama 4 Maverick

[0] https://aider.chat/docs/leaderboards/?highlight=Maverick

Did they not target code tasks for this LLM, or is it genuinely that bad? Pretty embarrassing when your shiny new 400B model barely ties a 32B model designed to be run locally. Or maybe this is a strong indication that smaller, specialized LLMs have much more potential for specific tasks than larger, general-purpose LLMs.

Side note: the `highlight` query param doesn't seem to have any effect on that table (at least for me, on Firefox).
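
A param like that is usually handled with a few lines of client-side JS. Here's a hypothetical TypeScript sketch of how such a `highlight` param is typically wired up — not aider's actual code, and the table markup (rows containing the model name as visible text) is assumed:

  // Read the ?highlight=... value from the page URL, e.g. "Maverick".
  const params = new URLSearchParams(window.location.search);
  const target = params.get("highlight");

  if (target) {
    const needle = target.toLowerCase();
    // Assumed layout: one <tr> per model, model name in the row text.
    document
      .querySelectorAll<HTMLTableRowElement>("table tbody tr")
      .forEach((row) => {
        if (row.textContent?.toLowerCase().includes(needle)) {
          row.style.backgroundColor = "#fff3cd"; // simple highlight color
        }
      });
  }

If the real page builds the table after this script runs (or renders it into a different element), a snippet like this would silently match nothing — which would look exactly like the param "having no effect".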