Comment by pattt

17 hours ago

Do we have any solid evidence these models can outperform Western models in terms of quality? Or is it more: because they are forbidden, they can't get enough training data, visibility etc. to compete?

Scroll down to the leaderboard - https://arcprize.org/leaderboard

Spoiler alert - they are all towards the bottom of the leaderboard. People come up with a wide variety of excuses for why they are not used despite being offered at significantly lower cost, but the answer is simply that they don't perform well enough for now.

  • DeepSeek V4 isn't even on there.

    I'd rather trust the LLM arena leaderboard, which puts it on par with Sonnet.

    • LM Arena uses human side-by-side voting, which limits its applicability to complex tasks.

      The ARCPrize leaderboard does have DeepSeek V3.2, which only scored 4% on ARC-AGI 2 (while the top models score over 80%). It also has Kimi and Qwen, but they didn't perform well either.