Comment by applfanboysbgon
9 hours ago
> Deep seek 3.2 is 4% on Arc-AGI 2
Why are you bringing up an outdated Chinese model from 6 months ago to compare to a US model from 6 months ago? The outdated Chinese model will have performance from ~12 months ago, obviously. But today's Chinese model, DeepSeek V4, performs not far off the US model from 6 months ago: 46% compared to 52% for 5.2.
Because DeepSeek 4.0 is not yet there, but the jump isn't expected to be large. Kimi K2.5 is there and is also scoring low.
DeepSeek V4 came out three weeks ago: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
Kimi K2.5 was also superseded three weeks ago by a further-tuned Kimi K2.6. Moonshot's Kimi models, not DeepSeek V4, appear to be the favored Chinese models, at least for coding. Z.ai's GLM 5.1, also released in April, is worth mentioning as rather competent for coding too.
Those models will not be beating US AI labs by your metrics either (although for coding, Kimi K2.6 might beat the very uneven Gemini depending on the situation), but in your criticism, at least consider the state of the art in your comparisons.
I have been using DeepSeek V4 Pro for personal projects and home-infra-related work for the last couple of weeks. Its quality of work is not bad at all, it is fairly fast, and at a fraction of the cost of Claude I can keep going, which makes it a very compelling option. Looking forward to trying out Kimi K2.6, thanks for the recommendation.
Also they have a pretty big token discount running this month: https://api-docs.deepseek.com/quick_start/pricing/
Even without the discount, I'll have to think about whether I need the 100 EUR tier of Anthropic Max, or whether downgrading to Pro and using DeepSeek is good enough. And they're also up on OpenRouter and other places.
Been using those models. They're not quite comparable with Opus 4.6/4.7, but with max reasoning they're pretty good for a variety of dev tasks! The only big problem is that they can't process images, so I can't really do browser use for semi-automated testing; I'd have to write Playwright tests even when I don't want to.
Hmm, I've been using it [0] with my Ollama Cloud subscription for the last two weeks and I love it. I've never reached the 5-hour usage limit of the $20 plan (on side projects), whereas with Opus I would sometimes reach it in ONE prompt.
[0]: https://ollama.com/library/deepseek-v4-pro