← Back to context

Comment by nthypes

10 hours ago

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...

Model was released and it's amazing. Frontier level (better than Opus 4.6) at a fraction of the cost.

I don't think we need to compare models to Opus anymore. Opus users don't care about other models, as they're convinced Opus will be better forever. And non-Opus users don't want the expense, lock-in or limits.

As a non-Opus user, I'll continue to use the cheapest fastest models that get my job done, which (for me anyway) is still MiniMax M2.5. I occasionally try a newer, more expensive model, and I get the same results. I have a feeling we might all be getting swindled by the whole AI industry with benchmarks that just make it look like everything's improving.

  • Which model's best depends on how you use it. There's a huge difference in behaviour between Claude and GPT and other models which makes some poor substitutes for others in certain use cases. I think the GPT models are a bad substitute for Claude ones for tasks such as pair-programming (where you want to see the CoT and have immediate responses) and writing code that you actually want to read and edit yourself, as opposed to just letting GPT run in the background to produce working code that you won't inspect. Yes, GPT 5.4 is cheap and brilliant but very black-box and often very slow IME. GPT-5.4 still seems to behave the same as 5.1, which includes problems like: doesn't show useful thoughts, can think for half an hour, says "Preparing the patch now" then thinks for another 20 min, gives no impression of what it's doing, reads microscopic parts of source files and misses context, will do anything to pass the tests including patching libraries...

  • Agree with your assessment, I think after models reached around Opus 4.5 level, its been almost indistinguishable for most tasks. Intelligence has been commoditized, what's important now is the workflows, prompting, and context management. And that is unique to each model.

    • Same for me. There are tasks when I want the smartest model. But for a whole lot of tasks I now default to Sonnet, or go with cheaper models like GLM, Kimi, Qwen. DeepSeek hasn't been in the mix for a while because their previous model had started lagging, but will definitely test this one again.

      The tricky part is that the "number of tokens to good result" does absolutely vary, and you need a decent harness to make it work without too much manual intervention, so figuring out which model is most cost-effective for which tasks is becoming increasingly hard, but several are cost-effective enough.

    • This is not true for some cases e.g. there are stark differences in the correctness of answers in certain type of case work.

  • I found Opus 4.7 to be actually worse than Opus 4.6 for my use case

    Substantially worse at following instructions and overoptimized for maximizing token usage

  • This resonates with me a lot.

    I do some stuff with gemini flash and Aider, but mostly because I want to avoid locking myself into a walled garden of models, UIs and company

  • What do you run these on? I've gotten comfortable with Claude but if folks are getting Opus performance for cheaper I'll switch.

    • You can just use Claude Code with a few env vars, most of these providers offer an Anthropic compatible API

    • Try Charm Crush first, it's a native binary. If it's unbearable, try opencode, just with the knowledge your system will probably be pwned soon since it's JS + NPM + vibe coding + some of the most insufferable devs in the industry behind that product.

      If you're feeling frisky, Zed has a decent agent harness and a very good editor.

      1 reply →

  • actually this is not the reason - the harness is significantly better. There is no comparable harness to Claude Code with skills, etc.

    Opencode was getting there, but it seems the founders lost interest. Pi could be it, but its very focused on OpenClaw. Even Codex cli doesnt have all of it.

    which harness works well with Deepseek v4 ?

    • What's the issue with OC? I tried it a bit over 2 months ago, when I was still on Claude API, and it actually liked more that CC (i.e. the right sidebar with the plan and a tendency at asking less "security" questions that CC). Why is it so bad nowadays?

  • eh idk. until yesterday opus was the one that got spatial reasoning right (had to do some head pose stuff, neither glm 5.1 nor codex 5.3 could "get" it) and codex 5.3 was my champion at making UX work.

    So while I agree mixed model is the way to go, opus is still my workhorse.

  • I don’t know what people are doing but Minimax produced 16 bugreports which of 15 was false positives (literally a mistake).

    In contrast ChatGPT 5.3 and also Opus has a 90% rate at least on this same project. (Embedded)

    All other tests were the same. What are you doing with these models?

How does it compare to Opus 4.7? I've been immersed in 4.7 all week participating in the Anthropic Opus 4.7 hackathon and it's pretty impressive even if it's ravenous from a token perspective compared to 4.6

  • The thing is, it doesnt need to beat 4.7. it just needs to do somewhat well against it.

    This is free... as in you can download it, run it on your systems and finetune it to be the way you want it to be.

    • > you can download it, run it on your systems

      In theory, sure, but as other have pointed out you need to spend half a million on GPUs just to get enough VRAM to fit a single instance of the model. And you’d better make sure your use case makes full 24/7 use of all that rapidly-depreciating hardware you just spent all your money on, otherwise your actual cost per token will be much higher than you think.

      In practice you will get better value from just buying tokens from a third party whose business is hosting open weight models as efficiently as possible and who make full use of their hardware. Even with the small margin they charge on top you will still come out ahead.

      4 replies →

    • Completely agree, not suggesting it needs ot just genuinely curious. Love that it can be run locally though. Open source LLMs punching back pretty hard against proprietary ones in the cloud lately in terms of performance.

  • Tbh I was more productive with 4.6 than ever before and if AI progress locks in permanently at 4.6 tier, I’d be pretty happy

  • It is more than good enough and has effectively caught up with Opus 4.6 and GPT 5.4 according to the benchmarks.

    It's about 2 months behind GPT 5.5 and Opus 4.7.

    As long as it is cheap to run for the hosting providers and it is frontier level, it is a very competitive model and impressive against the others. I give it 2 years maximum for consumer hardware to run models that are 500B - 800B quantized on their machines.

    It should be obvious now why Anthropic really doesn't want you to run local models on your machine.

    • Vibes > Benchmarks. And it's all so task-specific. Gemini 3 has scored very well in benchmarks for very long but is poor at agentic usecases. A lot of people prefering Opus 4.6 to 4.7 for coding despite benchmarks, much more than I've seen before (4.5->4.6, 4->4.5).

      Doesn't mean Deepseek v4 isn't great, just benchmarks alone aren't enough to tell.

    • With the ability of the Qwen3.6 27B, I think in 2 years consumers will be running models of this capability on current hardware.

No, the Deepseek V4 paper itself says that DS-V4-Pro-Max is close to Opus 4.5 in their staff evaluations, not better than 4.6:

> In our internal evaluation, DeepSeek-V4-Pro-Max outperforms Claude Sonnet 4.5 and approaches the level of Opus 4.5.

Is it honestly better than Opus 4.6 or just benchmaxxed? Have you done any coding with an agent harness using it?

If its coding abilities are better than Claude Code with Opus 4.6 then I will definitely be switching to this model.

  • Apparently glm5.1 and qwen coder latest is as good as opus 4.6 on benchmarks. So I tried both seriously for a week (glm Pro using CC) and qwen using qwen companion. Thought I could save $80 a month. Unfortunately after 2 days I had switched back to Max. The speed (slower on both although qwen is much faster) and errors (stupid layout mistakes, inserting 2 footers then refusing to remove one, not seeing obvious problems in screenshots & major f-ups of functionality), not being able to view URLs properly, etc. I'll give deepseek a go but I suspect it will be similar. The model is only half the story. Also been testing gpt5.4 with codex and it is very almost as good as CC... better on long running tasks running in background. Not keen on ChatGPT codex 'personality' so will stick to CC for the most part.

  • Their Chinese announcement says that, based on internal employee testing, it is not as good as Opus 4.6 Thinking, but is slightly better than Opus 4.6 without Thinking enabled.

> (better than Opus 4.6)

There we go again :) It seems we have a release each day claiming that. What's weird is that even deepseek doesn't claim it's better than opus w/ thinking. No idea why you'd say that but anyway.

Dsv3 was a good model. Not benchmaxxed at all, it was pretty stable where it was. Did well on tasks that were ood for benchmarks, even if it was behind SotA.

This seems to be similar. Behind SotA, but not by much, and at a much lower price. The big one is being served (by ds themselves now, more providers will come and we'll see the median price) at 1.74$ in / 3.48$ out / 0.14$ cache. Really cheap for what it offers.

The small one is at 0.14$ in / 0.28$ out / 0.028$ cache, which is pretty much "too cheap to matter". This will be what people can run realistically "at home", and should be a contender for things like haiku/gemini-flash, if it can deliver at those levels.

  • Anthropic fans would claim God itself is behind Opus by 3-6 months and then willingly be abused by Boris and one of his gaslighting tweets.

    LMAO

    • > Anthropic fans ...

      I have no idea why you'd think that, but this is straight from their announcement here (https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg):

      > According to evaluation feedback, its user experience is better than Sonnet 4.5, and its delivery quality is close to Opus 4.6's non-thinking mode, but there is still a certain gap compared to Opus 4.6's thinking mode.

      This is the model creators saying it, not me.

For the curious, I did some napkin math on their posted benchmarks and it racks up 20.1 percentage point difference across the 20 metrics where both were scored, for an average improvement of about 2% (non-pp). I really can't decide if that's mind blowing or boring?

Claude4.6 was almost 10pp better at at answering questions from long contexts ("corpuses" in CorpusQA and "multiround conversations" in MRCR), while DSv4 was a staggering 14pp better at one math challenge (IMOAnswerBench) and 12pp better at basic Q&A (SimpleQA-Verified).

The dragon awakes yet again!

  • There appears a flight of dragons without heads. Good fortune.

    That's literally what the I Ching calls "good fortune."

    Competition, when no single dragon monopolizes the sky, brings fortune for all.