Comment by nthypes

10 hours ago

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...

Model was released and it's amazing. Frontier level (better than Opus 4.6) at a fraction of the cost.

77 comments

nthypes

I don't think we need to compare models to Opus anymore. Opus users don't care about other models, as they're convinced Opus will be better forever. And non-Opus users don't want the expense, lock-in or limits.

As a non-Opus user, I'll continue to use the cheapest fastest models that get my job done, which (for me anyway) is still MiniMax M2.5. I occasionally try a newer, more expensive model, and I get the same results. I have a feeling we might all be getting swindled by the whole AI industry with benchmarks that just make it look like everything's improving.

versteegen 9 hours ago

Which model's best depends on how you use it. There's a huge difference in behaviour between Claude and GPT and other models which makes some poor substitutes for others in certain use cases. I think the GPT models are a bad substitute for Claude ones for tasks such as pair-programming (where you want to see the CoT and have immediate responses) and writing code that you actually want to read and edit yourself, as opposed to just letting GPT run in the background to produce working code that you won't inspect. Yes, GPT 5.4 is cheap and brilliant but very black-box and often very slow IME. GPT-5.4 still seems to behave the same as 5.1, which includes problems like: doesn't show useful thoughts, can think for half an hour, says "Preparing the patch now" then thinks for another 20 min, gives no impression of what it's doing, reads microscopic parts of source files and misses context, will do anything to pass the tests including patching libraries...
ind-igo 9 hours ago
Agree with your assessment, I think after models reached around Opus 4.5 level, its been almost indistinguishable for most tasks. Intelligence has been commoditized, what's important now is the workflows, prompting, and context management. And that is unique to each model.
- vidarh 7 hours ago
  
  Same for me. There are tasks when I want the smartest model. But for a whole lot of tasks I now default to Sonnet, or go with cheaper models like GLM, Kimi, Qwen. DeepSeek hasn't been in the mix for a while because their previous model had started lagging, but will definitely test this one again.
  The tricky part is that the "number of tokens to good result" does absolutely vary, and you need a decent harness to make it work without too much manual intervention, so figuring out which model is most cost-effective for which tasks is becoming increasingly hard, but several are cost-effective enough.
- wuschel 7 hours ago
  
  This is not true for some cases e.g. there are stark differences in the correctness of answers in certain type of case work.
sandos 8 hours ago
Is Opus nerfed somehow in Copilot? Ive tried it numerous times, it has never reallt woved me. They seem to have awfully small context windows, but still. Its mostly their reasoning which has been off
Codex is just so much better, or the genera GPT models.
- specproc 7 hours ago
  
  Opus just got killed in Copilot. I always found it great, FWIW.
  https://github.blog/news-insights/company-news/changes-to-gi...
spaceman_2020 7 hours ago

I found Opus 4.7 to be actually worse than Opus 4.6 for my use case
Substantially worse at following instructions and overoptimized for maximizing token usage
kmarc 9 hours ago

This resonates with me a lot.
I do some stuff with gemini flash and Aider, but mostly because I want to avoid locking myself into a walled garden of models, UIs and company
post-it 9 hours ago
What do you run these on? I've gotten comfortable with Claude but if folks are getting Opus performance for cheaper I'll switch.
- oceanplexian 8 hours ago
  
  You can just use Claude Code with a few env vars, most of these providers offer an Anthropic compatible API
- slopinthebag 9 hours ago
  
  Try Charm Crush first, it's a native binary. If it's unbearable, try opencode, just with the knowledge your system will probably be pwned soon since it's JS + NPM + vibe coding + some of the most insufferable devs in the industry behind that product.
  If you're feeling frisky, Zed has a decent agent harness and a very good editor.
  
  1 reply →
sandGorgon 8 hours ago
actually this is not the reason - the harness is significantly better. There is no comparable harness to Claude Code with skills, etc.
Opencode was getting there, but it seems the founders lost interest. Pi could be it, but its very focused on OpenClaw. Even Codex cli doesnt have all of it.
which harness works well with Deepseek v4 ?
- darkwater 8 hours ago
  
  What's the issue with OC? I tried it a bit over 2 months ago, when I was still on Claude API, and it actually liked more that CC (i.e. the right sidebar with the plan and a tendency at asking less "security" questions that CC). Why is it so bad nowadays?
avereveard 8 hours ago
eh idk. until yesterday opus was the one that got spatial reasoning right (had to do some head pose stuff, neither glm 5.1 nor codex 5.3 could "get" it) and codex 5.3 was my champion at making UX work.
So while I agree mixed model is the way to go, opus is still my workhorse.
- gunalx 5 hours ago
  
  I find gemini pretty good ob spatial reasoning.
  
  1 reply →
szundi 9 hours ago

I don’t know what people are doing but Minimax produced 16 bugreports which of 15 was false positives (literally a mistake).
In contrast ChatGPT 5.3 and also Opus has a 90% rate at least on this same project. (Embedded)
All other tests were the same. What are you doing with these models?

onchainintel 10 hours ago

How does it compare to Opus 4.7? I've been immersed in 4.7 all week participating in the Anthropic Opus 4.7 hackathon and it's pretty impressive even if it's ravenous from a token perspective compared to 4.6

greenknight 10 hours ago
The thing is, it doesnt need to beat 4.7. it just needs to do somewhat well against it.
This is free... as in you can download it, run it on your systems and finetune it to be the way you want it to be.
- libraryofbabel 8 hours ago
  
  > you can download it, run it on your systems
  In theory, sure, but as other have pointed out you need to spend half a million on GPUs just to get enough VRAM to fit a single instance of the model. And you’d better make sure your use case makes full 24/7 use of all that rapidly-depreciating hardware you just spent all your money on, otherwise your actual cost per token will be much higher than you think.
  In practice you will get better value from just buying tokens from a third party whose business is hosting open weight models as efficiently as possible and who make full use of their hardware. Even with the small margin they charge on top you will still come out ahead.
  
  4 replies →
- p1esk 10 hours ago
  
  Do you think a lot of people have “systems” to run a 1.6T model?
  
  14 replies →
- onchainintel 10 hours ago
  
  Completely agree, not suggesting it needs ot just genuinely curious. Love that it can be run locally though. Open source LLMs punching back pretty hard against proprietary ones in the cloud lately in terms of performance.
- kelseyfrog 10 hours ago
  
  What's the hardware cost to running it?
  
  5 replies →
- johnmaguire 10 hours ago
  
  ... if you have 800 GB of VRAM free.
  
  1 reply →
spaceman_2020 7 hours ago

Tbh I was more productive with 4.6 than ever before and if AI progress locks in permanently at 4.6 tier, I’d be pretty happy
rvz 10 hours ago
It is more than good enough and has effectively caught up with Opus 4.6 and GPT 5.4 according to the benchmarks.
It's about 2 months behind GPT 5.5 and Opus 4.7.
As long as it is cheap to run for the hosting providers and it is frontier level, it is a very competitive model and impressive against the others. I give it 2 years maximum for consumer hardware to run models that are 500B - 800B quantized on their machines.
It should be obvious now why Anthropic really doesn't want you to run local models on your machine.
- deaux 9 hours ago
  
  Vibes > Benchmarks. And it's all so task-specific. Gemini 3 has scored very well in benchmarks for very long but is poor at agentic usecases. A lot of people prefering Opus 4.6 to 4.7 for coding despite benchmarks, much more than I've seen before (4.5->4.6, 4->4.5).
  Doesn't mean Deepseek v4 isn't great, just benchmarks alone aren't enough to tell.
- snovv_crash 9 hours ago
  
  With the ability of the Qwen3.6 27B, I think in 2 years consumers will be running models of this capability on current hardware.
- colordrops 9 hours ago
  
  What's going to change in 2 years that would allow users to run 500B-800B parameter models on consumer hardware?
  
  2 replies →

creamyhorror 7 hours ago

No, the Deepseek V4 paper itself says that DS-V4-Pro-Max is close to Opus 4.5 in their staff evaluations, not better than 4.6:

> In our internal evaluation, DeepSeek-V4-Pro-Max outperforms Claude Sonnet 4.5 and approaches the level of Opus 4.5.

doctoboggan 10 hours ago

Is it honestly better than Opus 4.6 or just benchmaxxed? Have you done any coding with an agent harness using it?

If its coding abilities are better than Claude Code with Opus 4.6 then I will definitely be switching to this model.

bokkies 8 hours ago

Apparently glm5.1 and qwen coder latest is as good as opus 4.6 on benchmarks. So I tried both seriously for a week (glm Pro using CC) and qwen using qwen companion. Thought I could save $80 a month. Unfortunately after 2 days I had switched back to Max. The speed (slower on both although qwen is much faster) and errors (stupid layout mistakes, inserting 2 footers then refusing to remove one, not seeing obvious problems in screenshots & major f-ups of functionality), not being able to view URLs properly, etc. I'll give deepseek a go but I suspect it will be similar. The model is only half the story. Also been testing gpt5.4 with codex and it is very almost as good as CC... better on long running tasks running in background. Not keen on ChatGPT codex 'personality' so will stick to CC for the most part.
madagang 10 hours ago
Their Chinese announcement says that, based on internal employee testing, it is not as good as Opus 4.6 Thinking, but is slightly better than Opus 4.6 without Thinking enabled.
- mchusma 10 hours ago
  
  I appreciate this, makes me trust it more than benchmarks.
- ibic 8 hours ago
  
  In case people wonder where the announcement is (you can easily translate it via browser if you don't read Chinese): https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg
  It's still a "preview" version atm.
- anentropic 3 hours ago
  
  Who uses Opus without thinking though...?
- deaux 9 hours ago
  
  That's super interesting, isn't Deepseek in China banned from using Anthropic models? Yet here they're comparing it in terms of internal employee testing.
  
  2 replies →

NitpickLawyer 9 hours ago

> (better than Opus 4.6)

There we go again :) It seems we have a release each day claiming that. What's weird is that even deepseek doesn't claim it's better than opus w/ thinking. No idea why you'd say that but anyway.

Dsv3 was a good model. Not benchmaxxed at all, it was pretty stable where it was. Did well on tasks that were ood for benchmarks, even if it was behind SotA.

This seems to be similar. Behind SotA, but not by much, and at a much lower price. The big one is being served (by ds themselves now, more providers will come and we'll see the median price) at 1.74$ in / 3.48$ out / 0.14$ cache. Really cheap for what it offers.

The small one is at 0.14$ in / 0.28$ out / 0.028$ cache, which is pretty much "too cheap to matter". This will be what people can run realistically "at home", and should be a contender for things like haiku/gemini-flash, if it can deliver at those levels.

slopinthebag 9 hours ago
Anthropic fans would claim God itself is behind Opus by 3-6 months and then willingly be abused by Boris and one of his gaslighting tweets.
LMAO
- NitpickLawyer 9 hours ago
  
  > Anthropic fans ...
  I have no idea why you'd think that, but this is straight from their announcement here (https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg):
  > According to evaluation feedback, its user experience is better than Sonnet 4.5, and its delivery quality is close to Opus 4.6's non-thinking mode, but there is still a certain gap compared to Opus 4.6's thinking mode.
  This is the model creators saying it, not me.

bbor 9 hours ago

For the curious, I did some napkin math on their posted benchmarks and it racks up 20.1 percentage point difference across the 20 metrics where both were scored, for an average improvement of about 2% (non-pp). I really can't decide if that's mind blowing or boring?

Claude4.6 was almost 10pp better at at answering questions from long contexts ("corpuses" in CorpusQA and "multiround conversations" in MRCR), while DSv4 was a staggering 14pp better at one math challenge (IMOAnswerBench) and 12pp better at basic Q&A (SimpleQA-Verified).

Quasimarion 9 hours ago

FWIW it's also like 10x cheaper.

sergiotapia 10 hours ago

The dragon awakes yet again!

kindkang2024 9 hours ago

There appears a flight of dragons without heads. Good fortune.
That's literally what the I Ching calls "good fortune."
Competition, when no single dragon monopolizes the sky, brings fortune for all.

rapind 10 hours ago

Pop?