DSpark: Speculative decoding accelerates LLM inference [pdf]

3 days ago (github.com)

387 comments

aurenvale

DeepSeek continues to not only push the boundaries but also publish these incredible papers explaining how they achieved their gains - something the American labs no longer do unfortunately. Chinese labs are doing the most interesting work in AI right now.

sigmar 3 days ago
>publish these incredible papers explaining how they achieved their gains - something the American labs no longer do unfortunately.
Google is still releasing a lot of LLM architecture research. They introduced speculative decoding of LLMs in 2022[1], then released the code to perform speculative decoding for their Gemma 4 model this year[2].
[1] https://arxiv.org/abs/2211.17192
[2] https://github.com/google-gemma/cookbook/blob/main/docs/mtp/...
- kamranjon 3 days ago
  
  Thanks for the clarification - Google does publish more than others - and I actually really appreciate the work they are doing with the Gemma models, which are truly competitive open models. I do wish they’d publish more in depth papers on their Gemma models but appreciate that they are open weights.
- DiabloD3 3 days ago
  
  They weren't the first to do MTP like this, and arguably did it wrong: the MTP heads are kept in a separate file and have to be welded in by the inference engine.
  Qwen 3.6 shipped with working MTP first, and had working MTP in llama.cpp first.
  
- janalsncm 3 days ago
  
  They also shipped Gemma models with their new Matformer architecture which allows for dynamic computation.
  https://arxiv.org/pdf/2310.07707v2
- sieabahlpark 3 days ago
  
  [dead]
tomalaci 3 days ago
Probably because American AI companies are on the hook for quite a lot of investment money. I think they are trying to find the magical moat to justify their valuation.
Revealing optimizations similar to these would pretty much reduce their competitive position.
- lwansbrough 3 days ago
  
  Chinese labs are also still behind, so they’re incentivized to collaborate and have no reason to do it in private.
  I suspect their tune will change if they ever take the lead.
  
- davedx 3 days ago
  
  I don't really see the moat for frontier AI labs being "more efficient models", although that could help their margins. I think moats will be built through horizontal and vertical market expansion, which Anthropic is doing the most of at the moment.
- cromka 3 days ago
  
  I seriously am far from fear mongering and doomsday mentality, but I just can't see how OpenAI and Anthropic can have a successful IPO if the quality gap between the free and paid continues to narrow like that...
  
- baxtr 3 days ago
  
  Who is financing DeepSeek and what are they expecting in return?
  
- janalsncm 3 days ago
  
  Chinese labs are also forced to find performance optimizations since they aren’t allowed to buy the best chips.
- bluerooibos 3 days ago
  
  > Probably because American AI companies are on the hook for quite a lot of investment money
  That's a lot of words to say it's just capitalist greed.
- spacebacon 3 days ago
  
  [dead]
- budsniffer952 3 days ago
  
  [flagged]
  
herodoturtle 3 days ago
Publishing by necessity, I wonder? American labs are on the cutting edge, pioneering the way forward, so DeepSeek open-sourcing what they’ve got is a way to help even the playing field.
Hopefully the experts here can offer insight. The above is just my hunch and I’m not a specialist in this field.
- try-working 3 days ago
  
  Yes, challenger labs publish out of necessity. It is a marketing strategy. People assume open source means giving something up, but the reality is that Z.ai has a revenue of some $100M and it would be about $0M if they never open sourced their models.
- jonplackett 3 days ago
  
  Wouldn’t that just help the American labs anyway though? Or do they assume they’ve actually already figured this stuff out and kept it secret?
  
- _0ffh 3 days ago
  
  I'm afraid I'm even balking at the word "pioneering" in context with US frontier labs. They are probably doing a few new things, right, but they are not blazing any trails for others to follow along, the Chinese are.
  
- epolanski 3 days ago
  
  Chinese papers and techniques have been very influential and copied by US labs.
  Multi-head Latent Attention (MLA), Multi-Token prediction, MoE architecture are some of the most famous examples.
  
- skeledrew 3 days ago
  
  > Publishing by necessity
  It's more a cultural thing. Sharing progress is just in their blood.
  
rvz 3 days ago
Exactly. They did not have to open up their research, and this is what happens when smart researchers are forced to squeeze performance gains out of existing hardware.
They don't have TPUs or access to the latest Vera Rubin GPUs either to get performance gains for free. All of the optimizations Deepseek have done are in software and it goes down to the PTX assembly level.
Compare that to Anthropic, who are celebrating fixing a flickering issue in a terminal app, which took months to fix.
- HarHarVeryFunny 3 days ago
  
  > All of the optimizations Deepseek have done are in software and it goes down to the PTX assembly level
  DeepSeek are still using NVIDIA (PTX) to train on, but for inference have already transitioned to Huawei Ascend chips, and inference speed is what this paper is addressing.
- yorwba 3 days ago
  
  Anthropic almost certainly also has optimized software down to the assembly level, considering this take-home interview challenge they published: https://github.com/anthropics/original_performance_takehome/... which is all about instruction-level performance optimizations. That they don't prioritize UI fixes just means they consider other things more important.
  
- vidarh 3 days ago
  
  > Compared to Anthropic who are celebrating in fixing a flickering issue in a terminal app which took months to fix.
  It's funny, because if you ran Claude Code on a slow terminal, the cause of the flicker was obvious: They kept dumping the entire history of the chat back into the terminal in a number of situations, and relied on the terminal to then end up in the correct state.
- saagarjha 3 days ago
  
  All frontier labs are working down to the PTX level (and lower)
gmerc 3 days ago

Deepseek is commoditizing the performance gains US labs rely on to make their investors money.
jmyeet 3 days ago
Chinese companies (and labs) operate in conjunction with the CCP so whatever they're doing, it's because it's Chinese state policy.
What became clear when DeepSeek came onto the scene was that China was seeking to commoditize LLMs. They consider it an issue of national security not to be beholden to US tech companies when it comes to AI. And I, for one, fully endorse this policy.
Another data point on this is the black market for Claude tokens in China [1]. The chat logs themselves are a commodity to train models.
I believe that OpenAI in particular is a bet on a trillion dollar pot of gold that doesn't exist. Google, Microsoft, Amazon and Meta will all be fine. Anthropic is in a far better position than OpenAI (IMHO) but if DeepSeek or some other Chinese open weight model gets as good at coding, they're in real trouble too.
[1]: https://news.ycombinator.com/item?id=48667495
- anon373839 3 days ago
  
  I don’t see how Anthropic is in a better position. They have a slight edge in model quality right at a time when we’re getting a taste of what cheap, “good enough” AI looks like. They don’t own their own compute. And their own arrogance and lies have alienated a huge chunk of their customer base and alerted everyone to the dangers of being dependent on them.
  
- tw1984 3 days ago
  
  > Another data point on this is the black market for Claude tokens in China [1]. The chat logs themselves are a commodity to train models.
  anyone with IQ higher than 130 (thus qualified for actual AI R&D) would be questioning something obvious here -
  if they are already doing such dodgy stuff with the aim to maximize profits, why would those resellers have large amount of logs with actual American model responses to sell to those AI labs in the first place. shouldn't they just post train & customize some leading Chinese open source models to pretend to be Opus or GPT for the vast majority of their users (as classified by some models) who don't know much about expected Opus behaviours & not skilled enough to tell the differences?
  that is actually the interesting bit not covered in your censored version of the story line, it is also what happens on the ground. your censored version of the story implies that those dodgy resellers using stolen credit cards, pooling accounts with stolen IDs and illegally selling very personal logs would somehow be honest enough to spend extra $ to ensure their victims (aka paying users) can actually use real Opus and GPT. LOL
  dude, you failed this IQ test miserably.
  
janalsncm 3 days ago

Their R1 paper was really well-done. But I think it leaves out a few details necessary for stable training.
https://cameronrwolfe.substack.com/p/grpo-tricks
garn810 3 days ago
Yep. It's about time the Western world realized the Chinese are not the "very bad guys under dictatorship"
- 3abiton 3 days ago
  
  Honestly it's just a hierarchy difference between the two countries. In the US, tech/fin/military companies have the upper hand compared to the government (fragmented between 2 parties). Despite the charades with Anthropic, Tech-fluencers are in control. In China, by contrast, the government (dictatorship) has more control over Tech companies (take any example from the past 10 years). For them, undermining the US AI supremacy is an objective, and releasing open weight models is the way, and I'm all for it.
- idiotsecant 3 days ago
  
  Let's not get crazy here. You can acknowledge that the Chinese AI industry has some structural advantages right now without trying to claim anything else. China is still a brutal autocracy.
- cloudfudge 3 days ago
  
  I don't think it's very common to believe the Chinese people are bad guys. It's the government and its control of the people that's the problem. And no, I don't think the US is immune to that sort of problem either.
epolanski 3 days ago
R1 was very influential on US model development.
teekert 3 days ago

I'm deep seeking for that open in OpenAI indeed. It’s clear who’s the most anthropocentric in this space.
thesmtsolver2 3 days ago

This is so out of touch. Go to Neurips or the top AI conferences to see what is happening.
SubiculumCode 3 days ago
If American labs aren't publishing, it doesn't mean they aren't doing even more interesting work.
- etdznots 2 days ago
  
  So fascinating, can’t wait to never hear about or be affected by this research until it’s discovered elsewhere.
  I genuinely wonder how it feels to be working your whole life, actual flesh and blood and heart and mind pouring 40 to make something that is a dead-end on the tree of human progress because its miserly masters are terrified of sharing knowledge.
  Days and nights spent playing pretend human pioneer, when you are a lunatic on an island building towers of coconuts.
- californical 3 days ago
  
  You could also come up with a cure for cancer, but if nobody knows what you’ve done then there’s not a whole lot we can say about it
dakolli 3 days ago
It's because our culture worships pieces of paper the government tells us are worth something.
- IAmGraydon 3 days ago
  
  Money is just a physical representation of the ability to get what you want. The problem is not money. It’s the fact that we live in a “me” society.
- mordae 3 days ago
  
  Nope, people seek it out because government tells them to pay taxes _or else_.
utopiah 3 days ago

It's almost as if ... they were what OpenAI was when it started. Sad to see but glad someone is doing it.
OtomotO 3 days ago

The difference between greed and power
nelox 3 days ago

Doing work ≠ publishing work
pmarreck 3 days ago
They push the boundaries, alright. Of obtaining the results of work without doing the work themselves, which, I hate to say it, is classic Chinese Machiavellian business behavior:
https://www.cnbc.com/2026/06/24/anthropic-alibaba-distillati...
- etdznots 2 days ago
  
  You mean like training off of pirated copyrighted works for example that Anthropic, OpenAI, and Google stole from the internet?
resters 3 days ago

Thank you so much to everyone at DeepSeek who is working on this and who have the courage and generosity to open source this for humanity.
We in the United States will never forget!
For all the harm Trump does to the US at least he is helping China!
godwinson__4-8 3 days ago
The idea that America is going to stay ahead of China is I think at this point clearly delusional. It's also just such silly framing. Why should 350 million people stay ahead of 1 billion people on the other side of the world? If an AI lab in China cures cancer or something do Americans lose?
So many Americans seem to (at least in theory) be ready to sign up for this ongoing confrontation with China. Does anyone think it isn't America who is poking the bear when it comes to the Thucydides trap? Why not try to get along? It occurs to me the only people more Chinese innovation would hurt are the mega cap class in the United States. Elon Musk certainly doesn't want BYD in the United States. Same story all the way down with these super capitalized AI companies. Most average Americans would probably be better off in a world where the United States and China got along. But it's those Americans who will be called upon to suffer most of the burden if that trap ever springs.
- thesmtsolver2 3 days ago
  
  By this population-only logic, you should concede that India will overtake China.
  Why not talk about how China shut out American companies for decades before complaining about BYD?
  Speaking as an Indian immigrant: the PRC has engaged in conflict with almost all its neighbors and started wars in its short history.
  China is not so benevolent when they get to the #1 spot:
  https://m.economictimes.com/industry/renewables/china-wto-co...
  
darkoob12 3 days ago
Google and Microsoft publish more than enough and American universities are publishing the science beyond DeepSeek's engineering. The fact that you don't know about them means you're not following the science, only reading Hacker News.
- kamranjon 3 days ago
  
  Google hasn’t published much in depth ML work since T5 (which was hugely influential at the time) - most Gemma releases are 1-3 page model card pdfs these days with no in depth analysis. Even TurboQuant is shaking out to have basically been a rehash of previous work without proper attribution. I do think Microsoft is doing some interesting things with smaller models but haven’t read much research, interested in any refs you might have to share!
  
DivingForGold 3 days ago
Sure, in part by "stealing" from American AI companies with Distillation attacks:
https://yipzap.com/anthropic-accuses-alibaba-of-largest-ai-d...
- pennomi 3 days ago
  
  If your moat is “please don’t copy my outputs”, you don’t have a moat. There is no such thing as a distillation “attack”.
  
- Jonnerz 3 days ago
  
  US AI companies trained their own models on vast amounts of copyrighted and publicly available content without obtaining permission. There's no moral high ground here.
- NitpickLawyer 3 days ago
  
  While I don't agree with your comment being downvoted, I don't think distillation is either an "attack" or "stealing". The idea that someone else gets to decide how I use tokens that I pay for is ludicrous.
  Imagine if your casio calculator would come with a ToS that says you can't use it to develop a competitor calculator or any other tools. Or that your hammer can't be used to make other tools. Or, closer to the HN crowd, imagine MS in the 90s saying that you can't use their OS to build competing services to MS. They'd be laughed at and be split immediately if they tried that.
  The only thing they can do is to refuse serving tokens (and even that's debatable, if we get to tokens being commoditised). But that's gonna be a game of whack-a-mole, and they know it.
- orbital-decay 3 days ago
  
  Besides "attack" being a ludicrous name for distillation, note how your article says "accuses", also it's mostly about Alibaba, not DeepSeek (although it's mentioned there). Both Dario Amodei and Sam Altman publicly claimed that DS used their outputs to train their models, and knowing the differences between all these models by heart, I believe they're simply lying through their teeth to sway the public opinion and/or the policy. These models are absolutely nothing alike, and distillation necessarily makes student's outputs similar to teacher's. This is very visible in Z.ai models (which were trained on Gemini outputs to the point that they repeated Google's conditional prompt injections in the CoT, and later on Claude where it started repeating their CoT as well) and certain Google models which were trained on Claude's outputs in a roundabout way. Distillation always shows up in the result.
  And certainly they have no idea whether these outputs (assuming they ever existed and it wasn't made up) were used for training. The article mentions that DS made 150k requests. This isn't much and might have been just an eval or a benchmark to compare their own model against. It's really hard to believe DeepSeek had any Claude outputs anywhere in their training schedule, since it's just too different. Besides training on random vibecode of course, which is mostly written by Claude.
- pmarreck 3 days ago
  
  You know what, if someone wants to downvote this guy by claiming distillation attacks are not "attacks" or don't cross some ethical bound (especially since I just posted a similar comment), then go right ahead, but if you're combining it with any notion of "leadership", that's like saying that the person in 2nd place in a bike race who is drafting behind the person actually in 1st place is exhibiting "leadership".
  There's no "leader" if, absent someone whose results you're copying, you are an emperor without clothes

kamranjon 3 days ago

The Hugging Face models are already up and seem to be the original models with the speculative decoding module built in, which is very cool:

Flash: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark

Pro: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark

Excited to see if this makes it into DwarfStar for local inference, have been using the flash model extensively since the 2-bit quants were made available by antirez.

ilaksh 3 days ago
Any chance they will have this for Qwen 27B also?
- kamranjon 3 days ago
  
  The paper actually references testing their DSpark speculative decoding strategy with Qwen 3 4B, 8B and 14B models, so while I doubt they will release builds themselves, they’ve open sourced their training pipeline (DeepSpec) for this, so we will likely see folks adopting it for other models.

StizzurpXDD 3 days ago

DeepSeek is, as I feel currently, the sole AI company which is actually trying to innovate rather than top mere benchmarks. Others like OpenAI, Anthropic and Google are mostly just competing with each other rather than innovating around the clock.

Alifatisk 3 days ago
> DeepSeek is, as I feel currently, the sole AI company which is actually trying to innovate rather than top mere benchmarks.
I'd also include the other Chinese labs like Moonshot (behind Kimi) and Z.ai (behind GLM). They are innovating and continue openly sharing their research with the public. I believe the founder of Moonshot even shared a 40-minute video on Twitter where he goes through the techniques that power Kimi.
- alecco 2 days ago
  
  Isn't GLM-5.2 mostly DeepSeek V3 architecture?
  More and more I suspect Z.ai just has deeper pockets and access to the Claude traces while DeepSeek is punching way above their class.
  
otterley 3 days ago

> Others like OpenAI, Anthropic and Google are mostly just competing with each other rather than innovating around the clock.
They compete with each other by innovating. The innovations result in more utility for the customer, but the technology isn't made public. Trade secrets are secret for a reason.
The reason people may think that DeepSeek is the "most innovative" is because of what they can observe from the outside, much like people may mistakenly conclude models are the "prettiest of the population" because not everyone is photographed for public consumption.
nicce 3 days ago
> Others like OpenAI, Anthropic and Google are mostly just competing with each other rather than innovating around the clock.
The strategy for most companies in the US has for a long time been to capture the social audience, whatever the means. Quality and innovation are secondary. Capture the market, lock in the users, influence regulation and lobby to keep the power.
- stymaar 3 days ago
  
  > Capture the market, lock in the users, influence regulation and lobbying to keep the power.
  “Buy every new player threatening their business” should be at #3 in your list.
spongebobstoes 3 days ago
the big labs have already been doing this for at least a year
- kcb 3 days ago
  
  Yes, all the closed providers are probably doing this already. As well as open models like Gemma and Nemotron.
smcleod 3 days ago
Qwen as well.
- kamranjon 3 days ago
  
  There was a recent exodus from Qwen of researchers who supported their open source efforts, I’m not sure we will see many new open models from them past the 3.6 series.
  
FuckButtons 2 days ago

One presumes that in order to compete they must also be innovating. If the only innovation they had was how much money they could set on fire, then OpenAI / Google would have walked all over Anthropic by now.
jimmydoe 3 days ago
Besides the founder, the only real external investor for DeepSeek is the Chinese govt. There is literally zero revenue pressure compared to O, A & G.
To compete in that direction, USG needs to learn from CCP to "seize the means of production", which they are sort of doing, but in such an incompetent way that I'm afraid we will probably end up mixing the worst of both communism and capitalism.
- altcognito 3 days ago
  
  China is just taking a lot of ideas from the USG when it was doing things correctly and is using those for innovation.
  In this case, it feels like they are just funding multiple independent pure research projects and letting the chips fall where they may.
  Doesn't even really seem like Europe can coordinate that.
- otterley 3 days ago
  
  > To compete in that direction, USG needs to learn from CCP to "seize the means of production"
  No they don't. The U.S. Government is free to launch their own AI labs if they wish -- and even compete with the private sector -- but that doesn't mean they have to confiscate existing investments and capital. But Congress is unlikely to do that, because we've learned in the course of history that in well-functioning competitive markets, publicly-operated services tend to be worse than private ones across multiple dimensions.
  Chinese companies are largely where they are not because they're state funded, but because they operate in ways that would be considered criminal in the U.S. If they didn't constantly trespass on OpenAI and Anthropic to try to achieve product and technological parity, they would be too far behind to produce innovative research.
pmarreck 3 days ago
Please explain how distillation == innovation
Especially since your 5-day-old account is sus, and thus likely not yet proven not to be a Chinese bot
You can't lead by following the actual leader LOL
The only real innovation I've seen from Deepseek is the out-loud reasoning thing in R1
- jst1fthsdys 3 days ago
  
  You call yourself a philosopher in your profile.
  

piterrro 3 days ago

I’ve been using DeepSeek v4 pro for a month now in Kilo Code and it's great. Fast, reliable, large context window and cheap as… Did 1.5B tokens this month and it cost me 40 USD (majority cached, but still).

redman25 1 day ago

I've been preferring Mimo recently. Same price as DeepSeek, more reliable tool calling (subjectively), and has some nice qualities in terms of prose, etc.
I've heard others say that DeepSeek tends to be smarter on specific problems but that Mimo tends to be more well-rounded.
spiderfarmer 3 days ago
Is there a way to see how many tokens one uses with Claude Code (pro)?
- bpavuk 3 days ago
  
  the casino has no clocks, as one HN user put it some time ago.
  I second ccusage, it's nice
- cptchaos 3 days ago
  
  https://ccusage.com/
- edg5000 3 days ago
  
  It's in the JSONs in ~/.claude, but last 30 days only I think. You can have the model analyze history. So for correct history you'd need to run history analysis on a cron job or something. Kinda hacky.
  
- O_H_E 3 days ago
  
  https://github.com/kenn-io/agentsview
  > Local-first session search, analytics, insights, and token use statistics for coding agents, supporting Claude Code, Codex, and more than 20 other agents.
  solid piece of software
richardlblair 3 days ago
I've been using omp with deepseek as my task and quicktask agents, and sonnet as everything else.
It's drastically reduced my AI spend. I went from spending $40/day to $10/day.
- throwa356262 3 days ago
  
  Have you tried reasonix?
  https://github.com/esengine/deepseek-reasonix
fer 3 days ago
Which provider? I went through 40 bucks on it on openrouter. It was not a lot of back and forth, context ended at around 300k, 15kloc output. I was using opencode, unsure if I can make the total token count visible.
- peheje 3 days ago
  
  OpenRouter sometimes chooses a very expensive provider. Try the floor slug or choose the provider directly. I moved to just putting 5 dollars directly on DeepSeek instead of going through OR.
apitman 3 days ago

Have you compared Kilo to Pi or OpenCode? Those are the two I'm most familiar with but always looking for alternatives.

rvz 3 days ago

This is just one of many papers DeepSeek have released to be able to serve models at extremely cheap prices, unlike the others taking on >$100B of debt to build data centers for the same thing.

> As with V4-Flash, we treat this point as an indication that DSpark sustains useful throughput under an interactivity target that the baseline cannot efficiently support. At matched system capacities, DSpark delivers 57% to 78% faster per-user generation.

Reminds me of the flawed 2017-era approach to scaling servers built on memory-intensive technologies: adding even more servers to solve the problem. (It just increases costs.)

Rather than doing that, think about which critical parts of your app can be written in a more performant technology.

Fast forward to 2026, now you can see who is just throwing more money at the problem to create even more problems, whereas DeepSeek is giving us optimized solutions.

I know exactly who I would pay attention to, and it is absolutely not Anthropic.

denverllc 3 days ago

For so long American companies have operated under the assumption that servers are cheaper than developers, and that was used to justify all sorts of inefficient practices.
The last year has shown that’s not true anymore (even for web servers).
simianwords 3 days ago
...... are you really suggesting OpenAI and Anthropic don't have access to these techniques?
- sourcecodeplz 3 days ago
  
  if they didn't, they do now. as deepseek published the howto

Havoc 3 days ago

Nice.

Guessing the timing isn't accidental. Demonstrated openness vs harsh regulation

cr125rider 3 days ago
China = Open. US = Harsh Regulation
Strange timeline, though this only works because it’s aligned with Xi’s goals.
- Havoc 3 days ago
  
  Yeah can definitely see a world where China pivots and we're stuck with closed/closed
  Mistral...don't fumble this
  
declan_roberts 3 days ago

Nobody forced Anthropic to go on a media blitz loudly proclaiming the dangers of their new AI model. Serves them right honestly.

xnx 3 days ago

Is this newer/better than the speculative decoding from 2022? https://arxiv.org/abs/2211.17192

tiahura 3 days ago

Seems like they focus on improving the drafter and the verification policy so speculation keeps producing net speedups rather than wasted verification work at DeepSeek scale.
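For anyone who hasn't looked at how draft-and-verify works, here is a minimal Python sketch of the greedy variant of generic speculative decoding. It is not DSpark's drafter or verification policy; `toy_target`, `toy_draft`, and all function names are made up for illustration, and a real engine would score every drafted position in one batched forward pass of the target model instead of looping.

```python
from typing import Callable, List

Token = int
# A "model" here is just a greedy next-token function over the context.
NextToken = Callable[[List[Token]], Token]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[Token], k: int) -> List[Token]:
    """One draft-and-verify round; returns between 1 and k + 1 new tokens."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal (one batched pass in a real engine).
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft was accepted, so the target's next token comes for free.
    accepted.append(target(ctx))
    return accepted


def speculative_generate(target: NextToken, draft: NextToken,
                         prefix: List[Token], n_new: int, k: int = 4) -> List[Token]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prefix) + n_new]


def greedy_decode(target: NextToken, prefix: List[Token], n_new: int) -> List[Token]:
    out = list(prefix)
    for _ in range(n_new):
        out.append(target(out))
    return out


# Hypothetical toy models: the target counts modulo 10,
# the drafter agrees except right after a 2.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_generate(toy_target, toy_draft, [0], 12))
    print(greedy_decode(toy_target, [0], 12))
```

The two printed sequences are the same: in the greedy variant the drafter never changes the output, only how many target passes are needed. That is why a drafter that gets rejected often just burns verification work.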
alok-g 3 days ago

That paper is cited in the 'introduction' and 'background' sections. This paper is improving by removing some bottlenecks.
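A rough way to see why drafter quality is one of those bottlenecks: under the simplifying assumption used in the 2022 paper that each drafted token is accepted independently with probability alpha, a round with gamma drafted tokens yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens on average. A small Python sketch of that formula follows; the function name is mine, and real acceptance rates are not independent across positions.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target pass.

    alpha: per-token acceptance probability (assumed i.i.d., a simplification)
    gamma: number of tokens the drafter proposes per round
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft accepted, plus the bonus token from the target.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.95):
        print(alpha, round(expected_tokens_per_round(alpha, gamma=4), 2))
```

Drafting more tokens only pays off when alpha is high, since the gain from a larger gamma flattens out quickly at low acceptance rates.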

articlepan 3 days ago

Title is bad, it's the first line of the abstract instead of the paper title. Speculative decoding for LLM inference was published in 2022: https://arxiv.org/abs/2211.17192

This paper seems to be an improvement to speculative decoding but I haven't read it yet.

bflesch 3 days ago

At this point why can't someone produce a fridge or container-sized AI appliance based on legacy chips (12nm)? I imagine this would cover 80% of corporate use cases where you need "google-in-a-box" functionality.

State-of-the-art process nodes are out of reach, but if you have infinite solar energy during business hours does it really matter? Every company has a parking spot, so this ASIC-like appliance could be as big as a shipping container.

If it could just run recent open models for a handful of users it would be such a no-brainer to buy.

scrlk 3 days ago
See "exabox" from George Hotz: https://tinycorp.myshopify.com/products/exabox-preorder
- flipped 3 days ago
  
  No one's buying that shitbox.
  
sixhobbits 3 days ago

Nvidia is already selling exactly this I think, not sure when it's expected to ship
benjiro29 3 days ago
The issue is that there are only so many fabs in the world that make memory. And if you want the good stuff, you're easily going into 400-750B parameter models. That means 400 to 750GB of memory at 8 bits per parameter, or about half that at FP4.
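To make that sizing concrete, here is a back-of-the-envelope Python sketch. It counts weights only (KV cache, activations and runtime overhead come on top), uses 1 GB = 1e9 bytes, and the function name is just for illustration.

```python
def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes).

    params_billions * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB.
    """
    return params_billions * bits_per_param / 8


if __name__ == "__main__":
    for params in (400, 750):
        for bits in (4, 8, 16):
            gb = weight_footprint_gb(params, bits)
            print(f"{params}B params @ {bits}-bit: {gb:.0f} GB")
```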
Did I mention there are only so many memory makers and they are all busy printing money with HBM memory?
Intel is trying with Crescent Island, to make a 160GB GPU that uses LPDDR5X memory.
HBM takes multiple times the resources to make vs basic DDR5 memory. So by going this route, you have more memory, with the disadvantage that it's only 700GB/s, versus HBM pumping out terabyte-per-second numbers like it's nothing.
These cards, if reasonably priced, may be a good alternative to $10k 96GB Nvidia Blackwells... You give up token generation speed (heavily dependent on memory bandwidth) for more memory to run larger models on home/office/company servers.
The problem is, again, there are only so many memory makers and it's not like the market is flooded with DDR5 memory anymore, as the big 3 moved a lot of production to HBM.
Another approach is Sandisk making HBF ... Flash memory, like your typical NVMe but designed around maximum speed. So instead of loading the models into expensive HBM memory, you use the density benefits of flash memory to offload models into that. Cheaper, but slower... But it leaves your expensive HBM memory free for things like KV cache, active parameters, etc... So your model will be slower, but you're running it as a hybrid. As in, faster than running a model from system memory with normal DDR memory, but not as fast as HBM.
So yea, there is a lot in development to reduce the dependence on that resource-eating HBM memory. For the wafer cost of 1GB HBM, you normally get 4GB of normal memory. That is why the world supply of memory dropped. Not just because of the insane buying, but because HBM is just very inefficient in wafer usage.
Can we not use DDR4 production and create some kind of hybrid solution? Sure, but the big 3 moved away from DDR4 in favor of DDR5 a long time ago. We have competition from China with a mix of DDR4/DDR5, but they also need to scale up. Nobody expected to see a large part of the world production vanish into HBM...
Even when it comes to DDR4 and older nodes, ironically, most companies had been moving away from DDR4. There is only so much wafer capacity in the world, to the point that companies are moving to DDR2 ... Yeah, not a typo, like 2007 DDR2! for IoT devices etc, stuff that does not need fast memory. Because even DDR3 got too expensive for them.
It's not like the old nodes are not used anymore ... as if that capacity was sitting idle. It was still in production making other stuff. The only real solution is that we need more fabs, and those take years to build. And the big 3 delayed investing in new fabs for a long time, unsure about the whole AI bubble stuff. Aka, they did not want to build a ton of fabs and end up with overcapacity if the AI growth collapsed.
- bradfa 3 days ago
  
  With MoE models like Deepseek’s and with multiple Crescent Island accelerators, the aggregate memory throughput actually doesn’t look that bad. Two Crescent Island cards get roughly 1400GB/s, and Deepseek-v4-flash with 13B parameters active nets roughly 100t/s, which is decent for a small team or great for a single user.
  More Crescent Island cards scale up further, although likely not entirely linearly.
  But all GPU inference works like this, it’s not specific to Intel. Intel just promises more affordable cards with big memory, so they’re attractive.
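The ~100t/s figure above can be checked with a back-of-the-envelope calculation. This is only a sketch of a bandwidth-bound upper limit: it assumes every decoded token streams all active weights from memory exactly once, that weights take 1 byte per parameter (FP8, an assumption not stated in the comment), and it ignores KV cache reads and inter-card overhead. The function name is made up for illustration.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_b: float,
                             bytes_per_param: float = 1.0) -> float:
    """Rough upper bound on decode speed for a memory-bandwidth-bound model.

    bandwidth_gb_s:  aggregate memory bandwidth in GB/s
    active_params_b: active parameters per token, in billions
    bytes_per_param: 1.0 for FP8 (assumed here), 0.5 for FP4
    """
    # GB of weights that must be read to produce one token
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Numbers from the comment: two cards at ~700GB/s each, 13B active parameters
    print(f"one card:  {decode_tokens_per_second(700, 13):.0f} t/s")
    print(f"two cards: {decode_tokens_per_second(1400, 13):.0f} t/s")
```

Under these assumptions two cards land a little above 100t/s, which is in line with the estimate; real throughput would be lower once KV cache traffic and scaling losses are counted.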

ricardobeat 3 days ago

Presumably this has been in production for a while, and is one of the reasons they were able to dramatically lower prices a month ago?

chronogram 3 days ago

Yes. Section 5 talks about real-world deployment: 5.1: "The DSpark draft models are co-deployed with the preview versions of DeepSeek-V4-Flash and DeepSeek-V4-Pro"; 5.4: "MTP-1 represents the former production setup, having been superseded by DSpark two weeks following the DeepSeek-V4-preview release."
_0ffh 3 days ago

Lookahead Sparse Attention should be playing a big role as well, as it dramatically slashes memory consumption.
sourcecodeplz 3 days ago

Good catch. They reduced prices by 75%, which seems exactly in line with the speed/inference optimization gains?

Jackobrien 3 days ago

I see a world soon where there’s an extremely wide variety of small models for speculative decoding, unique to use cases, companies, and even individuals.

nicce 3 days ago

Hopefully that is the case and hardware does not get impossible to get.
pydry 3 days ago

yes, heavily constrained by sophisticated guardrails.
this is definitely where things are going. the enormous "eat the world" models have extreme diminishing returns by comparison.
Der_Einzige 3 days ago

You clearly didn't read the recent speculative decoding papers because it's been possible to use any model to speculate for any other model for a while. They solved the tokenization problems that prevented this in the past.

pokot0 3 days ago

I am wondering if this is why they can offer their pro model at ~1/4th of the price compared to the other providers offering the same model, and if other providers will be able to do the same in a short timeframe.

sfifs 3 days ago

Inference I estimate runs 90% plus gross margins. Just work out the math on these servers. I am pretty sure any player can price down. It wouldn't look good on an IPO prospectus.
sschueller 3 days ago
I have been heavily using DeepSeek V4 Pro at Max for a month now and I would say it is 100x cheaper. If I pay for Claude I will hit that limit so fast I am always waiting 5 hours. Using the frontier models at Kilo I go through dollars while doing the same thing via DeepSeek it is pennies.
- ddxv 3 days ago
  
  I believe the comment you replied to was talking about the cost on providers like OpenCode vs Deepseek API. Deepseek API is even cheaper than the other providers for the same deepseek models.
vidarh 3 days ago
It'd presumably help a lot, but also when you use their endpoint they get more training data.
- nicce 3 days ago
  
  This applies to every provider. OpenAI seems to be the worst hoarder.
  
  6 replies →
- epolanski 3 days ago
  
  US labs do it too.
  
  2 replies →
- flipped 3 days ago
  
  US labs are the biggest data broker in the current history. They collect everything, dumb fuck.

danielabinav160 3 days ago

Would love to see these numbers reproduced on consumer GPUs, not just A100s.

wolttam 3 days ago

This is an efficiency improvement that significantly lowers the amount of RAM you have to look at, on average, during decode.
It should improve performance on most hardware because most LLMs are memory bandwidth bound during decode.
tommica 3 days ago
Maybe someday an 8GB video card can be used for coding...
- romanusrome 3 days ago
  
  [dead]

segmondy 3 days ago

As we can see again, this has nothing to do with distillation, yet for every gain Chinese labs make, the US labs will accuse them of theft. Yet they are constantly innovating.

lelanthran 3 days ago

These companies providing tokens, whether SOTA or not, that want to IPO are so fucked as time goes on.

Can't sell their SOTA models, only slightly better than the open source models for the models they can sell, cost 20x to 50x for good models, a TAM that consists almost solely of developers, with no customer of theirs actually boasting increased profits as a result of AI...

I fear their time to IPO may have passed.

utopiah 3 days ago
The question is even, was there EVER a time for an IPO?
If the business model requires hundreds of billions to get the required quality (R&D but also infrastructure to collect data and train, either purchased or rented from a 3rd party) while "only" dozens of billions can be earned back (as costs still exist to earn, it's not free once models are trained), then maybe there NEVER was nor will be a good time for an IPO in a rational market.
- notnullorvoid 3 days ago
  
  > in a rational market.
  Unfortunately the market is often not rational in this way.
  Hype within the retail market means there are suckers willing to buy. The institutional market knows there are suckers when the hype is high. Both drive the price up, and retail investors are the ones left when it falls.
- 2838383838 3 days ago
  
  IPOs with massive bags can be wework or spacex, it all depends on vibes. If they buy a couple more articles doomposting and glazing AI in the Financial Times right before exit they will def find a bunch of boomers to buy their bags. If the narrative changes before they IPO it's over.

dnchdnd 3 days ago

Interesting side effect that this seems to place downward pressure on the margins of their western competitors: by sharing not only models but serving optimisations, even those third parties serving the DeepSeek models can do so more efficiently, multiplying the effect

porphyra 3 days ago

I thought this had something to do with the DGX Spark at first from the name haha. (Incidentally, a lot of recent work has gone into making the DGX Spark better at inference, like MTP yielded a 50-100% speedup, so DSpark will likely be very helpful to that end as well)

lightedman 3 days ago

Anyone want to bet that much like speculative execution, speculative decoding is going to introduce a whole slew of vulnerabilities in the ways LLMs work?

skirmish 3 days ago
Don't think so because all tokens predicted speculatively are still validated against the main model (which is faster than predicting them from scratch) and only accepted if they match exactly.
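The validation step described above can be sketched for the greedy case. This is a toy illustration, not DSpark's implementation: `verify_draft` and `toy_target` are hypothetical names, the "model" is a stand-in function, and the target is called once per position for clarity, whereas a real engine scores all draft positions in a single batched forward pass. Sampling-based variants use a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def verify_draft(prefix: Sequence[int],
                 draft: Sequence[int],
                 target_next: NextToken) -> List[int]:
    """Keep draft tokens while they match the target model's own greedy choice.

    At the first mismatch the target's token is emitted instead and the rest of
    the draft is discarded. If every draft token matches, the target contributes
    one extra token, so each call yields at least one token.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # the target's token wins on mismatch
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # bonus token after a fully accepted draft
    return accepted


def toy_target(context: Sequence[int]) -> int:
    """Toy stand-in for the main model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The draft guesses 4, 5 correctly, then wrongly guesses 9 where the target wants 6.
    print(verify_draft([1, 2, 3], [4, 5, 9, 7], toy_target))  # [4, 5, 6]
```

Because every emitted token is one the target model would have chosen anyway, the output is identical to plain greedy decoding with the target; the draft only changes how many tokens come out per verification pass.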
- lightedman 1 day ago
  
  If your main model is inherently-busted does validation actually matter?
  
  1 reply →

zftnb666 3 days ago

AI making AI faster. Next up: AI writing papers about how AI makes AI faster

wg0 3 days ago

That's why I pay them. Regularly. Without fail. Even though my token usage isn't that much.

But I vote for these heroes with my wallet. Just yesterday did again.

noIdeaTheSecond 3 days ago

Kudos to you! If people realized how much power we had, we'd have a better world

2838383838 3 days ago

Must be wonderful to be on the board of OpenAI et al & their PE investors whilst China keeps blowing up these mines under their feet lmao. Luckily Korean pension funds will buy all the trash as usual but goddamn you gotta start moving quick or you are gonna need some serious AGI to show you how to offload those bonds

throwa356262 3 days ago

Why do you think they have started accusing Chinese labs of stealing and distillation?
A&O no longer have the moat to justify their high valuation. The only thing they can do now is to get the government to forbid the Chinese models.
ForHackernews 3 days ago
"We will build the machine-god and pray for it to pay for itself."
- FridgeSeal 3 days ago
  
  Every day, the rate of “could post a picture of 40k tech priests and have it taken unironically” goes up, and it’s starting to get concerning.
ozgrakkurt 3 days ago

Don’t worry they will sell all the hardware and data they acquired with their grift