There's still a lot of opportunity for software optimizations here. The trouble is that really only two classes of systems get optimizations for Deepseek: one small GPU plus a lot of RAM (ktransformers), and the system that has all the VRAM in the world.
A system with, say, 192GB VRAM and the rest standard memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still in theory run Deepseek at 4-bit quite quickly because of the power-law-style usage of the experts.
If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
This would be an easier job for pruning, but I still think enthusiast systems are going to trend, over the next couple of years, in a way that makes these kinds of software optimizations useful on a much larger scale.
There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be using full bandwidth during tensor parallelism) that gets 7 tokens/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so there's something else limiting performance.
Or merge the bottom 1/8 (or whatever) experts together and (optionally) do some minimal training with all other weights frozen. Would need to modify the MoE routers slightly to map old -> new expert indices so you don't need to retrain the routers.
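A rough sketch of what that could look like in PyTorch - the layout (a list of per-expert weight dicts plus routing counts) is hypothetical and just for illustration, not DeepSeek's actual structure:

```python
import torch

def merge_bottom_experts(experts, usage, keep=56):
    """experts: list of per-expert weight dicts; usage: routing counts per expert.
    Keeps the `keep` most-used experts and folds the rest into one averaged
    catch-all expert, returning a remap table so the router's old indices still
    resolve (the 56-of-64 default is illustrative, not DeepSeek's expert count)."""
    order = torch.argsort(usage, descending=True).tolist()
    kept = [experts[i] for i in order[:keep]]
    merged = {name: torch.stack([experts[i][name] for i in order[keep:]]).mean(0)
              for name in experts[0]}
    kept.append(merged)
    # Old expert id -> new id; every pruned expert points at the catch-all slot.
    remap = torch.full((len(experts),), keep, dtype=torch.long)
    remap[order[:keep]] = torch.arange(keep)
    return kept, remap
```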
The sad reality is that the MI300X isn't a monolithic die, so the chiplets have internal bandwidth limitations (of course less severe than going over PCIe/NVLink).
In AMD's own parlance, the "Modular Chiplet Platform" presents itself either in the single-I-don't-care-about-speed-or-latency "Single Partition X-celerator" mode or in the multiple-I-actually-totally-do-care-about-speed-and-latency-NUMA-like "Core Partitioned X-celerator" mode.
This is a great explainer from an LLM perspective, and it would be interesting to see a computational scheduling explanation in depth. I presume that hyperscale LLM companies extensively examine the computation trace to identify bottlenecks and idle bubbles, and develop load balancers, pipeline architectures and schedulers in order to optimise their workload.
The batching requirement for efficiency makes high-security applications quite difficult, because the normal technique of isolating unrelated queries would become very expensive. NVIDIA's vGPU virtualisation time-shares GPU memory, and every switch requires unloading and reloading context; I doubt it does any deduplication. Multi-Instance GPU (MIG) splits GPU memory between users, but it is a fixed partitioning scheme (you have to reboot the GPU to change it), and nobody wants to split their 96GB GPU into 4x 24GB GPUs.
Makes me wonder what the tradeoff would be for putting second-level memory on the GPU board (i.e. normal DRAM), so that different matrix data can be loaded faster than over PCIe and the HBM effectively becomes a cache.
(I'm also really liking the honesty in the author's book on Software Engineering, not in the dry IEEE sense, but as a survival guide in a large enterprise.
https://www.seangoedecke.com/book/ )
Or Apple silicon for low batch size (ideally 1). The unified memory allows running larger models at the expense of them running slower, because of lower bandwidth/FLOPS than a normal GPU. But MoEs only touch a small fraction of their parameters per token, so the compute needs are low. I have seen people report decent speeds for DeepSeek with single-batch inference on Macs. It is still expensive by many people's standards, though, because it takes a lot of $$$ to get enough memory.
In some ways, MoE models are a perfect fit for Macs (or any similar machines that may come out). In contrast, ordering a Mac with an upgraded RAM size and running dense models that only just fit in the VRAM can be very painful.
I was talking with a colleague the other day and we came to the conclusion that, in our experience, if you're using LLMs as a programming aid, models are really being optimised for the wrong things.
At work I often compare locally run 4-30B models against the various GPTs (we can only use non-local models for a few things, because of confidentiality issues). While e.g. GPT-4o gives better results on average, the chance of it making parts of the response up is high enough that one has to invest a significant amount of effort checking and iterating over results. So the difference in effort is not much lower compared to the low-parameter models.
The problem is both are just too slow to really iterate quickly, which makes things painful. I'd rather have a lower-quality model (but with a large context) that gives me near-instant responses than a higher-quality model that is slow. I guess that doesn't give you the same headlines as an improved score on some evaluation.
It is not "slow and expensive" - though it may be one or the other. You can get 3 tokens/second running on DDR4 memory on a two-generation-old workstation that costs ~$1K, via llama.cpp.
Workstations routinely accommodate much more than that. The "under $1K" price referred to a 768GB build (12x 64GB sticks on a Skylake-based system); you could also do a dual-socket version with twice that, at the cost of messing with NUMA (which could be a pro or a con for throughput depending on how you're spreading bandwidth between nodes).
>It’s a peculiar feature of transformer-based LLMs that computing a batch of completions at the same time is almost as fast as computing a single completion. Why is that?
Incorrect. Transformers usually contain a classical MLP layer. Only the MLP layer can be batched. Hence all classical neural networks including convolutional networks (via im2col) can be batched.
If there's anything that the transformer architecture changes, it is that the attention layer cannot be batched.
Yeah this part was confusing, because it's only mentioned halfway through the article that the attention step can only be batched across matching context-window sizes.
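A toy contrast in Python (shapes only, not a real model) of why the feed-forward weights batch trivially while attention has to walk each sequence's own KV cache:

```python
import numpy as np

batch, d = 4, 8
ctx = [3, 7, 5, 9]                      # each sequence has a different context length
W = np.random.randn(d, d)               # one MLP weight matrix shared by everyone
x = np.random.randn(batch, d)           # the current token's activations, one row per sequence

mlp_out = x @ W                         # a single GEMM covers the whole batch

attn_out = []
for i in range(batch):                  # attention: each sequence attends over its own KV cache
    K = np.random.randn(ctx[i], d)
    V = np.random.randn(ctx[i], d)
    scores = (x[i] @ K.T) / np.sqrt(d)
    p = np.exp(scores - scores.max()); p /= p.sum()
    attn_out.append(p @ V)
```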
If I understand it correctly, a token's output is a weighted sum of the individual computations of each expert that token is routed to, and the experts are selected per token. Since a sum is commutative, though, it should be possible to copy a large batch of tokens to multiple GPUs and stream the experts into VRAM, partitioned across the GPUs. Then the bottleneck is your PCIe bandwidth. With 2 GPUs at Gen 4 x16, you should have about 60 GB/s of TX bandwidth, allowing you to upload a quantized copy of DeepSeek (about 360 GB) in about 6 seconds.
Then you just optimize your batch size to match the compute time to the upload time of each GPU. The expert calculation results can be retrieved from the GPUs and summed up.
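Back-of-the-envelope for that scheme (the per-link figure is the usual ~31.5 GB/s usable for PCIe Gen 4 x16; treat all numbers as assumptions, not measurements):

```python
model_bytes = 360e9            # the ~360 GB quant mentioned above
pcie_gen4_x16 = 31.5e9         # usable one-direction bandwidth per x16 Gen 4 link, bytes/s
num_gpus = 2

seconds_per_full_stream = model_bytes / (num_gpus * pcie_gen4_x16)
print(seconds_per_full_stream)  # ≈ 5.7 s, i.e. the "about 6 seconds" above
```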
Do the individual requests in a batch influence each other?
Not in a floating-point, non-deterministic kind of way, where exact ordering might introduce some non-determinism (being position 5 versus position 10 in the batch, let's say).
I'm asking in a semantic way: can context from one request leak into another because they are in the same batch?
I don't know the exact cost breakdown, but they've come up with a few really inspiring, qualitatively high-value papers that demonstrate how they further increased efficiency at their scale. Along with them, they also published quite a few repositories with fully open-source code.
I stopped using ChatGPT as it was just reinforcing my prompts and never giving deeper insights - plus something I'd call manipulative behaviour.
DeepSeek was seriously cool, but it started behaving similarly to Google Gemini Pro, which just tries to be lazy if you give it a hard task to chew on. It basically gives you patch files instead of printing out the whole code, which is more tedious to apply manually than copy/pasting the code.
It also started indexing our private repository and some corporate repositories that were on GitHub behind MFA and strict access controls. Definitely illegal.
> It also started indexing our private repository and some corporate repositories that were on GitHub behind MFA and stringent lock. Definitely illegal.
What is "it" in this context, the DeepSeek weights? Sounds like you're talking about some application, but AFAIK, DeepSeek doesn't maintain any applications, only their API + released weights.
> DeepSeek was seriously cool, but it started behaving similar to Google Gemini Pro
You should be able to use the version of DeepSeek that you prefer indefinitely if you host it yourself or choose that specific version with your preferred provider.
>It basically gives you patch-files instead of printing out the whole code
I've noticed on the Aider leaderboard that Google Gemini Pro has an "Edit Format" listed as "diff-fenced" and things like ChatGPT have "architect" edit format where Aider asks separate "architect" and "code" models. Seems like Gemini Pro prefers the diff format.
You should self-host rather than trust a third-party application if you run into either of those things. The weights are open. DeepSeek didn't change; the application you're accessing it through did.
Or use an enterprise-ready service. Bedrock, firecracker, etc
Had Gemini 2.5 Pro preview running in agent mode in VSCode on a 3000+ line file. It patched it to about 200 lines with a comment in the middle: "// the rest of the code is unchanged".
Depends on who you think its competitors are - deepseek-chat ($0.27/M in; $1.10/M out) is twice as expensive as Gemini 2.5 Flash ($0.15; $0.60) but far cheaper than Claude Sonnet 4 ($3; $15).
That was a pretty good back to reality flex. There really isn't much of a market for expensive products. An inexpensive product that has a few tradeoffs will probably have the advantage. Given how proficient China is at accessing technology resources, it seems likely to me that any chip sanctions against them will probably not be effective.
This reminded me that the economies of scale in AI, especially inference, are huge.
When people say LLMs will be commoditised, I am not sure that means that the market is going to be super competitive. As the economies of scale of AI get even bigger (larger training costs + batch inference etc.) it just seems likely only around 3 companies will dominate LLMs.
For inference cost, I don't see how this is different from cloud providers vs dedicated server providers, where AWS is 5-10x more expensive than hetzner.
Somehow cloud providers manage to add a lot of extra cost to the offering.
Isn't this an arbitrage opportunity? Offer to pay a fraction of the cost per token but accept that your tokens will only be processed when there's spare room in a batch, then resell that at a markup to people who need non-time-sensitive inference?
MoE is in general kind of a stupid optimization. It seems to require around 5x more total parameters for the same modeling power as a dense model, in exchange for needing around 2x less memory bandwidth.
The primary win of MoE models seems to be that you can list an enormous parameter count in your marketing materials.
Stupid? By paying 5x (normally 2-4x, but whatever) of a thing you don't care about at inference you can gain 2x in the primary thing you care about at inference. It's like handing out 4 extra bricks and getting back an extra lump of gold.
The general rule of thumb when assessing MoE <-> dense model intelligence is SQRT(Total_Params * Active_Params). For DeepSeek, you end up with ~158B params. The economics of batch-inferencing a dense ~158B model at scale are different from something like DeepSeek (it is ~4x more FLOPs per token, after all), particularly if users care about latency.
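The arithmetic behind that figure, assuming the commonly cited 671B total / 37B active parameters for DeepSeek V3:

```python
total_params, active_params = 671e9, 37e9
dense_equivalent = (total_params * active_params) ** 0.5
print(dense_equivalent / 1e9)           # ≈ 157.6, i.e. the "~158B" above
print(dense_equivalent / active_params) # ≈ 4.3x the FLOPs per token of the actual MoE
```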
That’s fair. My thought was, when there is an interesting new technology, it usually takes time to figure out how to monetize it. Figuring out how to monetize LLMs took no time at all.
I don't think it's obvious that any of these model providers are even profitable right now. I'm also not sure what there is to "figure out" - it's an expensive technology where the cost scales per token, so they charge per token? Would you rather they burned even more money giving it away for free until everyone was dependent on it, and then hyper-enshittified to try not to go broke like so much of the rest of tech?
I run DeepSeek V3 locally as my daily driver and I find it affordable, fast and effective. The article assumes GPUs, which in my opinion are not the best way to serve large models like this locally. I run a mid-range EPYC 9004 series based home server on a Supermicro mobo which cost all-in around $4000. It's a single-CPU machine with 384GB RAM (you could get 768GB using 64GB sticks, but that costs more). No GPU means power draw is less than a gaming desktop. With the RAM limitation I run an Unsloth Dynamic GGUF which, quality-wise, performs very close to the original in real-world use. It is around 270GB, which leaves plenty of room for context - I run 16k context normally, as I use the machine for other things too, but can up it to 24k if I need more. I get about 9-10 tokens per second, dropping to 7 tokens/second with a large context. There are plenty of people running similar setups with 2 CPUs who run the full version at similar tokens/second.
> Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original
How close are we talking?
I'm not calling you a liar, OP, but in general I wish people perpetuating such broad claims would be more rigorous.
Unsloth does amazing work; however, as far as I'm aware, even they themselves do not publish head-to-head evals against the original unquantized models.
I have sympathy here because very few people and companies can afford to run the original models, let alone engineer rigorous evals.
However I felt compelled to comment because my experience does not match. For relatively simple usage the differences are hard to notice, but they become much more apparent in high complexity and long context tasks.
Oh hey :) Thanks for the kind words - we did provide benchmarks (MMLU, KLD, perplexity) for Llama 4 Scout and Gemma 3 27B using our methodology (https://news.ycombinator.com/item?id=39671146), and KLD and perplexity are much more important than raw benchmark scores anyway :)
We also provide Q8_0 and Q8_K_XL quants, which are mostly equivalent to FP8 - you can also use the magical `-ot ".ffn_.*_exps.=CPU"` incantation to offload MoE layers to RAM!
You are right that I haven't been rigorous - it's easy to benchmark tokens/second, but quality of output is more difficult to nail down. I couldn't find any decent comparisons for Unsloth either. So I just tried a few of their models out, looking for something that was 'good enough', i.e. does all I need: coding, summarizing documents, troubleshooting anything and everything. I would like to see head-to-head comparisons too - maybe I will invest in more RAM at some stage, but so far I have no need for it. I ran some comparisons between the smaller and larger versions of the Unsloth models and, interestingly (for me anyway), didn't notice a huge amount of difference in quality between them. But the smaller models didn't run significantly faster, so I settled for the biggest model I could fit in RAM with a decent context. For more complex coding I use DeepSeek R1 (again the Unsloth quant), but since it's a reasoning model it isn't real-time, so it's no use as my daily driver.
I am impressed. Your personal website is down. HN doesn't allow private messages.
I'm Jeff Carr. I co-founded DigitalOcean. I assume I can't post email addresses here, but I will try - let's see how smart things are about banning me. I am: wit AT wit com
The state of the art for local models is even further along.
For example, look into https://github.com/kvcache-ai/ktransformers, which achieves >11 tokens/s on a relatively old two-socket Xeon server plus a retail RTX 4090 GPU. Even more interesting is the prefill speed of more than 250 tokens/s. This is very useful in use cases like coding, where large prompts are common.
The above is achievable today. In the meantime, the Intel guys are working on something even more impressive. In https://github.com/sgl-project/sglang/pull/5150 they claim to achieve >15 tokens/s generation and >350 tokens/s prefill. They don't share what exact hardware they run this on, but from various bits and pieces across various PRs I reverse-engineered that they use 2x Xeon 6980P with MRDIMM 8800 RAM, without a GPU. The total cost of such a setup will be around $10k once cheap engineering samples hit eBay.
Pretty sure you can post email addresses here, this is mine: saagar@saagarjha.com. It's more about avoiding spam.
You can post emails fine, you just might get spammed (because it's a public forum).
You can put your email in your profile
fyi, your website is also down... wit.com doesn't resolve for me
The latest V3 strikes me as a really practical go-to among open-weights models. Lots of tasks don't need the reasoning tokens, and not having to wait for them is nice. (If something does need it you can always switch.) If you're not running it yourself a couple providers have it with full context, 80tps, and a promise not to use your data.
9004 home server is awesome!
Impressive. I need to look more into this. I'm doing my best to limit my LLM usage to what I can run locally.
What's your prompt processing speed? That's more important in this situation than output TPS. If you have to wait minutes to start getting an answer, that makes it much worse than a cloud-hosted version.
Prompt eval time varies a lot with context, but it feels real-time for short prompts - approx. 20 tokens per second, though I haven't done much benchmarking of this. When there is a lot of re-prompting in a long back-and-forth it is still quite fast - I do use the KV cache, which I assume helps, and also quantize the KV cache to Q8 if I am running contexts above 16k. However, if I want it to summarize a document of, say, 15,000 words, it does take a long time - there I walk away, come back in about 20 minutes, and it will be complete.
If he is doing multi-turn conversations, he can reuse the KV cache from the last turn and skip the prompt processing on the history (which is what would make time to first token too slow), only doing prompt processing on his actual prompt for the current turn. This turns a quadratic number of tokens to process into a linear one. I am not sure if this is what he is doing, but that is what I would do if I had his hardware.
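A toy illustration of that quadratic-to-linear point (turn lengths are made up, and it only counts prompt tokens, ignoring the model's own replies):

```python
def tokens_prefilled(turn_lengths, reuse_cache):
    processed_so_far, total = 0, 0
    for n in turn_lengths:
        # Without cache reuse, every turn re-processes the whole history plus the new prompt.
        total += n if reuse_cache else processed_so_far + n
        processed_so_far += n
    return total

turns = [500] * 20                      # 20 turns of ~500 prompt tokens each
print(tokens_prefilled(turns, False))   # 105000 tokens prefilled without reuse
print(tokens_prefilled(turns, True))    # 10000 tokens prefilled with reuse
```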
I assume KV caching makes this a non issue, but I'm also curious.
I use a dual-socket 18-core (so 36 cores total) Xeon with 768GB of DDR4, and get about 1.5-2 tokens/sec with a 4-bit quantized version of the full DeepSeek models. It really is wild to be able to run a model like that at home.
Dumb question: would something like this have a graphics card too? I assume not
impressive, but that's 1/5 to 1/10 of the throughput that you'd get with a hosted provider, with 1/4 to 1/8 the supported context
It might be 5 to 10 times slower than a hosted provider but that doesn't really matter when the output is still faster than a person can read. Context wise, for troubleshooting I have never needed over 16k and for the rare occasion when I need to summarise a very large document I can change up the model to something smaller and get a huge context. I have never needed more than 32k though.
Dude, he's running locally, and I think this setup is the best bang for the buck if you wanna run locally - we're not comparing to data centers, you gotta keep it in perspective. Those are very impressive results for running local. Thanks for the numbers, you saved me a ChatGPT search :)
So, in your opinion, hardware-wise, as general-purpose tinkering/learning home-lab hardware, how would you grade the decked-out Framework Desktop at $2.7k?
I thought GPUs with a lot of extremely fast memory were required for inference. Are you saying that we can accomplish inference with just a large amount of non-unified system memory and no GPU? How is that possible?
Basically it comes down to the memory bandwidth of server CPUs being decent. A bit of an oversimplification here, but... the model and context have to be pulled through RAM (or VRAM) every time a new token is generated. CPUs designed for servers with lots of cores have decent bandwidth - roughly 460GB/s on the EPYC 9004 series, using all 12 DDR5 channels simultaneously. So, in theory, they can pull that much through the system every second. GPUs are faster, but you also have to fit the entire model and context into VRAM, so for larger models they are extremely expensive: a decent consumer GPU only has 24GB of VRAM and costs silly money if you need 20 of them. Whereas you get a lot of RDIMM RAM for a couple of thousand bucks, so you can run bigger models, and ~460GB/s gives output faster than most people can read.
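To put a rough ceiling on the numbers above (all assumptions: ~460GB/s of bandwidth, the ~270GB dynamic quant, and DeepSeek V3's ~37B active parameters per token):

```python
bandwidth = 460e9                      # bytes/s, 12 channels of DDR5-4800
bytes_per_param = 270 / 671            # ~0.40 bytes/param for a ~270GB quant of 671B params
active_params = 37e9                   # parameters touched per generated token (MoE)

bytes_per_token = active_params * bytes_per_param
print(bandwidth / bytes_per_token)     # ≈ 31 tok/s best case; routing, attention and NUMA
                                       # overheads explain the observed ~9-10 tok/s
```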
I’m confused as to why you think a GPU is necessary? It’s just linear algebra.
Do you have hard numbers on the idle/average/max power draw? I assumed that server machines are built as if they are going to be red-lined constantly, so less effort goes into low-utilization optimizations.
No hard numbers, I'm afraid, as I don't monitor the power draw. But the machine uses a standard ATX power supply - a Corsair RM750e 750W PSU - and the default TDP of the CPU is 280W; I have my TDP set at 300W. It is basically built like a desktop: ATX form factor, fans spin down at idle, etc.
Can we run DeepSeek using Ollama or something similar for code generation (like GitHub Copilot) on a 40-core CPU with about 256GB RAM, say 200GB usable for the model?
Just curious what your use cases are? What type of texts are you producing?
Thank you.
I've always wondered this as well, and never seem to get an answer. Why would someone want to do this when they can get a better result either renting in the cloud, or just using a subscription?
Obviously I see the value in having something local from a control and privacy perspective, but it's surely always a net loss in terms of quality and capability of output, right?
Coding - my own proprietary code, hence my desire for local hosting, plus a decent amount of legacy code. General troubleshooting of anything and everything, from running Linux servers to fixing my car. Summarizing and translation of large documents occasionally. Also image generation and other automations, but obviously not LLMs for those.
CPUs are quietly becoming very well-balanced machines for batch-size-1 inference. The latest Intel Xeons should be at ~20 TPS.
A base Mac Mini is ~20 :)
This is an interesting blogpost. While the general conclusion ("We need batching") is true, inference of mixture of experts (MoE) models is actually a bit more nuanced.
The main reason we want big batches is that LLM inference is not limited by compute but by loading every single weight out of VRAM. Just compare the number of TFLOPS of an H100 with its memory bandwidth: there's basically room for ~300 FLOPs per byte loaded. So that's why we want big batches: we can perform a lot of operations per parameter/weight that we load from memory. This limit is often referred to as the "roofline model".
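The arithmetic behind that ratio, assuming H100 SXM spec-sheet numbers (~990 TFLOPS dense BF16, ~3.35 TB/s of HBM3 bandwidth):

```python
peak_flops = 990e12        # dense BF16 tensor-core throughput, FLOP/s
hbm_bandwidth = 3.35e12    # bytes/s

print(peak_flops / hbm_bandwidth)   # ≈ 295 FLOPs available per byte loaded from HBM
```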
As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.
So what MoE allows is expert parallelism, where different nodes keep different experts in memory and don't need to communicate as much between nodes. This only works if there are enough nodes to keep all experts in VRAM and have enough overhead for other stuff (KV cache, other weights, etc). So naturally the possible batch size becomes quite large. And of course you want to maximize this to make sure all GPUs are actually working.
You could load different "experts" in a round-robin way on a single node and only aggregate "batches" opportunistically, when you just have multiple requests in-flight that all happen to rely on the same "expert". The difference being that instead of "batches", you would only really have queues. Of course this would come with a sizeable increase in latency, but that's acceptable for many applications (such as for "deep research" workflows)
This is very much like Erlang's actor model. The same compute can be run in parallel, or managed via queues. With Erlang's strong support for FFI and process control, I wonder if it's being used as a dispatcher for these sorts of workloads.
> As models become bigger, this does not scale anymore because the model weights will not fit into GPU memory anymore and you need to distribute them across GPUs or across nodes. Even with NVLink and Infiniband, these communications are slower than loading from VRAM. NVlink is still fine for tensor parallelism, but across nodes this is quite slow.
Inference works by computing a layer and then handing a very small vector to the next layer as input. When a model does not fit on a single GPU, you just divide it into layers and send that vector over a fabric to the GPU holding the next layer. The transfer happens so quickly that there is a negligible amount of idle time before the next layer can be computed. The fastest inference on the planet, at Cerebras, uses this technique to do 2,500 tokens/sec on Llama 4 Maverick.
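For a sense of scale (assuming DeepSeek V3's hidden size of 7168 and fp16 activations - both assumptions on my part):

```python
hidden_size = 7168
activation_bytes_per_token = hidden_size * 2   # fp16
print(activation_bytes_per_token)              # ~14 KB per token at each layer boundary,
                                               # trivial next to the tens of GB of weights per device
```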
Groq and Cerebras both take a big chip approach to architecture and, at least in the case of Groq, they only make economic sense under high batch loads.
https://x.com/swyx/status/1760065636410274162?s=46
Distributing inference per layer, instead of splitting each layer across GPUs, is indeed another approach, called pipeline parallelism. However, per batch there is less compute (only one GPU is active at a time), so inference is slower. In addition, the orchestration of starting the next batch on GPU #0 while GPU #1 starts on the previous one is quite tricky. For these reasons, tensor parallelism as I described is way more common in LLM inference.
Could such a network, with all its nodes and weights, be deployed to an analog circuit and be super fast?
Do you mean something like this? https://www.etched.com/
Please go into more detail about this proposal, this piqued my interest in a really strange way.
And this is the investment case for AMD: models fit entirely in a single chassis, with the side benefit of needing less (tariffed) network equipment to interconnect compute. Map/reduce instead of clustered compute.
Edit: when downvoting, please offer some insight into why you disagree
How is this a unique advantage for AMD?
> when downvoting, please offer some insight why you disagree
And a reminder that (down)voting is not for (dis)agreement.
For those looking to save time, the answer is batched inference. Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.
This is also why you may experience a variance in replies when using these services, even when you set the temperature to 0 and the seed to a fixed value. It's cause you don't control the other prompts yours get batched with. Could this be a data exfiltration attack vector? Probably, I didn't "research" that far.
> Pretty much running multiple people's "prompts" through a model instance at the same time instead of just really tightly timesharing each model instance.
I naively assumed providers did that with all models. Or does it only work for this (family of?) model(s)?
It works for a lot of families but not all. You need a high enough degree of sharing of model weights between different queries for that to make sense (memory access being the usual bottleneck nowadays, though smaller models see something similar with matmul batch efficiencies for CPU related reasons).
Fully connected transformers trivially work (every weight for every query). MoE works beyond a certain size or with certain types of mixing (still using every weight, or using a high enough fraction that there's some sharing with batches of 20+ queries). As you push further that direction though (lots of techniques, but the key point being accessing less of the model at once and bypassing some of it for each query), you need larger and larger batches for those efficiency gains to materialize. At some point it becomes untenable because of latency waiting for batches of data, and past that it becomes untenable because of the volume of query data.
Batching. Yes.
And one thing it can help with locally is when you rate certain content and want to make sure it didn't hallucinate. So you toss the prompt in 3 or 5 times, or... batch_size times :)
Curious that batch inference has been there from day one, but it takes a while for people to see/grasp/grok it.
> other prompts yours get batched with
Why would batching lead to variance?
> Why would batching lead to variance?
Depending on the shape of the data, a slightly different kernel implementation (for e.g. matrix multiplication) will be optimal, and those give slightly different results. There could also be other sources of non-determinism depending on the implementation (e.g. some kernels are inherently not entirely deterministic, as they use tricks to go faster).
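The root cause is that floating-point addition isn't associative, so any change in reduction order (which batching and kernel selection both cause) can flip low-order bits:

```python
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b)        # 0.6000000000000001 vs 0.6
print(a == b)      # False - same terms, different grouping, different result
```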
Attention doesn't get batched, and the runtime of attention for a given user's token depends on the total context length. Hence, even in the ideal scenario where you get a dedicated attention-calculating GPU, the MLP-calculating GPU doing the batching will have to wait for the slowest user.
In the worst-case scenario you are sharing a single attention-calculating GPU with someone who has a super long context window; then that guy will be hogging most of the memory bandwidth of the GPU, even though you are both generating the same number of tokens.
This means that in the distributed setting you will not only need dedicated GPUs for the model and attention calculations, you will also need to duplicate the whole setup for a variety of context lengths, so that long contexts are batched alongside other long contexts and short contexts are batched alongside other short contexts.
Batching can lead to variance with things like batch norm, but most transformers use layer norm, which avoids this problem.
Because these models are context-sensitive. Every token can influence the output.
In some mixture-of-experts approaches, samples or tokens are being distributed among experts. The experts are selected by trying to predict what is a good expert-sample match. Depending on your neighbors in the batch, you might be assigned different experts.
Sounds like an amazing attack vector if your prompts get mixed with others'.
What's the average batch size?
Wow, almost like Deepseek’s impressive performance is the result of optimisation by smart engineers.
Not sure why the snarky tone, didn't say or imply otherwise, nor did anyone else in the thread so far that I could see.
Here’s a concise explanation:
- High sparsity means you need a very large batch size (number of requests being processed concurrently) so that each matrix multiplication is of sufficient arithmetic intensity to get good utilization.
- At such a large batch size, you’ll need a decent number of GPUs — 8-16 or so depending on the type — just to fit the weights and MLA/KV cache in HBM. But with only 8-16 GPUs your aggregate throughput is going to be so low that each of the many individual user requests will be served unacceptably slowly for most applications. Thus you need more like 256 GPUs for a good user experience.
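Rough arithmetic for the "8-16 GPUs just to fit the weights" point (assuming ~671B parameters served at FP8, i.e. about one byte per parameter, and 80GB of HBM per H100):

```python
weight_gb = 671          # ~1 byte/param at FP8
hbm_per_gpu_gb = 80      # H100

print(weight_gb / hbm_per_gpu_gb)   # ≈ 8.4 GPUs for the weights alone, before KV/MLA cache
                                    # and activations - hence "8-16 depending on the type"
```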
I'm serving it on 16 H100s (2 nodes). I get 50-80 tok/s per request, and in aggregate I've seen several thousand. TTFT is pretty stable. It's faster than any cloud service we can use.
H200s are pretty easy to get now. If you switched, I'm guessing you'd get a nice bump because the NCCL allreduce on the big MLPs wouldn't have to cross InfiniBand.
You're presumably using a very small batch size compared to what I described, thus getting very low model FLOP utilization (MFU) and high dollar cost per token.
You could do it on one node of 8xMI300x and cut your costs down.
Using vllm?
> High sparsity means you need a very large batch size
I don't understand what connection you're positing here? Do you think sparse matmul is actually a matmul with zeros lol
It's sparse as in only a small fraction of tokens are multiplied by a given expert's weight matrices (this is standard terminology in the MoE literature). So to properly utilize the tensor cores (hence serve DeepSeek cheaply, as the OP asks about) you need to serve enough tokens concurrently such that the per-matmul batch dimension is large.
I'm not an ML researcher or engineer, so take this with a grain of salt, but I'm a bit confused by this post.
DeepSeek V3/R1 are expensive to run locally because they are so big compared to the models people usually run locally. The number of active parameters is obviously lower than the full model size, but that basically just helps with the compute requirements, not the memory requirements. Unless you have multiple H100s lying around, V3/R1 are only run locally as impractical stunts, with some or all of the model stored in low-bandwidth memory.
We can't compare the size of DeepSeek V3 to that of any proprietary frontier models because we don't know the size of those models at all (or even their architecture). The models it's being compared to, the ones that are "expensive at scale", you can't run locally at all - and surely we have no reason to believe they'd somehow be cheap to run locally?
But I thought you'd typically expect exactly the opposite effect than is claimed here? MoE should be the better tradeoff for the local/single-user scenario since the downside of batching being harder / less efficient doesn't matter.
> Bigger batches raise latency because user tokens might be waiting up to 200ms before the batch is full enough to run, but they boost throughput by allowing larger (and thus more efficient) GEMMs in the feed-forward step
Is it really that the matrices being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices; it's to move the bottleneck from memory bandwidth to compute. The matrices are already sharded to a much smaller size than the size of the entire model or even a layer. So you'll basically load some slice of the weights from HBM to SRAM, do the multiplication for that slice, and then aggregate the results once all tiles have been processed. Batching lets you do multiple separate computations with the same weights, meaning you get more effective FLOPS per unit of memory bandwidth.
> The fact that OpenAI and Anthropic’s models are quick to respond suggests that either:
Is that actually a fact? The post has no numbers on the time to first token for any of the three providers.
Hi, I wrote the post! Also not a ML researcher, just an interested engineer, so I'm sure I got some things wrong.
> MoE should be the better tradeoff for the local/single-user scenario since the downside of batching being harder / less efficient doesn't matter.
What I meant was that the single-user scenario is going to get dramatically worse throughput-per-GPU, because they're not able to reap the benefits of multi-user batching (unless they're somehow doing massively parallel inference requests, I suppose).
> Is it really that the matrixes being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices. It's to move the bottleneck from memory bandwidth to compute.
As I understand it, you want larger input matrices in order to move the bottleneck from memory to compute: if you do no batching at all, your multiplications will be smaller (the weights will be the same, of course, but the next-token data you're multiplying with the weights will be 1xdim instead of batch-size x dim), so your GPUs will be under-utilized and your inference will spend more time doing memory operations and less time multiplying.
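A rough way to see it in numbers (FP8 weights, 2-byte activations, and a DeepSeek-sized hidden dimension are my assumptions here):

    # Arithmetic intensity of y = x @ W for one weight matrix
    dim = 7168
    for batch in (1, 32, 256):
        flops = 2 * batch * dim * dim
        bytes_moved = dim * dim * 1 + 2 * batch * dim * 2   # weights + read x / write y
        print(batch, "->", round(flops / bytes_moved), "FLOP per byte of memory traffic")

An H100 needs on the order of hundreds of FLOPs per HBM byte to be compute-bound at FP8, so the ~2 FLOP/byte you get at batch size 1 leaves the tensor cores mostly idle.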
> The post has no numbers on the time to first token for any of the three providers.
I probably should have hunted down specific numbers, but I think people who've played with DeepSeek and other models will notice that DeepSeek is noticeably more sluggish.
Imagine an FPGA big enough to hold the whole model in LUTs (not RAM), with latches in appropriate places to keep race conditions in check. Even a 100 MHz clock would beat almost anything else in the world running it. Even with 500 pipeline stages, you could still get 200,000 tokens per second for a single stream and have 499 other streams' worth of pipeline slots available for other uses.
With an FPGA like that, you could translate all of the matrix multiplies and weights directly into binary logic, optimizing out every multiply or add of a zero bit. This alone could cut the number of gates and computations, and power consumption in half.
Because you wouldn't need to throw data to/from RAM, you'd save a huge percentage of the usual latency and eliminate memory bandwidth issues. The effective equivalent memory bandwidth would likely be measured in exabytes per second.
This is the type of compute load that would perfectly match a bit level systolic array.
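Just to spell out the arithmetic in that thought experiment (treating a token as one trip through a 500-stage pipeline clocked at 100 MHz):

    clock_hz, pipeline_stages = 100e6, 500
    single_stream_tps = clock_hz / pipeline_stages     # 200,000 tokens/s per stream
    streams_in_flight = pipeline_stages                # one token per stage at any time
    print(single_stream_tps, streams_in_flight * single_stream_tps)  # ~100M tok/s aggregate if fully fed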
You'd need an insanely big FPGA for this.
Thanks to gigabit SERDES links, it should be reasonably easy to send the vectors between chips if you need to distribute the work to fit available FPGA hardware.
Note this could also be done if you're just emulating a systolic array on cheap hardware, like Raspberry Pi Picos, using the built-in PIOs to handle the much lower signal rates.
There's still a lot of opportunity for software optimizations here. Trouble is that really only two classes of systems get optimizations for Deepseek, namely 1 small GPU + a lot of RAM (ktransformers) and the system that has all the VRAM in the world.
A system with say 192GB VRAM and the rest standard memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still, in theory, run Deepseek at 4-bit quite quickly because of the power-law-like usage of the experts.
If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
This would be an easier job for pruning, but still I think enthusiast systems are going to trend in a way the next couple years that makes these types of software optimizations useful on a much larger scale.
There's a user on Reddit with a 16x 3090 system (PCIE 3.0 x4 interconnect which doesn't seem to be using full bandwidth during tensor parallelism) that gets 7 token/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so there's something else going on limiting performance.
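Rough roofline for that build (the quant size and the parallelism model are my assumptions, not the Reddit user's):

    bw_per_3090 = 936e9              # bytes/s, i.e. it scans its 24GB ~39x per second
    active_params = 37e9             # DeepSeek V3/R1 active params per token
    bytes_per_token = active_params * 0.55     # ~4.4 bits per param quant, roughly
    print(bw_per_3090 / bytes_per_token)       # ~46 tok/s if only one card streams at a time
    print(16 * bw_per_3090 / bytes_per_token)  # ~735 tok/s if all 16 stream their shard in parallel

Either way the memory-bandwidth ceiling sits far above 7 tok/s, which points at interconnect or scheduling overhead rather than VRAM bandwidth.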
> 16x 3090 system
That's about 5 kW of power.
> that gets 7 token/s in llama.cpp
Just looking at the electricity bill, it's cheaper to use the API of any major provider.
> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
That's interesting. It means the model could be pruned, with those tokens routed to the next-closest expert in the rare cases they do occur.
Or merge the bottom 1/8 (or whatever) experts together and (optionally) do some minimal training with all other weights frozen. Would need to modify the MoE routers slightly to map old -> new expert indices so you don't need to retrain the routers.
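Something like this, as a hypothetical sketch (the shapes and the merge itself are illustrative, not DeepSeek's actual router code):

    import torch

    # Say 256 experts get merged down to 224; keep a table mapping old expert
    # ids to the surviving/merged-into experts so the router weights themselves
    # never need retraining.
    old_to_new = torch.randint(0, 224, (256,))   # placeholder mapping

    def route(router_logits, top_k=8):
        weights, old_ids = torch.topk(router_logits, top_k, dim=-1)
        # Simplification: if two old experts map to the same new one within a
        # token's top-k, their gate weights should really be summed.
        return old_to_new[old_ids], torch.softmax(weights, dim=-1)

    expert_ids, gate_weights = route(torch.randn(4, 256))   # 4 tokens, 256 original experts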
A single MI300x has 192GB of vram.
Sad reality is that the MI300X isn't a monolithic die, so the chiplets have internal bandwidth limitations (of course less severe than going over PCIe/NVLink).
In AMD own parlance, the "Modular Chiplet Platform" presents itself as either single-I-don't-care-about-speed-or-latency "Single Partition X-celerator" mode or in multiple-I-actually-totally-do-care-about-speed-and-latency-NUMA-like "Core Partitioned X-celerator" mode.
So you kinda still need to care what-loads-where.
This is a great explainer from an LLM perspective, and it would be interesting to see a computational scheduling explanation in depth. I presume that hyperscale LLM companies extensively examine the computation trace to identify bottlenecks and idle bubbles, and develop load balancers, pipeline architectures and schedulers in order to optimise their workload.
The batching requirement for efficiency makes high-security applications quite difficult, because the normal technique of isolating unrelated queries becomes very expensive. NVIDIA's vGPU virtualisation time-shares GPU memory, and every switch requires unload/reload context switches (doubtful it does deduplication). Multi-Instance GPU (MIG) splits GPU memory between users, but it is a fixed partitioning scheme (you have to reboot the GPU to change it), and nobody wants to split their 96GB GPU into 4x 24GB GPUs.
Makes me wonder what the tradeoff is for putting second level memory on the GPU board (i.e. normal DRAM), so that different matrix data can be loaded in faster than over PCIe, i.e. the HBM becomes a cache.
(I'm also really liking the honesty in the author's book on software engineering, not in the dry IEEE sense, but as a survival guide in a large enterprise. https://www.seangoedecke.com/book/ )
> mixture of experts requires higher batch sizes
Or Apple silicon for low batch sizes (ideally 1). The unified memory allows running larger models at the expense of running them slower, because of lower bandwidth/FLOPS than a normal GPU. But MoEs only require computing a few parameters at a time, so the computational needs are low. I have seen people reporting decent speeds for DeepSeek with single-batch inference on Macs. It is still expensive by many people's standards, though, because it requires a lot of $$$ to get enough memory.
In some ways, MoE models are a perfect fit for Macs (or any similar machines that may come out). In contrast, ordering a Mac with upgraded RAM and running dense models that only just fit in the VRAM can be very painful.
I was talking with a colleague the other day and we came to the conclusion that, in our experience, if you're using LLMs as a programming aid, models are really being optimised for the wrong things.
At work I often compare locally run 4-30B models against various GPTs (we can only use non-local models for a few things, because of confidentiality issues). While e.g. GPT-4o gives better results on average, the chance of it making parts of the response up is high enough that one has to invest a significant amount of effort to check and iterate over the results. So the difference in effort is not much lower compared to the low-parameter models.
The problem is both are just too slow to really iterate quickly, which makes things painful. I'd rather have a lower quality model (but with large context) that gives me near instant responses instead of a higher quality model that is slow. I guess that's not giving you the same headlines as the improved score on some evaluation.
It is not "slow and expensive"; at most it's one or the other. You can get 3 tokens/second running on DDR4 memory on a two-generation-old workstation system that costs ~$1K, via llama.cpp.
You’re most likely confusing the real DeepSeek with a distilled version, unless you have more than 192GB of RAM.
Workstations routinely accommodate much more than that. The ~$1K price referred to a 768GB build (12x 64GB sticks on a Skylake-based system); you could also do a dual-socket version with twice that, at the cost of messing with NUMA (which could be a pro or a con for throughput, depending on how you spread bandwidth between nodes).
>It’s a peculiar feature of transformer-based LLMs that computing a batch of completions at the same time is almost as fast as computing a single completion. Why is that?
Incorrect. Transformers usually contain a classical MLP layer. Only the MLP layer can be batched. Hence all classical neural networks including convolutional networks (via im2col) can be batched.
If there's anything that the transformer architecture changes, it is that the attention layer cannot be batched.
Yeah this part was confusing, because it's only mentioned halfway through the article that the attention step can only be batched across matching context-window sizes.
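A toy illustration of the distinction (shapes only, nothing DeepSeek-specific):

    import numpy as np

    dim, n_tokens = 1024, 48                 # 48 tokens pooled from many requests
    W_mlp = np.random.randn(dim, 4 * dim)

    # MLP/FFN: every token from every sequence shares W_mlp, so they can all be
    # stacked into one big GEMM regardless of which request they came from.
    x = np.random.randn(n_tokens, dim)
    h = x @ W_mlp                            # one (48 x 1024) @ (1024 x 4096) matmul

    # Attention: each sequence attends only over its own KV cache, so the score
    # computation is ragged and has to be done per sequence (or grouped by length).
    kv_caches = [np.random.randn(np.random.randint(10, 500), dim) for _ in range(3)]
    queries = [np.random.randn(1, dim) for _ in kv_caches]
    scores = [q @ kv.T for q, kv in zip(queries, kv_caches)]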
If I understand it correctly, the effect of the experts is a weighted sum of the individual calculations of each token meeting each expert, where the experts to be met by a token are selected on an individual basis. Since a sum is commutative, though, it should be possible to send a large batch of tokens, copied to multiple GPUs, while the experts are streamed into VRAM, partitioned across the GPUs. Then the bottleneck is your PCIe bandwidth. With 2 GPUs at Gen 4 x16, you should have about 60 GB/s of transfer bandwidth, allowing you to upload a ~4-bit quant of DeepSeek (about 360 GB) in about 6 seconds.
Then you just optimize your batch size to match the compute time to the upload time of each GPU. The expert calculation results can be retrieved from the GPUs and summed up.
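Spelling out the bandwidth math from the comment above (the ~360GB and ~60 GB/s figures are the parent's):

    model_bytes = 360e9            # ~4-bit quant of the full model
    pcie_bytes_per_s = 60e9        # 2 GPUs x PCIe Gen4 x16, ~30 GB/s usable each
    print(model_bytes / pcie_bytes_per_s)   # ~6 s to stream every expert through VRAM once

For streaming to pay off, the queued batch has to be large enough that computing on each resident expert slice takes at least as long as loading the next one.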
Do the individual requests in a batch influence each other?
I don't mean in a floating-point non-determinism kind of way, where exact ordering might introduce tiny differences (being 5th versus 10th in the batch, let's say).
I'm asking in a semantic way: can context from one request leak into another because they are in the same batch?
I haven't looked in a while, but is DeepSeek online still about 1/100th the cost of its competitors?
I don't know the exact cost breakdown, but they've published a few really inspiring and qualitatively high-value papers that demonstrate how they further increased efficiency at their scale. Along with them they also published quite a few repositories of fully open-source code.
I stopped using ChatGPT because it was just reinforcing my prompts and never giving deeper insights, which is something I'd call manipulative behaviour.
DeepSeek was seriously cool, but it started behaving similarly to Google Gemini Pro, which just tries to be lazy if you give it a hard task to chew on. It basically gives you patch files instead of printing out the whole code, which is more tedious to apply manually than copy/pasting the full code.
It also started indexing our private repository and some corporate repositories that were on GitHub behind MFA and stringent lock. Definitely illegal.
> It also started indexing our private repository and some corporate repositories that were on GitHub behind MFA and stringent lock. Definitely illegal.
What is "it" in this context, the DeepSeek weights? Sounds like you're talking about some application, but AFAIK, DeepSeek doesn't maintain any applications, only their API + released weights.
> as it was just reinforcing my prompts and not ever giving deeper insights, except something I call manipulative behaviour.
Try telling Deepseek you want to murder political dissidents. In my experiments Deepseek will start enthusiastically reinforcing your prompts.
> DeepSeek was seriously cool, but it started behaving similar to Google Gemini Pro
You should be able to use the version of DeepSeek that you prefer indefinitely if you host it yourself or choose that specific version with your preferred provider.
How did it have access to your private repo and how did you find out?
>It basically gives you patch-files instead of printing out the whole code
I've noticed on the Aider leaderboard that Google Gemini Pro has an "Edit Format" listed as "diff-fenced", while things like ChatGPT use the "architect" edit format, where Aider queries separate "architect" and "code" models. Seems like Gemini Pro prefers the diff format.
You should self host not trust a third party application if you run into either of those things. The weights are open. DeepSeek didn’t change, the application you’re accessing it through did.
Or use an enterprise-ready service. Bedrock, firecracker, etc
Had Gemini 2.5 Pro preview running in agent mode in VSCode on a 3000+ line file. It patched it to about 200 lines with a comment in the middle: "// the rest of the code is unchanged".
ChatGPT is reinforcing your prompts, DeepSeek is cool but starts acting lazy like Gemini.
So what are you working with now? Deepseek or something else?
Depends on who you think its competitors are - deepseek-chat ($0.27/M in; $1.10/M out) is twice as expensive as Gemini 2.5 Flash ($0.15; $0.60) but far cheaper than Claude Sonnet 4 ($3; $15).
1/10-20th is a more realistic ratio.
That was a pretty good back-to-reality check. There really isn't much of a market for expensive products; an inexpensive product with a few tradeoffs will probably have the advantage. Given how proficient China is at accessing technology resources, it seems likely to me that chip sanctions against them will not be effective.
This reminded me that the economies of scale in AI, especially inference, are huge.
When people say LLMs will be commoditised, I am not sure that means the market is going to be super competitive. As the economies of scale in AI get even bigger (larger training costs, batch inference, etc.), it seems likely that only around three companies will dominate LLMs.
For inference cost, I don't see how this is different from cloud providers vs dedicated server providers, where AWS is 5-10x more expensive than Hetzner. Somehow cloud providers manage to add a lot of extra cost to their offerings.
Isn't this an arbitrage opportunity? Offer to pay a fraction of the cost per token, accept that your tokens will only be processed when a batch has spare capacity, and then resell that for a markup to people who need non-time-sensitive inference?
You may have already noticed that many providers have separate, much lower, prices for offline inference.
MoE is in general kind of a stupid optimization. It seems to require around 5x more total parameters for the same modeling power as a dense model, in exchange for needing around 2x less memory bandwidth.
The primary win of MoE models seems to be that you can list an enormous parameter count in your marketing materials.
Stupid? By paying 5x (more typically 2-4x, but whatever) in a thing you don't care about at inference, you gain 2x in the primary thing you do care about at inference. It's like handing out 4 extra bricks and getting back an extra lump of gold.
The general rule of thumb when assessing MoE <-> dense model intelligence is sqrt(total_params * active_params). For Deepseek, you end up with ~158B params. The economics of batch inferencing a dense ~158B model at scale are different from something like Deepseek (it is ~4x more FLOPs per token, after all), particularly if users care about latency.
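The arithmetic, for the curious (using 671B total and 37B active parameters for DeepSeek V3/R1):

    import math

    total_params, active_params = 671e9, 37e9
    dense_equiv = math.sqrt(total_params * active_params)
    print(dense_equiv / 1e9)              # ~157.6B dense-equivalent params
    print(dense_equiv / active_params)    # ~4.3x the per-token FLOPs of the MoE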
It's not expensive to run locally at all when you consider how big GPT-4 is.
This statement holds true for all large-parameter open-weight models.
I am so sincerely amused that “we” figured out how to monetize LLMs from the jump using tokens.
It isn't tech for tech's sake, it's a money grab. Reminds me of paying to send a text message or buying minutes for a phone plan. Purely rent-seeking.
Can you explain how this is rent seeking? It seems to be straightforwardly not rent seeking.
1. Company develops model, invests in research, hardware, and software.
2. Company sells access to the model.
(1) is the step that makes this not rent seeking.
Rent seeking is when you profit from something you didn't earn - land rent, monopoly profits, protectionism.
That’s fair. My thought was, when there is an interesting new technology, it usually takes time to figure out how to monetize it. Figuring out how to monetize LLMs took no time at all.
It's likely that no one who makes base models is currently making money from LLMs. More likely, they're losing it at a crazy rate.
These prices are almost certainly "introductory offer" prices to get people/devs to integrate AI into their lives/workflows/products.
In a few years we will see what the actual cost is.
I don't think it's obvious that any of these model providers are even profitable right now. I'm also not sure what there is to "figure out": it's an expensive technology where the cost scales per token, so they charge per token. Would you rather they burned even more money giving it away for free until everyone was dependent on it, and then hyper-enshittified to try not to go broke like so much of the rest of tech?
My point, poorly made, was that I can run it myself for “free” without caring about tokens at all. Tokens are an artificial construct.