How to Run DeepSeek R1 671B Locally on a $2000 EPYC Server

1 year ago (digitalspaceport.com)

294 comments

walterbell

This runs the 671B model in Q4 quantization at 3.5-4.25 TPS for $2K on a single socket Epyc server motherboard using 512GB of RAM.

This [1] X thread runs the 671B model in the original Q8 at 6-8 TPS for $6K using a dual socket Epyc server motherboard using 768GB of RAM. I think this could be made cheaper by getting slower RAM but since this is RAM bandwidth limited that would likely reduce TPS. I’d be curious if this would just be a linear slowdown proportional to the RAM MHz or whether CAS latency plays into it as well.

[1] https://x.com/carrigmat/status/1884244369907278106?s=46&t=5D...

nielsole 1 year ago
I've been running the unsloth 200GB dynamic quantisation with 8k context on my 64GB Ryzen 7 5800G. CPU and iGPU utilization were super low, because it basically has to read the entire model from disk. (Looks like it needs ~40GB of actual memory that it cannot easily mmap from disk) With a Samsung 970 Evo Plus that gave me 2.5GB/s read speed. That came out at 0.15 tps Not bad for completely underspecced hardware.
Given the model has only so few active parameters per token (~40B), it is likely that just being able to hold it in memory absolve the largest bottleneck. I guess with a single consumer PCIe4.0x16 graphics card you could get at most 1tps just because of the PCIe transfer speed? Maybe CPU processing can be faster simply because DDR transfer is faster than transfer to the graphics card.
- TeMPOraL 1 year ago
  
  To add another datapoint, I've been running the 131GB (140GB on disk) 1.58 bit dynamic quant from Unsloth with 4k context on my 32GB Ryzen 7 2700X (8 cores, 3.70 GHz), and achieved exactly the same speed - around 0.15 tps on average, sometimes dropping to 0.11, tps occasionally going up to 0.16 tps. Roughly 1/2 of your specs, roughly 1/2 smaller quant, same tps.
  I've had to disable the overload safeties in LM Studio and tweak with some loader parameters to get the model to run mostly from disk (NVMe SSD), but once it did, it also used very little CPU!
  I tried offloading to GPU, but my RTX 4070 Ti (12GB VRAM) can take at most 4 layers, and it turned out to make no difference in tps.
  My RAM is DDR4, maybe switching to DDR5 would improve things? Testing that would require replacing everything but the GPU, though, as my motherboard is too old :/.
  
  3 replies →
- conor_mc 1 year ago
  
  I wonder if the, now abandoned, Intel Optane drives could help with this. They had very low latency, high IOPS, and decent throughput. They made RAM modules as well. A ram disk made of them might be faster.
  
  1 reply →
- smcleod 1 year ago
  
  I get around 4-5t/s with the unsloth 1.58bit quant on my home server that has 2x3090 and 192GB of DDR5 Ryzen 9, usable but slow.
  
  3 replies →
- baobun 1 year ago
  
  I imagine you can get more by striping drives. Depending on what chipset you have, the CPU should handle at least 4. Sucks that no AM4 APU supports PCIe 4 while the platform otherwise does.
geertj 1 year ago

> I’d be curious if this would just be a linear slowdown proportional to the RAM MHz or whether CAS latency plays into it as well.
Per o3-mini, the blocked gemm (matrix multiply) operations have very good locality and therefore MT/s should matter much more than CAS latency.
iwontberude 1 year ago
I have been doing this with an Epyc 7402 and 512GB of DDR4 and its been fairly performant, you dont have to wait very long to get pretty good results. It's still LLM levels of bad, but at least I dont have to pay $20/mo to OpenAI.
- whatevaa 1 year ago
  
  I don't think the cost of such machine will ever be a better than $20/mo, though. Capital costs are too high.
  
  2 replies →
3abiton 1 year ago
3x the price for less than 2x the speed increase. I don't think the price justifies the upgrade.
- phonon 1 year ago
  
  Q4 vs Q8.
  
  6 replies →
- bee_rider 1 year ago
  
  I mean, nothing ever actually scales linearly, right?
- TacticalCoder 1 year ago
  
  TFA says it can bump the spec to 768 GB but that it's then more like $2500 than $2000. At 768 GB that'd be the full, 8 bit, model.
  Seems indeed like a good price compared to $6000 for someone who wants to hack a build.
  I mean: $6 K is doable but I take it take many who'd want to build such a machine for fun would prefer to only fork $2.5K.
  
  1 reply →
plagiarist 1 year ago
Is there a source that unrolls that without creating an account?
- bhaak 1 year ago
  
  https://xcancel.com/carrigmat/status/1884244369907278106?s=4...
  
  1 reply →

isoprophlex 1 year ago

Online, R1 costs what, $2/MTok?

This rig does >4 tok/s, which is ~15-20 ktok/hr, or $0.04/hr when purchased through a provider.

You're probably spending $0.20/hr on power (1 kW) alone.

Cool achievement, but to me it doesn't make a lot of sense (besides privacy...)

rightbyte 1 year ago
> Cool achievement, but to me it doesn't make a lot of sense (besides privacy...)
I would argue that is enough and that this is awesome. It was a long time ago I wanted to do a tech hack like this much.
- isoprophlex 1 year ago
  
  Well thinking about it a bit more, it would be so cool if you could
  A) somehow continuously interact with the running model, ambient-computing style. Say have the thing observe you as you work, letting it store memories.
  B) allowing it to process those memories when it chooses to/whenever it's not getting any external input/when it is "sleeping" and
  C) (this is probably very difficult) have it change it's own weights somehow due to whatever it does in A+B.
  THAT, in a privacy friendly self-hosted package, i'd pay serious money for
  
  2 replies →
codetrotter 1 year ago
> doesn't make a lot of sense (besides privacy...)
Privacy is worth very much though.
- onlyrealcuzzo 1 year ago
  
  What privacy benefit do you get running this locally vs renting a baremetal GPU and running it there?
  Wouldn't that be much more cost-effective?
  Especially when you inevitably want to run a better / different model in the near future that would benefit from different hardware?
  You can get similar Tok/sec on a single RTX 4090 - which you can rent for <$1/hr.
  
  1 reply →
- infecto 1 year ago
  
  Definitely but when you can run this in places like Azure with tight contracts it makes little sense except for the ultra paranoid.
  
  15 replies →
jpc0 1 year ago
You could absolutely install 2kw of solar for probably around 2-4k and then at worst it turns your daytime usage into 0$. I also would be surprised if this was pulling 1kw in reality, I would want to see an actual measurement of what it is realistically pulling at the wall.
I believe it was an 850w PSU on the spec sheet?
- dboreham 1 year ago
  
  Quick note that solar power doesn't have zero cost.
  
  5 replies →
- killingtime74 1 year ago
  
  Marginal cost $0, 2kw solar + inverter + battery + install is worth more than this rig
  
  2 replies →
- sieabahlpark 1 year ago
  
  [dead]
rufus_foreman 1 year ago
Privacy, for me, is a necessary feature for something like this.
And I think your math is off, $0.20 per kWh at 1 kW is is $145 a month. I pay $0.06 per kWh. I've got what, 7 or 8 computers running right now and my electric bill for that and everything else is around $100 a month, at least until I start using AC. I don't think the power usage of something like this would be significant enough for me to even shut it off when I wasn't using it.
Anyway, we'll find out, just ordered the motherboard.
- michpoch 1 year ago
  
  > I pay $0.06 per kWh
  That is like, insanely cheap. In Europe I'd expect prices between $0.15 - 0.25 per kWh. $0.06 sounds like you live next to some solar farm or large hydro installation? Is that a total price, with transfer?
  
  1 reply →
- rodonn 1 year ago
  
  Depends on where you live. The average in San Francisco is $0.29 per kWh.
  
  1 reply →
magic_hamster 1 year ago

This gets you the (arguably) most powerful AI in the world running completely privately, under your control, in around $2000. There are many use cases for when you wouldn't want to send your prompts and data to a 3rd party. A lot of businesses have a data export policy where you are just not allowed to use company data anywhere but internal services. This is actually insanely useful.
api 1 year ago
How is it that cloud LLMs can be so much cheaper? Especially given that local compute, RAM, and storage are often orders of magnitude cheaper than cloud.
Is it possible that this is an AI bubble subsidy where we are actually getting it below cost?
Of course for conventional compute cloud markup is ludicrous, so maybe this is just cloud economy of scale with a much smaller markup.
- NeutralCrane 1 year ago
  
  My guess is two things:
  1. Economies of scale. Cloud providers are using clusters in the tens of thousands of GPUs. I think they are able to run inference much more efficiently than you would be able to in a single cluster just built for your needs.
  2. As you mentioned, they are selling at a loss. OpenAI is hugely unprofitable, and they reportedly lose money on every query.
  
  1 reply →
- thijson 1 year ago
  
  I think batch processing of many requests is cheaper. As each layer of the model is loaded into cache, you can put through many prompts. Running it locally you don't have that benefit.
- michpoch 1 year ago
  
  > Especially given that local compute, RAM, and storage are often orders of magnitude cheaper than cloud
  He uses old, much less efficient GPUs.
  He also did not select his living location based on the electricity prices, unlikely the cloud providers.
- realusername 1 year ago
  
  It's cheaper because you are unlikely to run your local AI at top capacity 24/7 so you have unused capacity which you are paying for.
  
  2 replies →
- octacat 1 year ago
  
  It is shared between users and better utilized and optimized.
  
  4 replies →
- agieocean 1 year ago
  
  Isn't that just because they can get massive discounts on hardware buying in bulk (for lack of a proper term) + absorb losses?
  
  1 reply →
matja 1 year ago
How would it use 1kW? Socket SP3 tops at 280W and the system in the article has a 850W PSU so I'm not sure what I'm missing.
- falcor84 1 year ago
  
  I assume that the parent just rounded 850W up to 1kW, no?
  
  1 reply →
topbanana 1 year ago

The point is running locally, not efficiently
onlyrealcuzzo 1 year ago

> You're probably spending $0.20/hr on power (1 kW) alone.
For those that aren't following - means you're spending ~$10/MTok on power alone (compared to $2/MTok hosted).
bloomingkales 1 year ago
"besides privacy"
lol.
Yeah, just besides that one little thing. We really are a beaten down society aren't we.
- Aurornis 1 year ago
  
  Most people value privacy, but they’re practical about it.
  The odds of a cloud server leaking my information is non-zero, but it’s very small. A government entity could theoretically get to it, but they would be bored to tears because I have nothing of interest to them. So practically speaking, the threat surface of cloud hosting is an acceptable tradeoff for the speed and ease of use.
  Running things at home is fun, but the hosted solutions are so much faster when you actually want to get work done. If you’re doing some secret sensitive work or have contract obligations then I could understand running it locally. For most people, trying to secure your LLM interactions from the government isn’t a priority because the government isn’t even interested.
  Legally, the government could come and take your home server too. People like to have fantasies about destroying the server during a raid or encrypting things, but practically speaking they’ll get to it or lock you up if they want it.
  
  2 replies →
- 7thpower 1 year ago
  
  There is something about this comment that is so petty that I had to re-read it. Nice dunk, I guess.
- brookst 1 year ago
  
  Privacy is a relatively new concept, and the idea that individuals are entitled to complete privacy is a very new and radical concept.
  I am as pro-privacy as they come, but let’s not pretend that government and corporate surveillance is some wild new thing that just appeared. Read Horace’s Satires for insight into how non-private private correspondence often was in Ancient Rome.
  
  1 reply →
WiSaGaN 1 year ago

I think the main point of local model is privacy set aside hobby and tinkering.
carbonbioxide 1 year ago

I think the privacy should be the whole point. There's always a price to pay. I'm optimistic that soon you'll be able to get better speeds with less hardware.
ekianjo 1 year ago

> (besides privacy...)
that's the whole point of local models
spaceport 1 year ago

The system idles at 60w and running hits 260w.
jaggs 1 year ago

I think you may be underestimating future enshitification? (e.g. it's going to be trivially easy for the cloud suppliers to cram ads into all the chat responses at will).

huijzer 1 year ago

What is a bit weird about AI currently is that you basically always want to run the best model, but the price of the hardware is a bit ridiculous. In the 1990s, it was possible to run Linux on scrappy hardware. You could also always run other “building blocks” like Python, Docker, or C++ easily.

But the newest AI models require an order of magnitude more RAM than my system or the systems I typically rent have.

So I’m curious to people here, has this in the history of software happened before? Maybe computer games are a good example. There people would also have to upgrade their system to run the latest games.

spamizbad 1 year ago
Like AI, there were exciting classes of applications in the 70s, 80s and 90s that mandated pricier hardware. Anything 3D related, running multi-user systems, higher end CAD/EDA tooling, and running any server that actually got put under “real” load (more than 20 users).
If anything this isn’t so bad: $4K in 2025 dollars is an affordable desktop computer from the 90s.
- lukeschlather 1 year ago
  
  The thing is I'm not that interested in running something that will run on a $4K rig. I'm a little frustrated by articles like this, because they claim to be running "R1" but it's a quantized version and/or it has a small context window... it's not meaningfully R1. I think to actually run R1 properly you need more like $250k.
  But it's hard to tell because most of the stuff posted is people trying to do duct tape and bailing wire solutions.
  
  2 replies →
- handzhiev 1 year ago
  
  Indeed, even design and prepress required quite expensive hardware. There was a time when very expensive Silicone Graphics workstations were a thing.
Keyframe 1 year ago

Of course it has. Coughs in SGI and advanced 3D and video software like PowerAnimator, Softimage, Flame. Hardware + software combo starting around 60k of 90's dollars, but to do something really useful with it you'd have to enter 100-250k of 90's dollars range.
tarruda 1 year ago
> What is a bit weird about AI currently is that you basically always want to run the best model,
I think the problem is thinking that you always need to use the best LLM. Consider this:
- When you don't need correct output (such as when writing a blog post, there's no right/wrong answer), "best" can be subjective.
- When you need correct output (such as when coding), you always need to review the result, no matter how good the model is.
IMO you can get 70% of the value of high end proprietary models by just using something like Llama 8b, which is runnable on most commodity hardware. That should increase to something like 80% - 90% when using bigger open models such as the newly released "mistral small 3"
- lukeschlather 1 year ago
  
  With o1 I had a hairy mathematical problem recently related to video transcoding. I explained my flawed reasoning to o1, and it was kind of funny in that it took roughly the same amount of time to figure out the flaw in my reasoning, but it did, and it also provided detailed reasoning with correct math to correct me. Something like Llama 8b would've been worse than useless. I ran the same prompt by ChatGPT and Gemini, and both gave me sycophantic confirmation of my flawed reasoning.
  > When you don't need correct output (such as when writing a blog post, there's no right/wrong answer), "best" can be subjective.
  This is like, everything that is wrong with the Internet in a single sentence. If you are writing a blog post, please write the best blog post you can, if you don't have a strong opinion on "best," don't write.
  
  1 reply →
- lurking_swe 1 year ago
  
  for coding insights / suggestions as you type, similar to copilot, i agree.
  for rapidly developing prototypes or working on side projects, i find llama 8b useless. it might take 5-6 iterations to generate something truly useful. compared to say 1-shot with claude sonnet 3.5 or open ai gpt-4o. that’s a lot less typing and time wasted.
NegativeK 1 year ago
I'm not sure Linux is the best comparison; it was specifically created to run on standard PC hardware. We have user access to AI models for little or no monetary cost, but they can be insanely expensive to run.
Maybe a better comparison would be weather simulations in the 90s? We had access to their outputs in the 90s but running the comparable calculations as a regular Joe might've actually been impossible without a huge bankroll.
- bee_rider 1 year ago
  
  Or 3D rendering, or even particularly intense graphic design-y stuff I think, right? In the 90’s… I mean, computers in the $1k-$2k range were pretty much entry level, right?
detourdog 1 year ago

The early 90's and digital graphic production. Computer upgrades could make intensive alterations interactive. This was true of photoshop and excel. There were many bottle necks to speed. Upgrade a network of graphic machines from 10mbit networking to 100mbit did wonders for server based workflows.
evilduck 1 year ago
Adjusting for inflation, $2000 is about the same price as the first iMac, an entry level consumer PC at the time. Local AI is still pretty accessible to hobbyist level spending.
- diffeomorphism 1 year ago
  
  Not adjusting at all, this is not "entry level" but rather "enthusiast"
  https://www.logicalincrements.com/
  Still accessible but only for dedicated hobbyists with deeper pockets.
svilen_dobrev 1 year ago

well, if there was e.g. a model trained for coding - i.e. specialization as such, having models trained mostly for this or that - instead of everything incl. Shakespeare, the kitchen sink and the cockroaches biology under it, that would make those runable on much low level hardware. But there is only one, The-Big-Deal.. in many incarnations.
ant6n 1 year ago

Read “masters of doom”, they go into quite some detail on how Carmack got himself a very expensive work station to develop Doom/Quake.
holoduke 1 year ago

We finally enter an era where the demand for more memory is really needed. Small local ai models will be used for many things in the near future. Requiring lots of memory. Even phones will be in the need for terabytes of fast memory in the future.
Loic 1 year ago

In the 90's it was really expensive to run 3D Studio or POVray. It could take days to render a single image. Silicon Graphics workstations could do it faster but were out of the budget of non professionals.
qingcharles 1 year ago

Raytracing decent scenes was a big CPU hog in the 80s/90s for me. I'd have to leave single frames running overnight.
cactusplant7374 1 year ago
How were you running Docker in the 1990s?
- mdp2021 1 year ago
  
  > you basically always want to run the best model, but the price of the hardware is a bit ridiculous. In the 1990s, it was possible to run Linux on scrappy hardware. You could also always run other “building blocks” like Python, Docker, or C++ easily
  = "When you needed to run common «building blocks» (such as, in other times, «Python, Docker, or C++» - normal fundamental software you may have needed), even scrappy hardware would suffice in the '90s"
  As a matter of facts, people would upgrade foremostly for performance.
- buescher 1 year ago
  
  Heh. I caught that too, and was going to say "I totally remember running Docker on Slackware on my 386DX40. I had to upgrade to 8MB of RAM. Good times."

notsylver 1 year ago

I think it would be more interesting doing this with smaller models (33b-70b) and see if you could get 5-10 tokens/sec on a budget. I've desperately wanted something locally thats around the same level of 4o, but I'm not in a hurry to spend $3k on an overpriced GPU or $2k on this

gliptic 1 year ago
Your best bet for 33B is already having a computer and buying a used RTX 3090 for <$1k. I don't think there's currently any cheap options for 70B that would give you >5. High memory bandwidth is just too expensive. Strix Halo might give you >5 once it comes out, but will probably be significantly more than $1k for 64 GB RAM.
- ants_everywhere 1 year ago
  
  With used GPUs do you have to be concerned that they're close to EOL due to high utilization in a Bitcoin or AI rig?
  
  3 replies →
- pmarreck 1 year ago
  
  M4 Mac with unified GPU RAM
  Not very cheap though! But you get a quite usable personal computer with it...
  
  1 reply →
- jjallen 1 year ago
  
  How does inference happen on a GPU with such limited memory compared with the full requirements of the model? This is something I’ve been wondering for a while
  
  2 replies →
- ynniv 1 year ago
  
  Umm, two 3090's? Additional cards scale as long as you have enough PCIe channels.
  
  1 reply →
api 1 year ago

Apple M chips with their unified GPU memory are not terrible. I have one of the first M1 Max laptops with 64G and it can run up to 70B models at very useful speeds. Newer M series are going to be faster and they offer more RAM now.
Are there any other laptops around other than the larger M series Macs that can run 30-70B LLMs at usable speeds that also have useful battery life and don’t sound like a jet taxiing to the runway?
For non-portables I bet a huge desktop or server CPU with fast RAM beats the Mac Mini and Studio for price performance, but I’d be curious to see benchmarks comparing fast many core CPU performance to a large M series GPU with unified RAM.
jenny91 1 year ago
As a data point: you can get an RTX 3090 for ~$1.2k and it runs deepseek-r1:32b perfectly fine via Ollama + open webui at ~35 tok/s in an OpenAI-like web app and basically as fast as 4o.
- kevinak 1 year ago
  
  You mean Qwen 32b fine-tuned on Deepseek :)
  There is only one model of Deepseek (671b), all others are fine-tunes of other models
- driverdan 1 year ago
  
  > you can get an RTX 3090 for ~$1.2k
  If you're paying that much you're being ripped off. They're $800-900 on eBay and IMO are still overpriced.
bick_nyers 1 year ago

It will be slower for a 70b model since Deepseek is an MoE that only activates 37b at a time. That's what makes CPU inference remotely feasible here.
firtoz 1 year ago
Would it be something like this?
> OpenAI's nightmare: DeepSeek R1 on a Raspberry Pi
https://x.com/geerlingguy/status/1884994878477623485
I haven't tried it myself or haven't verified the creds, but seems exciting at least
- gliptic 1 year ago
  
  That's 1.2 t/s for the 14B Qwen finetune, not the real R1. Unless you go with the GPU with the extra cost, but hardly anyone but Jeff Geerling is going to run a dedicated GPU on a Pi.
- etra0 1 year ago
  
  it's using a Raspberry Pi with a.... USD$1k gpu, which kinda defeat the purpose of using the RPI in the first place imo.
  or well, I guess you save a bit on power usage.
  
  2 replies →
spaceport 1 year ago

I put together a $350 build with a 3060 12GB and its still my favorite build. I run llama 3.2 11b q4 on it and its a really efficient way to get started and the tps is great.
Svoka 1 year ago

You can run smaller models on MacbookPro with ollama with those speeds. Even with several 3k GPUs it won't come close to 4o level.

spaceport 1 year ago

Hi HN, Garage youtuber here. Wanted to add in some stats on the wattages/ram.

Idle wattage: 60w (well below what I expected, this is w/o GPUs plugged in)

Loaded wattage: 260w

RAM Speed I am running currently: 2400 (V likely 3200 has a decent perf impact)

brunohaid 1 year ago

Still surprised that the $3000 NVIDIA Digits doesn’t come up more often in that and also the gung-ho market cap discussion.

I was an AI sceptic until 6 months ago, but that’s probably going to be my dev setup from spring onwards - running DeepSeek on it locally, with a nice RAG to pull in local documentation and datasheets, plus a curl plugin.

https://www.nvidia.com/en-us/project-digits/

fake-name 1 year ago
It'll probably be more relevant when you can actually buy the things.
It's just vaporware until then.
- brunohaid 1 year ago
  
  Call me naive, but I somehow trust them to deliver in time/specs?
  It’s also a more general comment around „AI desktop appliance“ vs homebuilts. I’d rather give NVIDIA/AMD $3k for a well adjusted local box than tinkering too much or feeding the next tech moloch, and have a hunch I’m not the only one feeling that way. Once it’s possible of course.
  
  2 replies →
- fulafel 1 year ago
  
  Also, LPDDR memory, and no published bandwidth numbers.
  
  2 replies →
- ganoushoreilly 1 year ago
  
  and people are missing the "Starting at" price. I suspect the advertised specs will end up more than $3k. If it comes out at that price, i'm in for 2. But I'm not holding my breath given Nvidia and all.
  
  1 reply →
ranguna 1 year ago
I'm not sure you can fit a decent quant of R1 in digits, 128 GB of memory is not enough for 8 and I'm not sure of 4 but I have my doubts. So you might have to go for around 1, which has a significant quality loss.
- Cane_P 1 year ago
  
  You can connect two, and get 256 GB. But it will still not be enough to run it in native format. You will still need to use lower quant.
diffeomorphism 1 year ago
The webpage does not say $3000 but starting at $3000. I am not so optimistic that the base model will actually be capable of this.
- Cane_P 1 year ago
  
  They won't have different models, in any other ways than if you want more storage (up to 4 TB, we don't know the lowest they will sell) and cabling necessary for connecting two DIGITS (it won't be included in the box).
  We already know that it is going to be one single CPU and GPU and fixed memory. The GPU is most likely the RTX 5070 Ti laptop model (992 TFLOPS, clocked 1% higher to get 1 PFLOP).
yapyap 1 year ago

probably because nvidia digits is just a concept rn

christophilus 1 year ago

Aside: it’s pretty amazing what $2K will buy. It’s been a minute since I built my desktop, and this has given me the itch to upgrade.

Any suggestions on building a low-power desktop that still yields decent performance?

Havoc 1 year ago

>Any suggestions on building a low-power desktop that still yields decent performance?
You don't for now. The bottleneck is mem throughput. That's why people using CPU for LLM are running xeon-ish/epyc setups...lots of mem channels.
The APU class gear along the lines of Halo Strix is probably the path closest to lower power but it's not going to do 500gb of ram and still doesn't have enough throughput for big models
spaceport 1 year ago

Not to be that yt'r that shills my videos all over, but you did ask for a low powered desktop build and this $350 one I put together is still my favorite. The 3060 12GB with llama 3.2 vision 11b is a very fun box that is low idle power (intel rules) to leave on 24/7 and have it run some additional services like HA.
https://youtu.be/iflTQFn0jx4
baobun 1 year ago

Hard to know what ranges you have in mind with "decent performance" and "low-power".
I think your best bet might be a Ryzen U-series mini PC. Or perhaps an APU barebone. The ATX platform is not ideal from a power-efficiency perspective (whether inherently or from laziness or conspiracy from mobo and PSU makers, I do not know). If you want the flexibility or scale, you pay the price of course but first make sure it's what you want. I wouldn't look at discrete graphics unless you have specific needs (really high-end gaming, workstation, LLMs, etc) - the integrated graphics of last few years can both drive your 4k monitors and play recent games at 1080p smoothly, albeit perhaps not simultaneously ;)
Lenovo Tiny mq has some really impressive flavors (ECC support at the cost of CPU vendor-lock on PRO models) and there's the whole roster of Chinese competitors and up-and-comers if you're feeling adventerous. Believe me you can still get creative if you want to scratch the builder itch - thermals is generally what keeps these systems from really roaring (:

jbritton 1 year ago

Does it make any sense to have specialized models, which could possibly be a lot smaller. Say a model that just translates between English and Spanish, or maybe a model that just understands unix utilities and bash. I don’t know if limiting the training content affects the ultimate output quality or model size.

walterbell 1 year ago

Some enterprises have trained small specialized models based on proprietary data.
https://www.maginative.com/article/nvidia-leverages-ai-to-as...
> NVIDIA researchers customized LLaMA by training it on 24 billion tokens derived from internal documents, code, and other textual data related to chip design. This advanced “pretraining” tuned the model to understand the nuances of hardware engineering. The team then “fine-tuned” ChipNeMo on over 1,000 real-world examples of potential assistance applications collected from NVIDIA’s designers.
2023 paper, https://research.nvidia.com/publication/2023-10_chipnemo-dom...
> Our results show that these domain adaptation techniques enable significant LLM performance improvements over general-purpose base models across the three evaluated applications, enabling up to 5x model size reduction with similar or better performance on a range of design tasks.
2024 paper, https://developer.nvidia.com/blog/streamlining-data-processi...
> Domain-adaptive pretraining (DAPT) of large language models (LLMs) is an important step towards building domain-specific models. These models demonstrate greater capabilities in domain-specific tasks compared to their off-the-shelf open or commercial counterparts.

lhl 1 year ago

Last fall I built a new workstation with an EPYC 9274F (24C Zen4 4.1-4.3GHz, $2400), 384GB 12 x 32GB DDR5-4800 RDIMM ($1600), and a Gigabyte MZ33-AR0 motherboard. I'm slowly populating with GPUs (including using C-Payne MCIO gen5 adapters), not focused on memory, but I did spend some time recently poking at it.

I spent extra on the 9274F because of some published benchmarks [1] that showed that the 9274F had STREAM TRIAD results of 395 GB/s (on 460.8 GB/s of theoretical peak memory bandwidth), however sadly, my results have been nowhere near that. I did testing with LIKWID, Sysbench, and llama-bench, and even w/ an updated BIOS and NUMA tweaks, I was getting <1/2 the Fujitsu benchmark numbers:

  Results for results-f31-l3-srat:
  {
      "likwid_copy": 172.293857421875,
      "likwid_stream": 173.132177734375,
      "likwid_triad": 172.4758203125,
      "sysbench_memory_read_gib": 191.199125,
      "llama_llama-2-7b.Q4_0": {
          "tokens_per_second": 38.361456,
          "model_size_gb": 3.5623703002929688,
          "mbw": 136.6577115303955
      }
  }

For those interested in all the system details/running their own tests (also MLC and PMBW results among others): https://github.com/AUGMXNT/speed-benchmarking/tree/main/epyc...

[1] https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-perfor...

menaerus 1 year ago
Assuming that you populated the channels correctly, which I believe you did, I can only think that this issue could be related to the motherboard itself or RAM. I think you could start by measuring the single-core RAM bandwidth and latency.
Since the CPU is clocked quite high, figures you should be getting are I guess around ~100ns, but probably less than that, and 40-ish GB/s of BW. If those figures do not match then it could be either a motherboard (HW) or BIOS (SW) issue or RAM stick issue.
If those figures closely match then it's not a RAM issue but a motherboard (BIOS or HW) and you could continue debugging by adding more and more cores to the experiment to understand at which point you hit the saturation point for the bandwidth. It could be a power issue with the mobo.
- lhl 1 year ago
  
  Yeah, that channels are populated correctly. As you can see from the mlc-results.txt, the latency looks fine:
  mlc --idle_latency Intel(R) Memory Latency Checker - v3.11b Command line parameters: --idle_latency Using buffer size of 1800.000MiB Each iteration took 424.8 base frequency clocks ( 104.9 ns)
  As does the per-channel --bandwidth_matrix results:
  Numa node Numa node 0 1 2 3 4 5 6 7 0 45999.8 46036.3 50490.7 50529.7 50421.0 50427.6 50433.5 52118.2 1 46099.1 46129.9 52768.3 52122.3 52086.5 52767.6 52122.6 52093.4 2 46006.3 46095.3 52117.0 52097.2 50385.2 52088.5 50396.1 52077.4 3 46092.6 46091.5 52153.6 52123.4 52140.3 52134.8 52078.8 52076.1 4 45718.9 46053.1 52087.3 52124.0 52144.8 50544.5 50492.7 52125.1 5 46093.7 46107.4 52082.0 52091.2 52147.5 52759.1 52163.7 52179.9 6 45915.9 45988.2 50412.8 50411.3 50490.8 50473.9 52136.1 52084.9 7 46134.4 46017.2 52088.9 52114.1 52125.0 52152.9 52056.6 52115.1
  I've tried various NUMA configurations (from 1 domain to a per-CCD config) and it doesn't seem to make much difference.
  Updating from the board-delivered F14 to the latest 9004 F31 BIOS (the F33 releases bricked the board and required using a BIOS flasher for manual recover) gave marginal (5-10%) improvement, but nothing major.
  While 1DPC, the memory is 2R (but still registers at 4800), training on every boot. The PMBW graph is probably the most useful behavior chart: https://github.com/AUGMXNT/speed-benchmarking/blob/main/epyc...
  Since I'm not so concerned with CPU inference, I feel like the debugging/testing I've done is... the amount I'm going to do, which is enough to at least characterize, if not fix the performance.
  I might write up a more step-by-step guide at some point to help others but for now the testing scripts are there - I think most people who are looking at theoretical MBW should probably do their own real-world testing as it seems to vary a lot more than GPU bandwidth.
  
  8 replies →

easygenes 1 year ago

This is neat, but what I really want to see is someone running it on 8x 3090/4090/5090 and what is the most practical configuration for that.

gatienboquet 1 year ago

According to NVIDIA. > a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second.
You can rent a single H200 for 3$/hour.
MaxikCZ 1 year ago

I have been searching for a single example of someone running it like this (or 8x P40 and alike), and found nothing..
deoxykev 1 year ago
8x 3090 will net you around 10-12tok/s
- bick_nyers 1 year ago
  
  It would not be that slow as it is an MoE model with 37b activated parameters.
  Still, 8x3090 gives you ~2.25 bits per weight, which is not a healthy quantization. Doing bifurcation to get up to 16x3090 would be necessary for lightning fast inference with 4bit quants.
  At that point though it becomes very hard to build a system due to PCIE lanes, signal integrity, the volume of space you require, the heat generated, and the power requirements.
  This is the advantage of moving up to Quadro cards, half the power for 2-4x the VRAM (top end Blackwell Quadro expected to be 96GB).
  
  1 reply →
- rdlw 1 year ago
  
  Is it possible that eight graphics cards is the most practical configuration? How do you even set that up? I guess server mobos have crazy numbers of PCIe slots?

pama 1 year ago

What is the fastest documented way so far to serve the full R1 or V3 models (Q8, not Q4) if the main purpose is inference with many parallel queries and maximizing the total tokens per sec? Did anyone document and benchmark efficient distributed service setups?

manmal 1 year ago

The top comment in this thread mentions a 6k setup, which likely could be used with vLLM with more tinkering. AFAIK vLLM‘s batched inference is great.
snovv_crash 1 year ago

You need enough VRAM to hold the whole thing plus context. So probably a bunch of H100s, or MI300s.

jonotime 1 year ago

I'm also kind of new to this and coming from coding with ChatGPT. Isnt the time to first token important? He is sitting there for minutes waiting for a response. Shouldnt that be a concern?

HarHarVeryFunny 1 year ago
I'd rather wait to get a good response, than get a quick response that is much less useful, and it's the nature of these "reasoning" models that they reason before responding.
Yesterday I was comparing DeepSeek-R1 (NVidia hosted version) with both Sonnet 3.5 (regarded by most as most capable coder) and the new Gemini 2.0 flash, and the wait was worth it. I was trying to get all three to create a web page with a horizontally scrolling timeline with associated clickable photos...
Gemini got to about 90% success after half a dozen prompts, after which it became a frustrating game of whack-a-mole trying to get it to fix the remaining 10% without introducing new bugs - I gave up after ~30min. Sonnet 3.5 looked promising at first, generating based on a sketch I gave it, but also only got to 90%, then hit daily usage limit after a few attempts to complete the task.
DeepSeek-R took a while to generate it, but nailed it on first attempt.
- jonotime 1 year ago
  
  Interesting. So in my use, I rarely see gpt get it right on the first pass but thats mostly due to interpretation of the question. I'm ruling out the times when it hallucinates calls to functions that dont exist.
  Lets say I ask for some function that calculates some matrix math in python. It will spit out something but I dont like what it did. So I will say, now dont us any calls to that library you pulled in, and also allow for these types of inputs. Add exception handling...
  So response time is important since its a conversation, no matter how correct the response is.
  When you say deep seek "nailed it on the first attempt" do you mean it was without bugs? Or do you mean it worked how you imagined? Or what exactly?
  
  1 reply →
- lukeschlather 1 year ago
  
  The alternative isn't to use a weaker model, the alternative is to solve the problem myself. These are all very academically interesting, but they don't usually save any time. On the other hand, the other day I had a math problem I asked o1 for help with, and it was barely worth it. I realized my problem at the exact moment it gave me the correct answer. I say that because these high-end reasoning models are getting better. "Barely useful" is a huge deal and it seems like we are hitting the inflection point where expensive models are starting to be consistently useful.
  
  1 reply →

teruakohatu 1 year ago

If you are going to go to that effort, adding a second NMVE drive, and doing RAID 0 across them, will improve the speed of getting the model into RAM.

merb 1 year ago
than you will way more memory, since you can only do software raid a nvme drive.
- mgerdts 1 year ago
  
  I’ve found that striping across two drives like the 980 Pro described here or WD SN850 Black drives easily gets direct IO read speeds over 12 GB/s on threadripper pro systems. This assumes a stripe size somewhere around 1 -2 MiB. This means that most reads will not need to be split and high queue depth sequential reads and random reads keep both drives busy. With careful alignment of IOs, performance approaches 2x of one drive’s performance.
  IO takes CPU cycles but I’ve not seen evidence that striping impacts that. Memory overhead is minimal, as the stripe to read from is done via simple math from a tiny data structure.
  
  1 reply →
- davidmurdoch 1 year ago
  
  About how much memory overhead would that require?

buyucu 1 year ago

sylware 1 year ago

Well, I read this, now I am sure: as of today, deepseek handling of LLMs is the less wrong, and by far.

ptsneves 1 year ago

Went through the steps and ran it on a similar r6a.16xlarge and the model seems to only load after the first prompt. After that it takes maybe more than half an hour trying to load the model and still no answer. The context size in the post is also not validated in my experiment with the above. With 512GB of ram I cannot use more than 4k context size without the model outright refusing to load. I am new to model setups so I might have missed something.

jauntywundrkind 1 year ago

Is there any hope that desktops get 64GB dimms? 48GB dimms have been a nice boost but are we gonna get more anytime soon?

I'd love so much if quad-channel Strix Halo could get up to 256gb of memory. 192GB (4x48) won't be too bad, and at 8533MT/s, should provide competitive-ish throughput to these massive Epyc systems. Of course the $6k 24 channel 3200MHz is 4.4x more throughput but it does have a field of dimms to get there, and high power consumption.

zajio1am 1 year ago

> 512GB 2400 ECC RAM $400

Is this really that cheap? Looking at several local (CZ) eshops, i cannot find 32 GB DDR4 ECC RDIMM cheaper than $75, which will be $1200 for 512 GB.

formerly_proven 1 year ago
Used server hardware is much more expensive in the EU generally, because the market is much smaller (fewer data centers to begin with, longer cycles to reduce costs and EU WEEE mandatory scrapping instead of reuse).

crispyambulance 1 year ago

Kind of embarrassed to ask, I use AI a lot, I haven't really understood how the nuts and bolts work (other than at a 5th-grader 30000ft level)...

So, when I use a "full" AI like chatGPT4o, I ask it questions and it has a firm grip on a vast amount of knowledge, like, whole-internet/search-engine scope knowledge.

If I run an AI "locally", on even a muscular server, it obviously does NOT have vast amounts of stored information about everything. So what use is it to run locally? Can I just talk to it as though it were a very smart person who, tragically, knows nothing?

I mean, I suppose I could point it to a NAS box full of pdf's and ask questions about that narrow range of knowledge, or maybe get one of those downloaded wikipedia stores. Is that what folks are doing? It seems like you would really need a lot content for the AI to even be remotely useable like the online versions.

Workaccount2 1 year ago

Running it locally it will still have the vast/"full Internet" knowledge.
This is probably one of the most confusing things about LLMs. They are not vast archives of information and the models do not contain petabytes of copied data.
This is also why LLMs are so often wrong. They work by association, not by recall.
jodrellblank 1 year ago
Try one and find out. Look at https://github.com/Mozilla-Ocho/llamafile/ Quickstart section; download a single cross-platform ~3.7GB file and execute it, it starts a local model, local webserver, and you can query it.
See it demonstrated in a <7 minute video here: https://www.youtube.com/watch?v=d1Fnfvat6nM
The video explains that you can download the larger models on that Github page and use them with other command line parameters, and shows how you can get a Windows + nVidia setup to GPU accelerate the model (install CUDA and MSVC / VS Community edition with C++ tools, run for the first time from MSVC x64 command prompt so it can build a thing using cuBLAS, rerun normally with "-ngl 35" command line parameter to use 3.5GB of GPU memory (my card doesn't have much)).
- jodrellblank 1 year ago
  
  GPU bits have changed! I just noticed in the video description:
  "IMPORTANT: This video is obsolete as of December 26, 2023 GPU now works out of the box on Windows. You still need to pass the -ngl 35 flag, but you're no longer required to install CUDA/MSVC."
  So that's convenient.
7thpower 1 year ago

The LLMs have the ‘knowledge’ baked in, one of the things you will hear about are quantized models with lower precision (think 16-bit -> 4-bit) weights, which enables them to be run on greater variety of hardware and/or with greater performance.
When you quantize, you sacrifice model performance. In addition, a lot of the models favored for local use are already very small (7b, 3b).
What OP is pointing out is that you can actually run the full deepseek r1 model, along with all of the ‘knowledge’ on relatively modest hardware.
Not many people want to make that tradeoff when there are cheap, performant APIs around but for a lot of people who have privacy concerns or just like to tinker, it is pretty big deal.
I am far removed from having a high performance computer (although I suppose my MacBook is nothing to sneeze at), but I remember building computers or homelabs back in the day and then being like ‘okay now what is the most stressful workload I can find?!’ — this is perfect for that.
throwaway3b03 1 year ago

I've also been away from the tech (and AI scene) for a few years now. And I mostly stayed away from LLMs. But I'm certain that all the content is baked into the model, during training. When you query the model locally (since I suppose you don't train it yourself), you get all that knowledge that's baked into the model weights.
So I would assume the locally queried output to be comparable with the output you get from an online service (they probably use slightly better models, I don't think they release their latest ones to the public).
fbn79 1 year ago

It's all in the model. If you look for a good definition of "intellingence" that is compression. You can see ZIP algorithm as a primordial antenate of Chatgpt :))
londons_explore 1 year ago

Most of an AI's knowledge is inside the weights, so when you run it locally, it has all that knowledge inside!
Some AI services allow the use of 'tools', and some of those tools can search the web, calculate numbers, reserve restaurants, etc. However, you'll typically see it doing that in the UI.
Local models can do that too, but it's typically a bit more setup.
baq 1 year ago

LLMs are to a good approximation zip files intertwined with... magic... that allows the compressed data to be queried with plain old English - but you need to process all[0] the magic through some special matrix mincers together with the query (encoded as a matrix, too) to get an answer.
[0] not true but let's ignore that for a second
csnweb 1 year ago
The knowledge is stored in the model, the one mentioned here is rather large, the full version needs over 700GB of disk space. Most people use compressed versions, but even those will often be 10-30GB in size.
- crispyambulance 1 year ago
  
  ...the full version needs over 700GB of disk space.
  THAT is rather shocking. Vastly smaller than I would expect.
MuffinFlavored 1 year ago
I ask this all the time. locally running an LLM seems super hobbyist to me. like tweaking terminal font sizes on fringe BSD distros kind of thing
- noman-land 1 year ago
  
  Privacy.

PopSmoke 1 year ago

is there a way for deepseek to use gpu vram from 10 3070s? i have a lan setup of 10 pcs

ant6n 1 year ago

Ha, I'd recently asked about this here as well, just using some high memory AMD setup to infer.

Another thing I wonder is whether using a bunch of Geforce 4060 TI with 16GB could be useful - they cost only around 500 EUR. If VRAM is the bottleneck, perhaps a couple of these could really help with inference (unless they become GPU bound, like too slow).

cma 1 year ago

He's running quantized Q4 671b. However, MoE doesn't need cluster networking so you could probably run the full thing on two of them unquantized. Maybe the router could be all resident in GPU RAM instead of in contrast offloading a larger percentage of everything there, or is that already how it is set up in his gpu offload config?

rahimnathwani 1 year ago

Wow! Only $2k with no quantization.

  hit between 4.25 to 3.5 TPS (tokens per second) on the Q4 671b full model

kristianp 1 year ago
I think it is quantised, they actually said no distillation.
- rahimnathwani 1 year ago
  
  I think you're right. The instructions say
  ollama pull deepseek-r1:671b
  This will pull down 400GB: https://ollama.com/library/deepseek-r1:671b
  But the Huggingface repo has 163 files of ~4.3GB each, so around 700GB: https://huggingface.co/deepseek-ai/DeepSeek-R1/tree/main

zelcon 1 year ago

Or just wait for the NVIDIA Digits PC later this year which will cost the ~same amount and can fit on your desk

dzogchen 1 year ago
That one can handle up to 200B parameters according to NVIDIA.
- zelcon 1 year ago
  
  That's a shame. I suppose you'll need 4 of them with RDMA to run a 671B, but somehow that seems better to me than trying to run it on DDR4 RAM like the OP is saying. I have a system with 230G of DDR4 RAM, and running even small models on it is atrociously slow.

jeswin 1 year ago

Maybe Intel or AMD should bring back a Larrabee style CPU which can use say 48 sockets of DDR/CAMM2 sticks.

ilaksh 1 year ago

Has anyone run any benchmarks on the quantized (non-distilled) versions?

waltercool 1 year ago

Interesting and surprising to see, but 3-4 t/s is not practical overall.

Anything lower than 10 t/s is going to give lot of wait time considering the reasoning wait time

daft_pink 1 year ago

Is it possible to run the large Llama 3 models on this as well? Just curious if this is a one trick pony or not.

almostgotcaught 1 year ago

Those EPYCs he's advertising as $700 are either engineering samples or used. New he's off by 2:1.

waynenilsen 1 year ago

Someone please productize this

loxias 1 year ago
"Okay"
Can you describe for me what the product does?
It's not that hard to make a turnkey "just add power" appliance that does nothing but spit out tokens. Some sort of "ollama appliance", which just sits on your network and provides LLM functionality for your home lab?
But beyond that, what would your mythical dream product do?
- waynenilsen 1 year ago
  
  i'm talking about something you plug in and get on your network then you can chat with it with a typical chat interface and it should expose an openai compatible api
  automatic updates etc etc
  the AI home appliance
  great for privacy
gatienboquet 1 year ago

https://tinygrad.org/#tinybox

assimpleaspossi 1 year ago

Are there any security concerns over DeepSeek as there are over TikTok?

Kuinox 1 year ago
It's a local model, what security concern would you have ?
- jappgar 1 year ago
  
  I think this is unlikely but a local model could generate malicious code. You would have to run it manually though.
- dinosaurdynasty 1 year ago
  
  They also have an app that connects to their datacenter with R1.
  Also barely anyone can actually run the real R1 locally.
  
  1 reply →
Svoka 1 year ago

Local model - no. Using deepseek.com - absolutely. Do not put anything private there.

la64710 1 year ago

Basically three components : ollama , openwebui and deepseek. Nice!

jonotime 1 year ago

Any idea what the power draw is? Resting power, vs resting with the model loaded in memory, vs full power computation. In case you may want to also run this as your desktop for basic web browsing etc when not LLMing.

jjallen 1 year ago

He links to a ram kit that is 8x 32Gb but says it should have 512 Gb ram. What gives? Also with 8 ram slots you would obviously need more than 256.

Is the setup disingenuous to get people excited about the post or what is going on here?

dave7 1 year ago
The board recommended has 16 ram slots.
- jjallen 1 year ago
  
  The ram kit is still 8x32gb so the price is lower than it actually would be.

bofadeez 1 year ago

so this $2k investment can substitute for $0.69 in API calls per day

natas 1 year ago

okay, but does it run crysis?

sAggymEllenZ 1 year ago

[dead]

wanse_yung 1 year ago

[flagged]

leobg 1 year ago

This is a verbatim quote from Jason Calacanis.

sroussey 1 year ago

If you really want to cheap—run it in your browser via webgpu.

zubiaur 1 year ago
Not comparable. That is a quantized distill.
- sroussey 1 year ago
  
  Oh yes, for sure. Don’t mean otherwise.
sroussey 1 year ago

I got downvoted, but you really can run it in the browser.
ONNX version: https://huggingface.co/onnx-community/DeepSeek-R1-Distill-Qw...

alecco 1 year ago

Affiliate link spam.

codetrotter 1 year ago
No. Affiliate link spam is when someone fills a page with content they stole, or it’s a bunch of nonsense that is stuffed to the brim with keywords, and they combine that with affiliate links.
Someone getting a dollar or two in return for you following an affiliate link after you read something they put real time and effort into to make it valuable info for others is not “affiliate link spam”.
- jjallen 1 year ago
  
  I’m fine with useful content linking to affiliate links too. I am still confused at the ram specs required and those they linked to being off by a factor of 2 though. IF the setup is not realistic or accurate then that wouldn’t be cool.
  
  1 reply →
- alecco 1 year ago
  
  This is a lame me-too page to make money with affiliate links. The actual specs were in the linked original tweet. So, IMHO, it is affiliate link spam.

H8crilA 1 year ago

Do we have any estimate on the size of OpenAI top of the line models? Would they also fit in ~512GB of (V)RAM?

Also, this whole self hosting of LLMs is a bit like cloud. Yes, you can do it, but it's a lot easier to pay for API access. And not just for small users. Personally I don't even bother self hosting transcription models which are so small that they can run on nearly any hardware.

GreenWatermelon 1 year ago
It's nice because a company can optionaly provide a SOTA reasoning model for their clients, without having to go through a middleman e.g. HR co. Can provide an LLM for their HRMS system for a small 2000$ investment. Not 2000/month, just a one time 2000 investment.
- NeutralCrane 1 year ago
  
  No one will be doing anything practical with a local version of Deepseek on a $2000 server. The token throughout of this thing is like, 1 token every 4 seconds. It would take nearly a full minute just to produce a standard “Roses are red, violets are blue” poem. There’s absolutely no practical usage that you can use that for. It’s cool that you can do it, and it’s a step in the right direction, but self-hosting these wont be a viable alternative to using providers like OpenAI for business applications for a while.
  
  4 replies →
yapyap 1 year ago

Is the size of OpenAI‘s top of the line models even relevant? Last I checked they weren’t open source in the slightest.
exe34 1 year ago

it would make sense if you don't want somebody else to have access to all your code and customer data.

1ba9115454 1 year ago

I can't imagine this setup will get more than 1 token per second.

I would love to see Deepseek running on premise with a decent TPS.

thomquaid 1 year ago
It says 4.25 TPS in the first para.
- ricardobeat 1 year ago
  
  Honest mistake. Some people think HN is just a series of short tweets and haven’t realized they are links yet!
  
  3 replies →
- weatherlight 1 year ago
  
  That's still pretty slow, considering there's that "thinking" phase.
  
  1 reply →
october8140 1 year ago
You can get 1t/s on a raspberry pi.
https://youtu.be/o1sN1lB76EA?si=i8ecEBjLdV0zewFQ
- klohto 1 year ago
  
  this has nothing to do with the full 671B and the ollama models are distilled qwen2.5
  
  1 reply →