PCIe 3.0 is the nice easy convenient generation where 1 lane = 1GBps. Given the overhead, that's pretty close to 10Gb Ethernet speeds (low latency though).
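For anyone who wants to check the per-lane figure, here is a minimal sketch of the arithmetic. It uses the published per-lane signalling rates and line encodings for each PCIe generation and ignores packet (TLP) overhead, so real throughput will be somewhat lower. The function and table names are made up for illustration.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency per generation.
PCIE_GENERATIONS = {
    "1.0": (2.5, 8 / 10),      # 8b/10b encoding
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),   # 128b/130b encoding
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
}


def lane_gbytes_per_s(gen: str) -> float:
    """Bandwidth of one lane in GB/s after line encoding, before protocol overhead."""
    gigatransfers, efficiency = PCIE_GENERATIONS[gen]
    return gigatransfers * efficiency / 8  # bits -> bytes


TEN_GBE_GBYTES_PER_S = 10 / 8  # raw 10 Gb/s line rate expressed in GB/s

for gen in PCIE_GENERATIONS:
    print(f"PCIe {gen} x1: {lane_gbytes_per_s(gen):.3f} GB/s")
print(f"10GbE line rate: {TEN_GBE_GBYTES_PER_S:.3f} GB/s")
```

So a PCIe 3.0 lane comes out just under 1 GB/s, a bit below 10GbE line rate, and the roughly 2 GB/s per lane quoted elsewhere in the thread corresponds to PCIe 4.0.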
I do wonder how long the cards are going to need host systems at all. We've already seen GPUs with M.2 SSDs attached! The Radeon Pro SSG hails back from 2016! You still need a way to get the model onto it in the first place and to get work in and out, but 1GbE and a small RISC-V chip (which Nvidia already uses for management cores) could suffice. Maybe even an rpi on the card. https://www.techpowerup.com/224434/amd-announces-the-radeon-...
Given the gobs of memory cards have, they probably don't even need storage; they just need big pipes. Intel had 100Gbe on their Xeon & Xeon Phi cores (10x what we saw here!) in 2016! GPUs that just plug into the switch and talk across 400Gbe or UltraEthernet or switched CXL, that run semi independently, feel so sensible, so not outlandish. https://www.servethehome.com/next-generation-interconnect-in...
It's far off for now, but flash makers are also looking at radically many-channel flash, High Bandwidth Flash, which can provide absurdly high GB/s, and potentially at integrating some extremely parallel tensor cores on each channel. Switching from DRAM to flash for AI processing could be a colossal win for fitting large models cost effectively (& perhaps power efficiently) while still having ridiculous gobs of bandwidth, with the possible added win of doing processing & filtering extremely near to the data. https://www.tomshardware.com/tech-industry/sandisk-and-sk-hy...
The most interesting takeaway for me is that PCIe bandwidth really doesn't bottleneck LLM inference for single-user workloads. You're essentially just shuttling the model weights once, then the GPU churns through tokens using its own VRAM.
This is huge for home lab setups. You can run a Pi 5 with a high-end GPU via external enclosure and get 90% of the performance of a full workstation at a fraction of the power draw and cost.
The multi-GPU results make sense too - without tensor parallelism, you're just doing pipeline parallelism across layers, which is inherently sequential. The GPUs are literally sitting idle waiting for the previous layer's output. Exo and similar frameworks are trying to solve this but it's still early days.
For anyone considering this: watch out for ResizeBAR requirements. Some older boards won't work at all without it.
I wish for a hardware + software solution to enable direct PCIe interconnect using lanes independent from the chipset/CPU. A PCIe mesh of sorts.
With the right software support from say pytorch this could suddenly make old GPUs and underpowered PCs like in TFA into very attractive and competitive solutions for training and inference.
PCIe already allows DMA between peers on the bus, but, as you pointed out, the traces for the lanes have to terminate somewhere. However, it doesn't have to be the CPU (which is, of course, the PCIe root in modern systems) handling the traffic - a PCIe switch may be used to facilitate DMA between devices attached to it, if it supports routing DMA traffic directly.
Now compare batched training performance. Or batched inference.
Of course prefill is going to be GPU bound. You only send a few thousand bytes to it, and don't really ask it to return much. But after prefill is done, unless you use batched mode, you aren't really using your GPU for anything more than its VRAM bandwidth.
I really would have liked to see gaming performance, although I realize it might be difficult to find a AAA game that supports ARM. (Forcing the Pi to emulate x86 with FEX doesn't seem entirely fair.)
Of course, just go to any computer store, where most gamer setups on affordable budgets go with the combo "beefy GPU + an i5" instead of using an Intel i7 or i9 CPU.
I personally find his work and his posts interesting, and enjoy seeing them pop up on HN.
If you prefer not to see his posts on the HN list pages, a practical solution is to use a browser extension (such as Stylus) to customise the HN styling to hide the posts.
Here is a specific CSS style which will hide submissions from Jeff's website:
In this example, I've made it almost invisible, whilst it still takes up space on the screen (to avoid confusion about the post number increasing from N to N+2). You could use { display: none } to completely hide the relevant posts.
The approach can be modified to suit any origin you prefer to not come across.
The limitation is that the style modification may need refactoring if HN changes the markup structure.
I stopped following this guy back in 2015, when he straight up forked all of my Ansible roles and then published everything to Ansible Galaxy before mine were even complete, tested and ready to be published. I then found that, the same day he forked them, a new GitHub organization with the name of the org I had used in my roles had been registered and then squatted. It completely turned me off to his methods.
At what point do the OEMs begin to realize they don’t have to follow the current mindset of attaching a GPU to a PC and instead sell what looks like a GPU with a PC built into it?
The vast majority of computers sold today have a CPU / GPU integrated together in a single chip. Most ordinary home users don't care about GPU or local AI performance that much.
In this video Jeff is interested in GPU accelerated tasks like AI and Jellyfin. His last video was using a stack of 4 Mac Studios connected by Thunderbolt for AI stuff.
https://www.youtube.com/watch?v=x4_RsUxRjKU
The Apple chips have powerful CPU and GPU cores but also a huge amount of directly connected memory (512GB), unlike most consumer-level Nvidia GPUs, which have far less memory.
Most ordinary home users don't care about GPU or local AI performance that much.
Right now, sure. There's a reason why chip manufacturers are adding AI pipelines, tensor processors, and 'neural cores' though. They believe that running small local models is going to be a popular feature in the future. They might be right.
Exactly. With the Intel-Nvidia partnership signed this September, I expect to see some high-performance single-board computers being released very soon. I don't think the ATX form factor will survive another 30 years.
One should also remember that Nvidia does have organisational experience in designing and building CPUs[0].
They were a pretty big deal back in ~2010, and I have to admit I didn't know that Tegra was powering Nintendo Switch.
0: https://en.wikipedia.org/wiki/Tegra
At this point what you really need is an incredibly powerful heatsink with some relatively small chips pressed against it.
The trashcan Mac Pro was this idea: a triangular heatsink core with the CPU and two GPUs, one on each side.
If you disassemble a modern GPU, that's what you'll find. 95% by weight of a GPU card is cooling related.
So basically going back to the old days of Amiga and Atari, in a certain sense, when PCs could only display text.
Those machines multiplexed the bus to split access to memory, because RAM speeds were competitive with or faster than the CPU bus speed. The CPU and VDP "shared" the memory, but only because CPUs were slow enough to make that possible.
We have had the opposite problem for 35+ years at this point. The newer architecture machines like the Apple machines, the GB10, the AI 395+ do share memory between GPU and CPU but in a different way, I believe.
I'd argue with memory becoming suddenly much more expensive we'll probably see the opposite trend. I'm going to get me one of these GB10 or Strix Halo machines ASAP because I think with RAM prices skyrocketing we won't be seeing more of this kind of thing in the consumer market for a long time. Or at least, prices will not be dropping any time soon.
I'm not familiar with that history. Could you elaborate?
They already have. They have the Jetson Line: https://en.wikipedia.org/wiki/Nvidia_Jetson
It's funny how ideas come and go. I made this very comment here on Hacker News probably 4-5 years ago and received a few down votes for it at the time (albeit that I was thinking of computers in general).
It would take a lot of work to make a GPU do current CPU type tasks, but it would be interesting to see how it changes parallelism and our approach to logic in code.
> I made this very comment here on Hacker News probably 4-5 years ago and received a few down votes for it at the time
HN isn't always very rational about voting. It would be a loss to judge any idea on that basis.
> It would take a lot of work to make a GPU do current CPU type tasks
In my opinion, that would be counterproductive. The advantage of GPUs is that they have a large number of very simple GPU cores. Instead, just do a few separate CPU cores on the same die, or on a separate die. Or you could even have a forest of GPU cores with a few CPU cores interspersed among them - sort of like how modern FPGAs have logic tiles, memory tiles and CPU tiles spread out on it. I doubt it would be called a GPU at that point.
Is there any need for that? Just have a few good CPUs there and you’re good to go.
As for what the HW looks like, we already know. Look at Strix Halo as an example. We are just getting bigger and bigger integrated GPUs. Most of the flops on that chip are in the GPU part.
It would just make everything worse. Some (if anything, most) tasks are far less parallelisable than typical GPU loads.
HN in general is quite clueless about topics like hardware, high performance computing, graphics, and AI performance. So you probably shouldn't care if you are downvoted, especially if you honestly know you are correct.
Also, I'd say if you buy, for example, a MacBook with an M4 Pro chip, it already is a big GPU attached to a small CPU.
Maybe at the point where you can run Python directly on the GPU. At which point the GPU becomes the new CPU.
Anyway, we're still stuck with "G" for "graphics" so it all doesn't make much sense and I'm actually looking for a vendor that takes its mission more seriously.
I mean, that's kind of what's going on at a certain level with the AMD Strix Halo, the NVIDIA GB10, and the newer Apple machines.
In the sense that the RAM is fully integrated, anyways.
Not sure what was unexpected about the multi GPU part.
It's very well known that most LLM frameworks, including llama.cpp, split models by layers, which creates a sequential dependency, so multi GPU setups are completely stalled unless there are n_gpu users/tasks running in parallel. It's also known that some GPUs are faster at "prompt processing" and some at "token generation", so combining Radeon and NVIDIA sometimes does something. Reportedly the inter-layer transfer sizes are in kilobyte ranges and PCIe x1 is plenty or something.
Avoiding that takes appropriate backends with "tensor parallel" mode support, which splits the neural network parallel to the direction of data flow. That obviously benefits substantially from a good interconnect between GPUs, like PCIe x16 or NVLink/Infinity Fabric bridge cables, and/or inter-GPU DMA over PCIe (called GPU P2P or GPUDirect or some lingo like that).
Absent those, I've read somewhere that people can sometimes see GPU utilization spikes walking over GPUs on nvtop-style tools.
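To make the "stalled unless there are n_gpu tasks" point concrete, here is a toy simulation of a layer-split decode. It is only a sketch under loud assumptions: every GPU holds an equal slice of the layers, each slice takes exactly one tick per token, transfers are free, and all inputs are positive integers. None of this models llama.cpp or any real scheduler; the function name is made up.

```python
def layer_split_utilization(n_gpus: int, n_requests: int, tokens_per_request: int) -> float:
    """Average GPU utilization for a toy layer-split (pipeline) decode."""
    stage = [0] * n_requests                      # GPU each request's current token needs next
    remaining = [tokens_per_request] * n_requests
    busy_ticks = 0
    ticks = 0
    while any(remaining):
        ticks += 1
        gpus_in_use = set()
        for r in range(n_requests):
            if remaining[r] == 0 or stage[r] in gpus_in_use:
                continue                          # finished, or that GPU is taken this tick
            gpus_in_use.add(stage[r])
            busy_ticks += 1
            if stage[r] == n_gpus - 1:            # token left the last GPU and can be sampled
                remaining[r] -= 1
                stage[r] = 0                      # the next token starts again at GPU 0
            else:
                stage[r] += 1
    return busy_ticks / (ticks * n_gpus)


for requests in (1, 2, 4):
    util = layer_split_utilization(n_gpus=4, n_requests=requests, tokens_per_request=10)
    print(f"4 GPUs, {requests} concurrent request(s): {util:.0%} busy")
```

With a single request each GPU is busy a quarter of the time, which matches the utilization spike "walking" across GPUs; only with as many concurrent requests as GPUs does the pipeline fill up.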
Looking for a way to break up tasks for LLMs so that there will be multiple tasks to run concurrently would be interesting, maybe like creating one "manager" and few "delegated engineers" personalities. Or simulating multiple different domains of brain such as speech center, visual cortex, language center, etc. communicating in tokens might be interesting in working around this problem.
There are some technical implementations that make it more efficient, like EXO [1]. Jeff Geerling recently did a review of a 4 Mac Studio cluster with RDMA support, and you can see that EXO has a noticeable advantage [2].
[1] https://github.com/exo-explore/exo [2] https://www.youtube.com/watch?v=x4_RsUxRjKU
At this point I'd consider a cluster of top-specced Mac Studios to be worthwhile in production. I just need to host them properly in a rack and in a co-lo data center.
> Looking for a way to break up tasks for LLMs so that there will be multiple tasks to run concurrently would be interesting, maybe like creating one "manager" and few "delegated engineers" personalities.
This is pretty much what "agents" are for. The manager model constructs prompts and contexts that the delegated models can work on in parallel, returning results when they're done.
> Reportedly the inter-layer transfer sizes are in kilobyte ranges and PCIe x1 is plenty or something.
Not an expert, but napkin math tells me that more often than not this will be on the order of megabytes, not kilobytes, since it scales with sequence length.
Example: Qwen3 30B has a hidden state size of 5120; even if quantized to 8 bits, that's 5120 bytes per token. It would pass the MB boundary with just a little over 200 tokens. Still not much of an issue when a single PCIe lane is ~2GB/s.
I think device to device latency is more of an issue here, but I don't know enough to assert that with confidence.
For each token generated, you only send one token’s worth between layers; the previous tokens are in the KV cache.
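The two views above can be reconciled with a few lines of arithmetic. This is a sketch using the numbers quoted in the thread (hidden size 5120, 8-bit activations, ~2 GB/s per lane); the 4096-token prompt length is a made-up example, and the result is the size of one hand-off at one split point between two GPUs.

```python
import math


def activation_bytes(hidden_size: int, tokens: int, bytes_per_value: int = 1) -> int:
    """Bytes handed from one GPU to the next at a single layer-split boundary."""
    return hidden_size * tokens * bytes_per_value


def transfer_ms(n_bytes: int, link_gbytes_per_s: float) -> float:
    """Time to push n_bytes over the link, ignoring latency and protocol overhead."""
    return n_bytes / (link_gbytes_per_s * 1e9) * 1e3


HIDDEN_SIZE = 5120        # figure quoted in the thread
LINK_GBYTES_PER_S = 2.0   # "~2GB/s" single-lane figure quoted in the thread
PROMPT_TOKENS = 4096      # hypothetical prompt length, for illustration only

prefill = activation_bytes(HIDDEN_SIZE, PROMPT_TOKENS)   # whole prompt at once
decode = activation_bytes(HIDDEN_SIZE, 1)                # one new token; the rest is in the KV cache
tokens_to_reach_1_mib = math.ceil(2**20 / HIDDEN_SIZE)

print(f"prefill: {prefill / 2**20:.1f} MiB, {transfer_ms(prefill, LINK_GBYTES_PER_S):.2f} ms")
print(f"decode:  {decode} bytes per token, {transfer_ms(decode, LINK_GBYTES_PER_S) * 1e3:.2f} us")
print(f"tokens needed to reach 1 MiB: {tokens_to_reach_1_mib}")
```

Under these assumptions prefill moves megabytes once per prompt, while each generated token moves only kilobytes, so raw bandwidth matters far less during decode than per-transfer latency.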
> Not sure what was unexpected about the multi GPU part. It's very well known that most LLM frameworks including llama.cpp splits models by layers, which has sequential dependency, and so multi GPU setups are completely stalled
Oh, I thought the point of transformers was being able to split the load vertically to avoid sequential dependencies. Is it true just for training or not at all?
Just for training and for processing the existing context (the prefill phase). But when doing inference, token t has to be sampled before token t+1 can be, so it’s still sequential.
I've been kicking this around in my head for a while. If I want to run LLMs locally, a decent GPU is really the only important thing. At that point, the question becomes, roughly, what is the cheapest computer to tack on the side of the GPU? Of course, that assumes that everything does in fact work; unlike OP I am barely in a position to understand e.g. BAR problems, let alone try to fix them, so what I actually did was build a cheap-ish x86 box with a half-decent GPU and call it a day :) But it still is stuck in my brain: there must be a more efficient way to do this, especially if all you need is just enough computer to shuffle data to and from the GPU and serve that over a network connection.
I run a crowd sourced website to collect data on the best and cheapest hardware setup for local LLM here: https://inferbench.com/
Source code: https://github.com/BinSquare/inferbench
Cool site, I noticed the 3090 is on there twice.
https://inferbench.com/gpu/NVIDIA%20GeForce%20RTX%203090
https://inferbench.com/gpu/NVIDIA%20RTX%203090
Nice! Though for older hardware it would be nice if the price reflected the current second-hand market (harder to get data for, I know). E.g. the Nvidia RTX 3070 ranks as second-best GPU in tok/s/$ even at the MSRP of $499. But you can get one for half that now.
It seems like verification might need to be improved a bit? I looked at Mistral-Large-123B. Someone is claiming 12 tokens/sec on a single RTX 3090 at FP16.
Perhaps some filter could cut out submissions that don't really make sense?
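One possible filter, sketched below, is a pair of napkin checks: do the weights fit in VRAM, and is the claimed speed below the memory-bandwidth ceiling for single-user decode (every weight read once per token)? This is not how inferbench works as far as I know; the function names are hypothetical, and the rule assumes a dense model with no batching, speculative decoding, or offloading. The RTX 3090 figures (24 GB, about 936 GB/s) are its published specs; the 8B model in the second check is a made-up example.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Size of the model weights in GB (1e9 params * bytes each)."""
    return params_billion * bytes_per_param


def decode_ceiling_tok_s(params_billion: float, bytes_per_param: float,
                         mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream tokens/s if every weight is read once per token."""
    return mem_bandwidth_gb_s / weights_gb(params_billion, bytes_per_param)


def is_plausible(params_billion: float, bytes_per_param: float, vram_gb: float,
                 mem_bandwidth_gb_s: float, claimed_tok_s: float) -> bool:
    """Reject claims where the weights don't fit or the speed beats the bandwidth ceiling."""
    if weights_gb(params_billion, bytes_per_param) > vram_gb:
        return False
    return claimed_tok_s <= decode_ceiling_tok_s(
        params_billion, bytes_per_param, mem_bandwidth_gb_s)


# The submission from the thread: 123B parameters at FP16 (2 bytes) on one RTX 3090.
print(f"weights: {weights_gb(123, 2):.0f} GB vs 24 GB of VRAM")
print(f"bandwidth ceiling: {decode_ceiling_tok_s(123, 2, 936):.1f} tok/s vs 12 claimed")
print("plausible:", is_plausible(123, 2, vram_gb=24, mem_bandwidth_gb_s=936, claimed_tok_s=12))
```

The claim fails both checks: 246 GB of weights cannot fit in 24 GB, and even if they could, the bandwidth ceiling would be under 4 tokens/sec.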
This problem was already solved 10 years ago - crypto mining motherboards, which have a large number of PCIe slots, a CPU socket, one memory slot, and not much else.
> Asus made a crypto-mining motherboard that supports up to 20 GPUs
https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...
For LLMs you'll probably want a different setup, with some memory too, some m.2 storage.
Those only gave each GPU a single PCIe lane though, since crypto mining barely needed to move any data around. If your application doesn't fit that mould then you'll need a much, much more expensive platform.
In theory, it’s only sufficient for pipeline parallel due to limited lanes and interconnect bandwidth.
Generally, scalability on consumer GPUs falls off between 4-8 GPUs for most. Those running more GPUs are typically using a higher quantity of smaller GPUs for cost effectiveness.
M.2 is mostly just a different form factor for PCIe anyway.
There is a whole section in here on how to spec out a cheap rig and what to look for:
* https://jabberjabberjabber.github.io/Local-AI-Guide/
We're not yet at the point where a single PCIe device will get you anything meaningful; IMO 128 GB of RAM available to the GPU is essential.
So while you don't need a ton of compute on the CPU, you do need the ability to address multiple PCIe lanes. A relatively low-spec AMD EPYC processor is fine if the motherboard exposes enough lanes.
There is plenty that can run within 32/64/96 GB of VRAM. IMO models like Phi-4 are underrated for many simple tasks. Some quantized Gemma 3 models are quite good as well.
There are larger/better models as well, but those tend to really push the limits of 96 GB.
FWIW when you start pushing into 128 GB+, the ~500 GB models really start to become attractive because at that point you’re probably wanting just a bit more out of everything.
I'm holding out for someone to ship a gpu with dimm slots on it.
And you don’t want to go the M4 Max/M3 Ultra route? It works well enough for most mid sized LLMs.
Get the DGX Spark computers? They’re exactly what you’re trying to build.
They’re very slow.
Datapoints like this really make me reconsider my daily driver. I should be running one of those $300 mini PCs at <20W. With ~flat CPU performance gains, would be fine for the next 10 years. Just remote into my beefy workstation when I actually need to do real work. Browsing the web, watching videos, even playing some games is easily within their wheelhouse.
> I should be running one of those $300 mini PCs at <20W.
Yes. They're basically laptop chips at this point. The thermals are worse but the chips are perfectly modern and can handle reasonably large workloads. I've got an 8 core Ryzen 7 with Radeon 780 Graphics and 96GB of DDR5. Outside of AAA gaming this thing is absolutely fine.
The power draw is a huge win for me. It's like 6W at idle. I live remotely so grid power is somewhat unreliable and saving watts when using solar batteries extends their lifetime massively. I'm thrilled with them.
Switching from my 8-core ryzen minipc to an 8-core ryzen desktop makes my unit tests run way faster. TDP limits can tip you off to very different performance envelopes in otherwise similar spec CPUs.
A full-size desktop computer will always be much faster for any workload that fully utilizes the CPU.
However, a full-size desktop computer seldom makes sense as a personal computer, i.e. as the computer that interfaces to a human via display, keyboard and graphic pointer.
For most of the activities done directly by a human, i.e. reading & editing documents, browsing Internet, watching movies and so on, a mini-PC is powerful enough. The only exception is playing games designed for big GPUs, but there are many computer users who are not gamers.
In most cases the optimal setup is to use a mini-PC as your personal computer and a full-size desktop as a server on which you can launch any time-consuming tasks, e.g. compilation of big software projects, EDA/CAD simulations, testing suites etc.
The desktop used as server can use Wake-on-LAN to stay powered off when not needed and wake up whenever it must run some task remotely.
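For anyone who wants to script the wake-up step, here is a minimal sketch of sending a Wake-on-LAN magic packet from the mini-PC. It assumes the desktop's NIC and firmware have Wake-on-LAN enabled and that both machines sit on the same broadcast domain. The MAC address in the usage line is a placeholder, and the function names are my own.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN magic packet for a MAC address."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Magic packet layout: 6 bytes of 0xFF, then the MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common convention)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


if __name__ == "__main__":
    wake("00:11:22:33:44:55")  # placeholder MAC; replace with the desktop's
```

After the packet is sent the desktop still needs time to boot, so a remote job launcher would typically retry its SSH connection for a minute or so.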
Even if you could cool the full TDP in a micro PC, in a full size desktop you might be able to use a massive AIO radiator with fans running at very slow, very quiet speeds instead of jet turbine howl in the micro case. The quiet and ease of working in a bigger space are mostly a good tradeoff for a slightly larger form factor under a desk.
As an experiment, I decided to try using a Proxmox VM, with an eGPU and a USB bus passed through to it, as my main PC for browsing and working on hobby projects.
It’s just 1 vCPU with 4 GB of RAM, and you know what? It’s more than enough for these needs. I think hardware manufacturers falsely convinced us that every professional needs a beefy laptop to be productive.
For just basic windows desktop stuff, a $200 NUC has been good enough for like 15 years now.
That's why I use an M2 (not even Pro) Mac Mini as a terminal and remote into other boxes when needed.
I went with a beelink for this purpose. Works great.
Keeps the desk nice and tidy while “the beasts” roar in a soundproofed closet.
Another benefit is low noise. Many consider fan noise under load to be the most important property of a workstation.
Slapping $300 worth of solar panels on your roof/balcony will probably get you ahead on power usage
PCIe 3.0 is the nice, easy, convenient generation where 1 lane ≈ 1 GB/s. Given the overhead, that's pretty close to 10Gb Ethernet speeds (low latency, though).
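To put rough numbers on that, here is a small sketch of the per-lane arithmetic. It only accounts for the line encoding (8b/10b for PCIe 1.x and 2.0, 128b/130b for 3.0 and 4.0) and ignores packet and protocol overhead, so real throughput will be somewhat lower. The function name is my own.

```python
def pcie_lane_gbytes_per_s(gigatransfers_per_s: float,
                           payload_bits: int,
                           encoded_bits: int) -> float:
    """Usable GB/s of one PCIe lane after line-encoding overhead."""
    # One transfer moves one bit per lane; encoding spends some bits on framing.
    usable_gbits = gigatransfers_per_s * payload_bits / encoded_bits
    return usable_gbits / 8


generations = {
    "PCIe 1.x": (2.5, 8, 10),
    "PCIe 2.0": (5.0, 8, 10),
    "PCIe 3.0": (8.0, 128, 130),
    "PCIe 4.0": (16.0, 128, 130),
}

ten_gbe_gbytes = 10.0 / 8  # raw 10GbE line rate in GB/s

for name, (gt, payload, encoded) in generations.items():
    rate = pcie_lane_gbytes_per_s(gt, payload, encoded)
    print(f"{name}: {rate:.3f} GB/s per lane "
          f"({rate / ten_gbe_gbytes:.0%} of raw 10GbE)")
```

By this estimate a single PCIe 3.0 lane lands just under 1 GB/s, which is about four fifths of the raw 10GbE line rate.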
I do wonder how long the cards are going to need host systems at all. We've already seen GPUs with M.2 SSDs attached: the Radeon Pro SSG dates back to 2016! You still need a way to get the model onto the card in the first place, and to get work in and out, but 1GbE and a small RISC-V chip (which Nvidia already uses for management cores) could suffice. Maybe even an RPi on the card. https://www.techpowerup.com/224434/amd-announces-the-radeon-...
Given the gobs of memory cards have, they probably don't even need storage; they just need big pipes. Intel had 100GbE on their Xeon & Xeon Phi cores (10x what we saw here!) in 2016! GPUs that just plug into the switch and talk across 400GbE or UltraEthernet or switched CXL, and that run semi-independently, feel so sensible, so not outlandish. https://www.servethehome.com/next-generation-interconnect-in...
It's far off for now, but flash makers are also looking at flash with radically many channels, High Bandwidth Flash, which can provide absurdly high GB/s, potentially with some extremely parallel tensor cores integrated on each channel. Switching from DRAM to flash for AI processing could be a colossal win for fitting large models cost effectively (and perhaps power efficiently) while still having ridiculous gobs of bandwidth, with the added possible win of doing processing and filtering extremely near the data. https://www.tomshardware.com/tech-industry/sandisk-and-sk-hy...
The most interesting takeaway for me is that PCIe bandwidth really doesn't bottleneck LLM inference for single-user workloads. You're essentially just shuttling the model weights once, then the GPU churns through tokens using its own VRAM.
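A back-of-the-envelope sketch of why the link matters so little once the weights are resident. All the numbers are assumptions for illustration: a hypothetical 16 GB model, a 128,000-entry vocabulary with float32 logits returned per token, and roughly 0.985 GB/s for a single PCIe 3.0 lane.

```python
PCIE3_X1_GB_PER_S = 0.985  # assumed usable rate of one PCIe 3.0 lane


def transfer_seconds(num_bytes: float, link_gb_per_s: float) -> float:
    """Time to move num_bytes across a link, ignoring latency and overhead."""
    return num_bytes / (link_gb_per_s * 1e9)


# Hypothetical model, not taken from the video.
weights_bytes = 16e9          # one-time upload of the weights
vocab_size = 128_000
logits_bytes = vocab_size * 4  # float32 logits sent back for each token

load_s = transfer_seconds(weights_bytes, PCIE3_X1_GB_PER_S)
per_token_s = transfer_seconds(logits_bytes, PCIE3_X1_GB_PER_S)

print(f"one-time weight upload: {load_s:.1f} s")
print(f"per-token bus traffic:  {per_token_s * 1e3:.2f} ms")
```

Under these assumptions the upload costs seconds once, while the per-token traffic is well under a millisecond, so the narrow link mostly shows up as a slower model load.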
This is huge for home lab setups. You can run a Pi 5 with a high-end GPU via external enclosure and get 90% of the performance of a full workstation at a fraction of the power draw and cost.
The multi-GPU results make sense too - without tensor parallelism, you're just doing pipeline parallelism across layers, which is inherently sequential. The GPUs are literally sitting idle waiting for the previous layer's output. Exo and similar frameworks are trying to solve this but it's still early days.
For anyone considering this: watch out for Resizable BAR requirements. Some older boards won't work at all without it.
So glad someone did this. I've been running big GPUs in eGPU enclosures connected to spare laptops and wondering, why not Pis?
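The idle-GPU effect can be illustrated with a toy simulation. This is an idealized model, not how any particular framework schedules work: every pipeline stage takes one time step, transfers are free, and each request moves to the next stage every step and wraps back to stage 0 for its next token. The function name is my own.

```python
def stage_utilization(num_gpus: int, num_requests: int, steps: int) -> list[float]:
    """Fraction of time steps each GPU (pipeline stage) is busy."""
    if not 1 <= num_requests <= num_gpus:
        raise ValueError("this toy model assumes 1 <= num_requests <= num_gpus")
    busy = [0] * num_gpus
    for t in range(steps):
        for r in range(num_requests):
            if t >= r:  # request r enters the pipeline at step r
                # An autoregressive request occupies exactly one stage at a time.
                busy[(t - r) % num_gpus] += 1
    return [b / steps for b in busy]


print(stage_utilization(num_gpus=4, num_requests=1, steps=400))
print(stage_utilization(num_gpus=4, num_requests=4, steps=400))
```

With a single request each of the four GPUs is busy a quarter of the time. Keeping as many independent requests in flight as there are stages fills the pipeline, which is why this layout suits multi-user serving better than single-user chat.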
Really why have the PCI/CPU artifice at all? Apple and Nvidia have the right idea: put the MPP on the same die/package as the CPU.
> put the MPP on the same die/package as the CPU.
That would help in latency-constrained workloads, but I don't think it would make much of a difference for AI or most HPC applications.
We need low-power CPUs with high PCIe lane counts for that, purely for shoving models from NVMe to GPU.
I wish for a hardware + software solution to enable direct PCIe interconnect using lanes independent from the chipset/CPU. A PCIe mesh of sorts.
With the right software support from say pytorch this could suddenly make old GPUs and underpowered PCs like in TFA into very attractive and competitive solutions for training and inference.
PCIe already allows DMA between peers on the bus, but, as you pointed out, the traces for the lanes have to terminate somewhere. However, it doesn't have to be the CPU (which is, of course, the PCIe root in modern systems) handling the traffic - a PCIe switch may be used to facilitate DMA between devices attached to it, if it supports routing DMA traffic directly.
That’s what happened in TFA.
Now compare batched training performance. Or batched inference.
Of course prefill is going to be GPU bound. You only send a few thousand bytes to it, and don't really ask it to return much. But after prefill is done, unless you use batched mode, you aren't really using your GPU for anything more than its VRAM bandwidth.
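A rough sketch of that bandwidth-bound regime. It assumes a dense model whose weights are all read once per decoding step, ignores KV-cache reads and the compute ceiling, and uses made-up numbers (16 GB of weights, 1000 GB/s of VRAM bandwidth). The function name is my own.

```python
def decode_tokens_per_s(weights_gb: float,
                        vram_gb_per_s: float,
                        batch_size: int = 1) -> float:
    """Upper bound on decode throughput when reading weights is the bottleneck."""
    # Each step streams all weights from VRAM once and yields one token
    # per sequence in the batch, so batching amortizes the same read.
    steps_per_s = vram_gb_per_s / weights_gb
    return steps_per_s * batch_size


# Hypothetical card and model, for illustration only.
for batch in (1, 8, 32):
    rate = decode_tokens_per_s(weights_gb=16, vram_gb_per_s=1000, batch_size=batch)
    print(f"batch {batch:>2}: up to {rate:.1f} tokens/s in aggregate")
```

In practice the linear scaling stops once the batch is large enough to make the step compute bound, which is where a benchmark of batched inference or training would start to separate the GPUs and hosts.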
I really would have liked to see gaming performance, although I realize it might be difficult to find a AAA game that supports ARM. (Forcing the Pi to emulate x86 with FEX doesn't seem entirely fair.)
You might have to thread the needle to find a game which does not bottleneck on the CPU.
I currently have a £500 laptop hooked up to an egpu box with a £700 gpu. It's not a bad setup.
What about constrained decoding (with JSON schemas)? I noticed my vLLM instance is using 100% of one CPU core.
Of course, just go to any computer store, where most gamer setups on affordable budgets pair a beefy GPU with an i5 instead of an Intel i7 or i9.
I'd be interested to see if workloads like Folding@home could be efficiently run this way. I don't think they need a lot of bandwidth.
tired of jeff glinglin everywhere...
I personally find his work and his posts interesting, and enjoy seeing them pop up on HN.
If you prefer not to see his posts on the HN list pages, a practical solution is to use a browser extension (such as Stylus) to customise the HN styling to hide the posts.
Here is a specific CSS style which will hide submissions from Jeff's website:
In this example, I've made it almost invisible, whilst it still takes up space on the screen (to avoid confusion about the post number increasing from N to N+2). You could use { display: none } to completely hide the relevant posts.
The approach can be modified to suit any origin you prefer to not come across.
The limitation is that the style modification may need refactoring if HN changes the markup structure.
You're awesome, thank you.
I stopped following this guy back in 2015 when he straight up forked all of my Ansible roles and then published everything to Ansible Galaxy before mine were even complete, tested, and ready to be published. The same day he forked them, I found that a new GitHub organization with the name of the org I had used in my roles had been registered and then squatted. It completely turned me off to his methods.
I only ever see him on HN. He's smart, kind, and talks about interesting things. Are you sure what you're feeling isn't envy?