Comment by cafkafk

9 hours ago

Hi HN. I wrote this post after getting frustrated by the lack of ways to run the new Gemma 4 Drafter models, with mainstream tools not prioritizing them and hiding all the performance levers.

I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.

I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention; see other posts for more details.

I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I'll try to answer on a best-effort basis.

40 comments


sireat 15 minutes ago

Fantastic practical achievement!

I wonder if I could get similar or even better performance from a similar Dell T7610 workstation with dual Xeons and also 128GB DDR3?

The CPUs are better core-wise, but that probably does not make much difference?

It has:

    CPUs: 2 × Xeon E5-2697 v2
    Cores / threads: 24 cores / 48 threads total
    Per-CPU cores: 12 cores / 24 threads
    Base clock: 2.70 GHz
    Max turbo: 3.50 GHz

It is sitting gathering dust, but reading-speed Gemma sounds promising.

Sweepi 6 hours ago

"-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other."

But... isn't that a classic use case for SMT? Giving T1 something to do while T0 is waiting on DDR(3), and vice versa?

I also don't understand the explanation of "--cpu-moe". If an expert has ~4.0 GiB of parameters, why does optimizing the sequence of experts minimize cache thrashing? With 20 MiB of L3 cache vs. 4.0 GiB of parameters, it won't cache any noticeable amount of the parameters, will it?

As mentioned by others, only some Intel Xeon E5-2xxx v4 parts supported DDR3, and according to Intel, the E5-2620 v4 is not one of them.

zamadatix 5 hours ago
> But... isn't that a classic use case for SMT? Giving T1 something to do while T0 is waiting on DDR(3), and vice versa?
Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.
- ethbr1 2 hours ago
  
  This is ironically a pretty solid use case for (ex VLIW research) ILP-optimizing compilers.
  Given knowable runtime hardware usage patterns (huge bursts of memory bandwidth saturation) and a single limited core/thread-shared resource (memory bandwidth), one could optimize for the constraint ahead of runtime.
  Most of the performance optimization levers you have available to pull are: (a) trade compute for memory bandwidth (e.g. compression), (b) preload when memory bandwidth is available, (c) optimize the choice of what's in cache when, (d) align to cache size / memory boundaries.
  Or tl;dr, try to approximate GPU ISAs at the CPU compiler level. (Which raises the question of why anyone but hobbyists would bother, since everyone else just buys pallets of Nvidia/AMD or designs their own ML chips.)
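
To put the "full bus" point above into numbers, here is a back-of-the-envelope sketch. It assumes decoding is purely bandwidth-bound and that each generated token streams roughly the ~4.0 GiB of parameters quoted earlier in the thread. The bandwidth values are hypothetical placeholders, not measurements from the OP's machine, and `max_tokens_per_second` is an illustrative helper name.

```python
GIB = 1024 ** 3


def max_tokens_per_second(bandwidth_gb_s: float, bytes_read_per_token: float) -> float:
    """Naive upper bound on decode speed when every token must stream
    its active weights from RAM once (bandwidth given in GB/s, 1 GB = 1e9 bytes)."""
    return bandwidth_gb_s * 1e9 / bytes_read_per_token


# ~4.0 GiB is the figure quoted in the thread; treating it as the bytes
# read per generated token is an assumption.
active_bytes = 4.0 * GIB

# Hypothetical sustained memory bandwidths, in GB/s.
for bandwidth in (20.0, 40.0, 60.0):
    bound = max_tokens_per_second(bandwidth, active_bytes)
    print(f"{bandwidth:5.1f} GB/s -> at most {bound:.1f} tokens/s")
```

Thread count does not appear in this bound at all, which is the point of the analogy: once the memory bus is saturated, extra SMT threads have nothing left to hide. Speculative decoding can beat the naive figure because one pass over the weights may verify several drafted tokens.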

Lerc 1 hour ago

This seems remarkably suited to my situation:

    CPU(s): 32
      On-line CPU(s) list: 0-31
    Vendor ID: GenuineIntel  
    Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz

Also with 128G. Do 8 DIMM sockets imply more actual bandwidth in practice?

This poor thing is currently a YouTube watching box.

miahi 20 minutes ago

One thing to note: These Xeons have quad-channel memory, which usually means double the bandwidth of an equivalent desktop CPU if you populate all the slots.
I have a dual E5-2667 v2 server with 512GB DDR3 and it's quite nice; the memory bandwidth is higher than that of a DDR4 desktop with a way newer CPU, even though it's ECC and registered.
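
As a rough sketch of why channel count matters: theoretical peak bandwidth is channels × transfer rate × bus width, with a 64-bit (8-byte) bus per channel for both DDR3 and DDR4. The two configurations below are illustrative assumptions, not measurements of anyone's machine in this thread, and sustained bandwidth in practice is lower than the peak.

```python
def peak_bandwidth_gb_s(channels: int, megatransfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    bus_bytes=8 reflects the 64-bit data bus of one DDR3/DDR4 channel.
    """
    return channels * megatransfers_per_s * 1e6 * bus_bytes / 1e9


# Assumed example configurations (per CPU socket).
quad_ddr3 = peak_bandwidth_gb_s(channels=4, megatransfers_per_s=1866)
dual_ddr4 = peak_bandwidth_gb_s(channels=2, megatransfers_per_s=3200)

print(f"quad-channel DDR3-1866: {quad_ddr3:.1f} GB/s")
print(f"dual-channel DDR4-3200: {dual_ddr4:.1f} GB/s")
```

Under these assumptions the older quad-channel setup comes out slightly ahead on paper, and it only holds when all four channels are actually populated.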

gdjdhdheb 6 hours ago

Are you sure you've got DDR3? I have two E5 v4 rigs at home and both have DDR4... unless I'm wrong and 2011-3 supports both DDR3 and DDR4.

duffyjp 5 minutes ago

I won't speak for cafkafk, but I have two E5 (v3/v4) systems, one on DDR4 and one on DDR3. CPUs of this generation all support DDR4, but a few SKUs also support DDR3. ChatGPT told me they were niche products to meet specific customer needs.
I just picked up the DDR3 board, an Aliexpress "XD3", so I could reuse some DDR3 RAM on a better CPU. Quad-channel 1866MT/s is not bad!
lightedman 6 hours ago
The first two generations supported DDR3 only. Haswell (v3) and Broadwell (v4) brought DDR4 support.
- _zoltan_ 4 hours ago
  
  right, and they talk about "v4" which is DDR4.

fragmede 9 hours ago

(purple on black is really hard to read)

You say it runs "at reading speed". Have you benchmarked it?

cafkafk 8 hours ago
> (purple on black is really hard to read)
Noted, and agreed (it looks like it has also already been clicked, which I dislike). I honestly need to redo the themes.
> You say it runs "at reading speed". Have you benchmarked it?
At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:
    llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf \
      --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
      --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color \
      -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 \
      --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts \
      --special --mlock --run-time-repack --spec-autotune --no-kv-offload \
      --parallel 8 --jinja -p "Why is the sky blue?" -n 128
Gives:
    llama_print_timings: load time = 83911.65 ms
    llama_print_timings: sample time = 26.99 ms / 128 runs ( 0.21 ms per token, 4742.15 tokens per second)
    llama_print_timings: prompt eval time = 343.41 ms / 7 tokens ( 49.06 ms per token, 20.38 tokens per second)
    llama_print_timings: eval time = 10639.36 ms / 127 runs ( 83.77 ms per token, 11.94 tokens per second)
    llama_print_timings: total time = 11114.98 ms / 134 tokens
So 11.94 tokens per second while it's also playing binary cache and CI builder.
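
For anyone who wants to recompute these rates from a saved log, here is a minimal sketch. The regex and the `tokens_per_second` helper are illustrative and are written only against the `llama_print_timings` lines quoted above; other builds may print a different format.

```python
import re

# Matches the "prompt eval time" and "eval time" lines of the output above.
# "prompt eval" is listed first so it is not mistaken for plain "eval".
TIMING_RE = re.compile(
    r"(?P<name>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<n>\d+) (?:tokens|runs)"
)


def tokens_per_second(log: str) -> dict:
    """Return {"prompt eval": rate, "eval": rate} computed from elapsed ms and counts."""
    rates = {}
    for match in TIMING_RE.finditer(log):
        seconds = float(match.group("ms")) / 1000.0
        rates[match.group("name")] = int(match.group("n")) / seconds
    return rates


# The two relevant lines, copied from the run above.
LOG = (
    "llama_print_timings: prompt eval time = 343.41 ms / 7 tokens "
    "( 49.06 ms per token, 20.38 tokens per second)\n"
    "llama_print_timings: eval time = 10639.36 ms / 127 runs "
    "( 83.77 ms per token, 11.94 tokens per second)\n"
)

for name, rate in tokens_per_second(LOG).items():
    print(f"{name}: {rate:.2f} tokens/s")
```

The "prompt eval" rate is prompt processing and the "eval" rate is generation, so the 11.94 figure is the generation speed.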
When I do it properly, I'll add it to the blog as well!
- fhars 4 hours ago
  
  And if you ever run out of things to do in your copious free time, it looks like that PR #1744 was merged without the has_target_ctx assert two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-).
  
- bbatha 2 hours ago
  
  What's time to first token? Raw throughput is usually not the problem in local setups in my experience.
- anon-3988 7 hours ago
  
  I am pretty sure llama.cpp has its own benchmarking binary that you can use.
  
- ekianjo 6 hours ago
  
  20 tokens per second for prompt eval is the killer here. It means you can't use this to process any meaningful amount of text.
  A GPU typically processes close to 1000 tokens/s during prompt eval.
  

dark-star 4 hours ago

Something doesn't add up here. As someone who has only recently built a home server from an E5-26xx v2 on DDR3 RAM (because I have a sh*tload of 32 GB DDR3 DIMMs), I can confidently say that the newer cores (E5-26xx v3 and v4) only run on DDR4 memory...

So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3).

Nothing else works.

mwpmaybe 2 hours ago

There are some OEM-only v3/v4 parts with dual memory controllers (because of a RAM supply crunch at the time, funnily enough), but the E5-2620 v4 is not one of them. The classic example is the very popular 12-core E5-2678 v3.
robeastham 1 hour ago
This is not true. A few well-known brands made both DDR3 and DDR4 servers that support v3 & v4 chips. Ask me how I know :-)
- smartbit 40 minutes ago
  
  enlighten us
happycube 3 hours ago

It looks like Supermicro had some DDR3 Xeon v3/v4 boards, and the first thing that came to mind was a Shenzhen workstation/gaming board using recycled parts... I haven't searched for that, but it's bound to exist.
justinclift 3 hours ago

Yeah, the Intel reference page only lists DDR4, not DDR3:
https://www.intel.com/content/www/us/en/products/sku/92986/i...
TacticalCoder 4 hours ago

> So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)
Yup, that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4, and as far as I've seen, all v4 Xeons support DDR4 only.
Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).

arpinum 7 hours ago

How many watts does that setup draw? Cool that you got it to work, but it may only be useful for vintage / retro computing rather than practical use if the energy consumption makes it economically wasteful.

vetrom 1 hour ago

IDK about OP's setup, but I run a pile of E5-2683v4 Xeon recycled servers for Ceph and self-hosted business SaaS usage.
One node's ipmitool sensor report (from a self-monitoring PSU, so take it with a grain of salt, but my UPS-side monitoring tracks closely) shows 250-300 W average power use. Mind you, that is for running 22 spinning disks, 2 SAS/SATA SSDs, 4 NVMe SSDs, and 768GB of DDR4.
Mid-gen 2015ish Xeons were not great at power reduction, but if you are pegging the cores, they were never particularly slow, and they did have lots of PCIe lanes. This boils down to the CPU/mobo itself not being that big a cost floor, especially if you have high utilization rates.
As a comparison, my main desktop development machine, running a Threadripper 9970X, 128GB of DDR5, an RDNA4 GPU, and a small pile of NVMe drives, has a power floor of roughly 250W. You'll definitely lose out on some CPU-centric workloads on the older generations of machines, but they are by no means impractical.
Maybe for a desktop use case they are absolutely suboptimal nowadays, but for a lot of real-world use cases I would say they're still relevant.
---
As the author shows for the LLM use case, I think matching the hardware choice to the application and not leaving levers unpulled is key, especially considering how wide a variety of bandwidth/power draw/peak frequency/core count SKUs exist in the Xeon lines. Without knowing what you intend to run and fitting the correct processor to it, you will end up with a disappointingly poor environment fit.
RetroTechie 3 hours ago
How many kWh to fabricate a brand new machine better suited to the task?
As long as performance is usable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.
Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.
- dangus 3 hours ago
  
  I don’t know why you’d assume that an older system is lower footprint.
  If you’ve got something consuming 100 watts average over your 24 hour period, and your electricity costs 20 cents per kWh, you’re already spending almost as much as a Claude subscription just on electricity.
  And that assumes your hardware never fails and you never incur any additional costs.
  There’s a big reason why newer more efficient hardware is in demand. Something that’s 10+ years old has drastically worse performance per watt.
  Obviously I am not saying to throw away your old hardware as a rule but there is a point where some of this old stuff just isn’t even worth running.
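
The electricity arithmetic in this sub-thread can be sketched as follows, assuming a 30-day month and a constant average draw. `monthly_cost` is an illustrative helper; the 100 W and 20 cents per kWh inputs are the figures from the comment above, and 300 W is the upper end of the draw reported elsewhere in the thread for a fully loaded node.

```python
def monthly_cost(avg_watts: float, price_per_kwh: float, days: float = 30.0) -> float:
    """Electricity cost of a machine drawing avg_watts around the clock."""
    kwh = avg_watts / 1000.0 * 24.0 * days
    return kwh * price_per_kwh


# 100 W average at 0.20 per kWh, as in the comment above.
print(f"100 W: {monthly_cost(100, 0.20):.2f} per month")

# 300 W average at the same assumed price.
print(f"300 W: {monthly_cost(300, 0.20):.2f} per month")
```

The cost scales linearly with both the average draw and the price, so the comparison is sensitive to local electricity prices and to how many hours the machine is actually on.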
  

shevy-java 6 hours ago

Would you consider improving the website's layout? Right now I find it below average quality and very distracting. Whether you are an engineer or not is not really important; great engineers can write horrible text or use a layout that is not ideal, for instance.