Comment by 2001zhaozhao
15 hours ago
There's a tradeoff between dense models and MoEs on memory usage vs. compute for the same quality.
For example, Qwen3.5 27B and Qwen3.5 122B A10B have similar average performance across benchmarks. The 122B is much faster to run than the 27B (generates more tokens at the same compute). The 27B, on the other hand, uses ~4x less VRAM at low context lengths (less difference at high context lengths).
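That tradeoff is easy to put numbers on. A minimal sketch, using illustrative parameter counts (27B dense vs. 122B total / ~10B active) and assuming bf16 weights with no KV cache counted:

```python
# Back-of-envelope: weight memory vs. per-token decode traffic.
# Model sizes are illustrative assumptions, not measured numbers.
BYTES_PER_PARAM = 2  # bf16/fp16

def weight_mem_gb(total_params_b):
    """VRAM just for the weights, in GB (ignores KV cache and activations)."""
    return total_params_b * BYTES_PER_PARAM

def decode_traffic_gb(active_params_b):
    """Bytes streamed per generated token ~= active weights read once."""
    return active_params_b * BYTES_PER_PARAM

dense = (weight_mem_gb(27), decode_traffic_gb(27))    # 27B dense
moe   = (weight_mem_gb(122), decode_traffic_gb(10))   # 122B total, ~10B active

print(f"dense 27B: {dense[0]} GB weights, {dense[1]} GB read per token")
print(f"MoE 122B : {moe[0]} GB weights, {moe[1]} GB read per token")
# The MoE needs ~4.5x the memory but moves ~2.7x less data per token,
# which is roughly the memory-vs-speed tradeoff described above.
```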
Right now, different hardware seems to be suited to different points on the dense vs. MoE balance. On one extreme is hardware like the DGX Spark and Strix Halo, which have a lot of memory relative to their compute performance and memory bandwidth, and are best suited for MoE workflows. On the other extreme you have cards like the RTX 5090, which have very high performance for the price but rather little memory, and are best suited for dense models.
The Arc Pro B70 seems to sit in the awkward middle. With 1-2 of these, you can run a ~30B dense model slowly, probably not fast enough to be useful interactively (you'd probably need a 5090 or 2x 3090 for that). Or you can run a MoE model at high throughput, but probably without enough quality to support agentic workflows that would actually use that throughput.
DGX Spark is at the compute level of 5070. Its main issue is low memory bandwidth, i.e. it has quite fast token prefill but awful token generation. Strix Halo is just slow on every metric and used to be a cheap way to get 128GB unified RAM (now its prices are comparable to DGX Spark).
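The "awful token generation" part falls straight out of a bandwidth-bound decode estimate. A rough sketch, assuming every generated token has to stream all active weights once, with bandwidth figures that are my assumptions (check vendor specs before trusting them):

```python
def decode_tok_s(bandwidth_gb_s, active_params_b, bytes_per_param=2):
    """Bandwidth-bound ceiling on decode speed:
    tokens/s ~= memory bandwidth / bytes of weights read per token."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# Assumed numbers: DGX Spark ~273 GB/s (LPDDR5X), RTX 5090 ~1792 GB/s (GDDR7).
for name, bw in [("DGX Spark", 273), ("RTX 5090 ", 1792)]:
    print(f"{name}: ~{decode_tok_s(bw, 27):.1f} tok/s ceiling on a 27B dense model")
```

Under these assumptions the Spark tops out at only a few tokens per second on a 27B dense model, while a 5090 has an order of magnitude more headroom, which matches the "fast prefill, slow generation" observation.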
I have one, and this isn't true. A 5070 draws about 300 watts; the entire Spark unit runs at 200 watts max. In reality it performs like an RTX 5060 with lots of VRAM. Very good for training, okay for inference if you are running batch jobs and don't mind waiting.
LLMs are memory bandwidth bound not compute bound.
This is incorrect, prompt processing is compute bound.
LLMs are bound by both; which factor dominates depends on the hardware.
This is only true for some parts of the time cost function.
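Both sides of this exchange are right, and arithmetic intensity shows why. A minimal sketch, ignoring attention FLOPs and KV-cache traffic, assuming ~2 FLOPs per parameter per token and weights read once per forward pass:

```python
def arithmetic_intensity(batch_tokens, params_b, bytes_per_param=2):
    """FLOPs per byte of weight traffic for one forward pass.
    ~2 FLOPs per param per token; weights are read once for the whole batch."""
    flops = 2 * params_b * 1e9 * batch_tokens
    bytes_moved = params_b * 1e9 * bytes_per_param
    return flops / bytes_moved

# Decode processes 1 new token per step; prefill processes the whole prompt at once.
print("decode :", arithmetic_intensity(1, 27), "FLOP/byte")
print("prefill:", arithmetic_intensity(2048, 27), "FLOP/byte")
```

Decode lands at ~1 FLOP/byte, far below the ridge point of any modern GPU, so it is bandwidth bound; prefill at thousands of FLOPs/byte sits well above it, so it is compute bound. Same model, different parts of the time cost function.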
I work mostly with image models, so we have a lot of fun with them, and the card fits perfectly here. Performance isn't great, but it can just chug along in the background.
I still don't see the point of running these models. I'd say they produce plausible garbage, nowhere near the quality of frontier models (when those work).
Why can't Intel look beyond this nonsense state of affairs and build something with 1TB of RAM or more?
What I am trying to say is that I have yet to see anything competitive in the market. Cards have very much stalled in the sub-100GB region, and the best corporations can do is throw out something that runs toy models and forget about it after a week.
What's wrong with Grace Hopper if you want to throw buckets of local memory at a problem?
Most consumer platforms only allow up to 128/256GB of RAM. If you want more, you likely need a data centre platform. This is again a mismatch between where companies think consumers are and the reality.
I think AMD, for example, missed the boat with the 9950x3d2 by limiting the memory controller. If it were possible to hook it up to 1TB of consumer DDR5 RAM, that would be something to write home about.
Some people, including myself, loathe Nvidia with the fiery burning passion of a thousand suns, and will put up with whatever nonsense is necessary to run without them.