Comment by Bluecobra

1 day ago

I didn’t realize that you can get 128GB of memory in a notebook, that is impressive!

I've got a 128 GiB unified memory Ryzen AI Max+ 395 (aka Strix Halo) laptop.

Trying to run LLMs somehow makes 128 GiB of memory feel incredibly tight. I frequently hit OOMs running models that push the limits of what this machine can fit; I need to leave more memory free for the system than I expected. I'd hoped to run models of up to ~100 GiB quantized, leaving 28 GiB for the system, but it turns out I need to leave more room for context and overhead. ~80 GiB quantized seems like a better ceiling, since this isn't a headless box: I'm running a desktop environment, browser, IDE, compilers, etc. alongside the model.
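
To see why context and overhead ate the budget, here's a back-of-the-envelope KV cache estimate; the layer/head/dimension numbers below are hypothetical, not taken from any particular model:

    #include <stdio.h>

    int main(void) {
        /* All numbers are hypothetical, shaped like a large GQA MoE model. */
        double layers = 60.0, kv_heads = 8.0, head_dim = 128.0;
        double context = 128.0 * 1024.0;   /* 128K-token context */
        double bytes_per_elem = 2.0;       /* fp16 KV cache */

        /* K and V per layer, per head, per position: hence the factor of 2. */
        double kv_bytes = 2.0 * layers * kv_heads * head_dim * context * bytes_per_elem;
        printf("KV cache: %.1f GiB\n", kv_bytes / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

With numbers in that ballpark this prints ~30 GiB, i.e. a long context alone can cost tens of GiB on top of the quantized weights.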

And the memory bandwidth limitation when running these models is real! 10B active parameters at 4-6 bit quants feels usable but slow; much more than that and it really starts to feel sluggish.
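
A crude sanity check on the decode-speed ceiling, treating generation as purely bandwidth-bound; the bandwidth figure is the commonly quoted one for Strix Halo, and the quant width is an assumption:

    #include <stdio.h>

    int main(void) {
        /* Decode is memory-bound: each generated token streams all active
         * weights through memory at least once, so this is a hard ceiling.
         * Real throughput is lower (KV cache reads, activations, overhead). */
        double bandwidth = 256e9;       /* bytes/s: 256-bit LPDDR5X-8000 */
        double active_params = 10e9;    /* 10B active parameters (MoE) */
        double bits_per_weight = 4.5;   /* ~4-bit quant with overhead */

        double bytes_per_token = active_params * bits_per_weight / 8.0;
        printf("ceiling: %.0f tok/s\n", bandwidth / bytes_per_token);
        return 0;
    }

That gives ~46 tok/s as an upper bound for 10B active; push the active parameters to 30B and the ceiling drops to ~15 tok/s before any real-world losses, which matches the "sluggish" feel.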

So this can fit models like Qwen3.5-122B-A10B, but it's not the speediest, and I had to use a smaller quant than expected. Qwen3-Coder-Next (80B/3B active) is fine on speed, though not quite as smart. I'm still trying out models: Nemotron-3-Super-120B-A12B just came out, but it looks like it'll be a bit slower than Qwen3.5 without being any more capable, though I do really like that they've been transparent in releasing most of its training data.

  • There's been recent work in some local AI frameworks on enabling mmap by default, which can sidestep some RAM-driven limitations, especially for sparse MoE models. Running with mmap and too little RAM still means severe slowdowns, since read-only model parameters have to be paged in from storage as they're needed; but with fast enough storage, and especially for models that "almost" fit in the page cache, it can be a huge unblock at negligible cost. It also potentially opens further headroom, like adding extra swap for the KV cache and long contexts (see the sketch below).
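
    The core of the mechanism, as a sketch (the file name is hypothetical; real frameworks map their own formats): map the weights read-only and let the kernel's page cache decide what stays resident. Clean, file-backed pages can be evicted under memory pressure and faulted back in from storage, so a model that "almost" fits degrades gradually instead of OOMing.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void) {
            /* Hypothetical file name for illustration. */
            int fd = open("model.gguf", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

            /* Read-only, file-backed mapping: pages stay clean, so the
             * kernel can drop them under pressure and re-read on demand. */
            void *w = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (w == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

            /* Sparse MoE touches experts unpredictably; hint the kernel
             * not to waste bandwidth on sequential readahead. */
            madvise(w, st.st_size, MADV_RANDOM);

            /* ... inference would read tensors straight out of the mapping ... */
            printf("mapped %lld bytes\n", (long long)st.st_size);

            munmap(w, st.st_size);
            close(fd);
            return 0;
        }

    The severe-slowdown case is exactly this mapping faulting in from disk instead of hitting the page cache; fast NVMe narrows that penalty.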

Most workstation-class laptops (e.g. Lenovo P-series, Dell Precision) have 4 DIMM slots, and you can get them with 256 GB (at least before the current RAM shortages).

There's also the Ryzen AI Max+ 395, which has 128 GB of unified memory in a laptop form factor.

Only Apple has the unique dynamic allocation though.

  • Yep, I have a 13" gaming tablet with the 128 GB AMD Strix Halo chip (Ryzen AI Max+ 395, what a name). Asus ROG Flow Z13. It's a beast; the performance is totally disproportionate to its size & form factor.

    I'm not sure what exactly you're referring to with "Only Apple has the unique dynamic allocation though." On Strix Halo you set the fixed VRAM size to 512 MB in the BIOS, and you set a few Linux kernel params that enable dynamic allocation to whatever limit you want (I'm using 110 GB max at the moment). LLMs can use up to that much when loaded, but it's shared fully dynamically with regular RAM and is instantly available for regular system use when you unload the LLM.
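
    For concreteness, the "few kernel params" approach generally means raising the amdgpu/ttm GTT limits on the kernel command line. A sketch for a ~110 GiB cap, assuming a recent kernel and 4 KiB pages (110 GiB ≈ 28,835,840 pages); parameter names vary across kernel versions, so treat it as illustrative rather than a recipe:

        ttm.pages_limit=28835840 ttm.page_pool_size=28835840

    On older kernels the rough equivalent was amdgpu.gttsize (in MiB). Either way it's boot-time configuration rather than a per-process setting.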

  • > Only Apple has the unique dynamic allocation though.

    What do you mean? On Linux I can dynamically allocate memory between CPU and GPU. You just have to set a few kernel parameters for the maximum allowable GPU allocation, and set the BIOS to the minimum amount of dedicated graphics memory.

    • Maybe things have changed, but the last time I looked at this, it was a max of 96 GB to the GPU. And it isn't dynamic in the sense that you still have to tweak kernel parameters, which requires a reboot.

      Apple has none of this.

  • Intel has had dynamic allocation since the Intel 830 (2001) for Pentium III Mobile. Everything always has, especially platforms with iGPUs, like the Xbox 360.

    Only Apple and AMD have APUs with a relatively fast iGPU, which becomes relevant in large local LLM (>7B) use cases.