
Comment by fouc

1 day ago

>Starting with 4 virtual cores and 8 GB vRAM, where the VM ran perfectly briskly with around 5 GB of memory used, I stepped down to 3 cores and 6 GB, to discover that memory usage fell to 3.9 GB and everything worked well. With just 2 cores and 4 GB of memory only 3.1 GB of that was used, and the VM continued to handle those lightweight tasks normally.

Good reminder that there's a certain amount of memory tied up with each core (probably mainly page cache and concurrency handling etc).

As a general rule, the amount of physical memory installed in a computer should also be proportional to the number of hardware threads provided by its CPU.

Besides the memory the operating system may allocate for each thread, a multi-threaded application that can use all available threads (for instance, the compilation of a big software project) will frequently allocate working memory in an amount proportional to the number of worker threads.

I have encountered many multi-threaded applications that need up to 2 GB per thread to work well.

This corresponds to having 64 GB for a desktop CPU with 32 threads, like the Ryzen 9 9950X.

For the compilation example, I have seen software projects, like Chrome/Chromium and its derivatives, where if you do not have enough memory in proportion to the number of hardware threads (e.g. only 32 GB for a 16-core/32-thread CPU), you must reduce the number of concurrent compilations with an appropriate parameter to "make -j", leaving some threads and cores idle, because otherwise you may encounter out-of-memory errors.
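A minimal sketch of that workaround, assuming the rough "2 GB per compile job" figure from above (the numbers here are illustrative, not measured for any particular project):

```shell
#!/bin/sh
# Sketch: pick a `make -j` value capped by installed RAM, assuming
# roughly 2 GB of working memory per concurrent compile job.
mem_gb=32      # installed RAM in GB (illustrative; query the OS in practice)
threads=32     # hardware thread count (e.g. from `nproc`)

jobs=$(( mem_gb / 2 ))                            # one job per ~2 GB of RAM
if [ "$jobs" -gt "$threads" ]; then jobs=$threads; fi  # never exceed threads
if [ "$jobs" -lt 1 ]; then jobs=1; fi             # always run at least one job

echo "make -j$jobs"    # with 32 GB and 32 threads: make -j16
```

With 32 GB and 32 threads this halves the parallelism to 16 jobs, which is exactly the "leave some cores idle" trade-off described above.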

  • > when you have only 32 GB for a 16 core/32 thread CPU, you must reduce the number of concurrent compilations

    Also, depending on the architecture, avoiding odd (or even) virtual cores might free more L2 or L3 for the worker threads and speed up the process.

  • Compiling flash-attn (Flash Attention) is another great stress test for CPU+RAM, as just using 16 threads can already balloon you into 128GB RAM usage territory. Same thing with needing to limit concurrency when compiling it.

    • I have this problem with NixOS, as one of my build servers doesn't have enough RAM. There doesn't seem to be a way to know whether a compilation is likely to be RAM-heavy and then either use a tagged server with more RAM or use fewer threads on servers with less.

  • It's an important point. I went from 4c/8t and 32GB to 16c/32t and 96GB: dramatically less memory per thread. Some software (looking at you, Vivado) can take incredible amounts of memory per parallel job, meaning some projects can only run with a subset of my cores. At least until I stepped my work laptop up to 10.66 GB/thread. That seems to be manageable.

  • Yes! I have also observed that with compilation VMs on a big server.

I'd bet on the null hypothesis: the memory-behaviour changes would hold if the core count were kept constant and only the VM's memory size were adjusted.

  • Agreed. This is the OS adapting to available memory.

    Similarly if you started with 4GB and there was 900MB available for user apps, I expect you could launch apps that consume 1500MB just fine; the OS is leaving enough to launch anything, and making use of unused memory for cache/etc.

  • There is a per-cpu data structure in the xnu kernel, but it is not big enough to tilt the scales when you are talking about RAM in units of gigabytes.

    • It’s not just the kernel. I wouldn’t be surprised if there’s a fair few userspace services spawning a thread per core.

There is some overhead per-core, you're right, but imo this reduction in usage is likely from how the kernel allocates available memory, which is being reduced as well. The kernel will keep read caches around longer with more memory, it'll prefer to compress memory instead of swap to disk if it has more, it'll purge/cleanup reclaimable memory less often with more memory, etc. It even scales its internal buffer sizes and vnode tables depending on total memory.

All good things imo: it dynamically makes the most of what is available, at the expense of making it harder to see a true baseline of the hard minimum required to operate.

Fun things to check: `vm_stat`

    $ vm_stat
    Mach Virtual Memory Statistics: (page size of 4096 bytes)
    Pages free:                              230295.
    Pages active:                           1206857.
    Pages inactive:                         1206361.
    Pages speculative:                        31863.
    Pages throttled:                              0.
    Pages wired down:                        470093.
    Pages purgeable:                          18894.
    "Translation faults":                  21635255.
    Pages copy-on-write:                    1590349.
    Pages zero filled:                     11093310.
    Pages reactivated:                        15580.
    Pages purged:                             50928.
    File-backed pages:                       689378.
    Anonymous pages:                        1755703.
    Pages stored in compressor:                   0.
    Pages occupied by compressor:                 0.
    Decompressions:                               0.
    Compressions:                                 0.
    Pageins:                                 832529.
    Pageouts:                                   225.
    Swapins:                                      0.
    Swapouts:                                     0.
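To turn those page counts into something readable, here is a small awk sketch that multiplies by the 4096-byte page size reported in the header (the here-doc reuses a few sample lines from the output above; on a real Mac you would pipe `vm_stat` in directly):

```shell
# Sketch: convert vm_stat page counts to GB, given the 4096-byte
# page size from the header line. Sample values copied from above.
cat <<'EOF' | awk -F': *' '/^Pages/ { gsub(/\./, "", $2); printf "%-18s %5.2f GB\n", $1, $2 * 4096 / 1024 / 1024 / 1024 }'
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free: 230295.
Pages active: 1206857.
Pages wired down: 470093.
EOF
```

With the sample numbers, free works out to about 0.88 GB, active to about 4.60 GB, and wired to about 1.79 GB.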

edit: no code fence markdown support or am I doing something wrong?

  • Single inline backticks like `this` aren't recognized (although still useful in my opinion, they just don't change the rendering).

    Triple backticks also aren't recognized. However, if you indent by I believe 4 spaces, it formats it in a fixed width font presuming it's code.

    Let's try (4 spaces):

        func main() {
            fmt.Println("Hello, HN!")
        }
    

    None for comparison:

    func main() { fmt.Println("Hello, HN!") }

    • Seems I missed the window to be able to edit my message, but I'll remember this info for next time, thanks!