Comment by crest
5 days ago
One problem with your setup is that ZFS by design can't use a traditional *nix filesystem buffer cache. Instead it has to use its own ARC (adaptive replacement cache) with end-to-end checksumming, transparent compression, and copy-on-write semantics. This can lead to annoying performance problems when the two types of filesystem cache compete for available memory. There is a back-pressure mechanism, but it effectively pauses other writes while evicting dirty cache entries to release memory.
Traditionally, the page cache sits above the FS and the buffer cache below it, and UNIX filesystems unify the two so that data is not cached twice.
ZFS goes out of its way to avoid the buffer cache, although Linux does not let it opt out entirely, since the block layer still buffers reads done by userland from the disks underneath ZFS. That is why ZFS began purging the buffer cache on every flush 11 years ago:
https://github.com/openzfs/zfs/commit/cecb7487fc8eea3508c3b6...
That is how it still works today:
https://github.com/openzfs/zfs/blob/fe44c5ae27993a8ff53f4cef...
If I recall correctly, the page cache is also still above ZFS when mmap() is used. There was talk about fixing it by having mmap() work out of ARC instead, but I don’t believe it was ever done, so there is technically double caching done there.
What's the best way to deal with this, then? Disable the Linux file cache? I've tried disabling or minimizing the ARC in the past to avoid the OOM reaper, but the ARC was stubborn and its RAM usage stayed the same.
These days, ZFS frees memory fast enough when Linux asks for it that you generally do not see OOM because of ZFS. If you have a workload where it is not fast enough, you can limit the maximum ARC size to try to help:
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
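For anyone who wants to try that, here is a minimal sketch of how the limit could be written down. zfs_arc_max takes a value in bytes. The helper names are made up for illustration, and the 3 GiB figure is only an example, not a recommendation.

```python
GIB = 1024 ** 3  # assumption: "GB" in this thread is read as GiB


def arc_max_bytes(gib: float) -> int:
    """Convert a size in GiB to the byte count that zfs_arc_max expects."""
    return int(gib * GIB)


def modprobe_line(limit_bytes: int) -> str:
    """Build an options line for a modprobe config file such as
    /etc/modprobe.d/zfs.conf (hypothetical helper, not part of ZFS)."""
    return f"options zfs zfs_arc_max={limit_bytes}"


if __name__ == "__main__":
    limit = arc_max_bytes(3)
    # Persistent setting, applied the next time the zfs module is loaded:
    print(modprobe_line(limit))
    # The same number can be written as root to
    # /sys/module/zfs/parameters/zfs_arc_max to change it at runtime.
    print(limit)
```

Whether the runtime change takes effect immediately may depend on the OpenZFS version, so it is worth checking the tuning docs linked above for your release.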
I didn't have any trouble limiting zfs_arc_max to 3GB on one system where I felt that it was important. I ran it that way for a fair number of years and it always stayed close to that bound (if it was ever exceeded, it wasn't by a noteworthy amount at any time when I was looking).
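If you would rather check that than eyeball it, a rough sketch follows. It assumes the usual Linux kstat file at /proc/spl/kstat/zfs/arcstats, where each data row is "name type value" and the "size" and "c_max" rows hold the current ARC size and its upper bound. The function names are my own.

```python
from pathlib import Path

ARCSTATS = Path("/proc/spl/kstat/zfs/arcstats")


def parse_arcstats(text: str) -> dict:
    """Parse kstat text into {name: value}, skipping the two header lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields and a numeric last field.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_headroom(stats: dict) -> int:
    """Bytes left before the ARC reaches its bound (negative if exceeded)."""
    return stats["c_max"] - stats["size"]


if __name__ == "__main__":
    if ARCSTATS.exists():
        current = parse_arcstats(ARCSTATS.read_text())
        print(f"size={current['size']} c_max={current['c_max']} "
              f"headroom={arc_headroom(current)}")
    else:
        print("arcstats not found; is the zfs module loaded?")
```

Running this periodically, for example from cron, would show whether the ARC ever overshoots the bound by a noteworthy amount.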
At the time, I set it up this way because I was afraid of OOM events causing (at least) unexpected weirdness.
A few months ago I discovered weird issues with a fairly big, persistent L2ARC being ignored at boot due to insufficient ARC. So I stopped arbitrarily limiting zfs_arc_max and just let it do its default self-managed thing.
So far, no issues. For me. With my workload.
Are you having issues with this, or is it a theoretical problem?