Comment by Svetlitski
6 months ago
I understand the decision to archive the upstream repo; as of when I left Meta, we (i.e. the Jemalloc team) weren’t really in a great place to respond to all the random GitHub issues people would file (my favorite was the time someone filed an issue because our test suite didn’t pass on Itanium lol). Still, it makes me sad to see. Jemalloc is still IMO the best-performing general-purpose malloc implementation that’s easily usable; TCMalloc is great, but is an absolute nightmare to use if you’re not using bazel (this has become slightly less true now that bazel 7.4.0 added cc_static_library so at least you can somewhat easily export a static library, but broadly speaking the point still stands).
I’ve been meaning to ask Qi if he’d be open to cutting a final 6.0 release on the repo before re-archiving.
At the same time it’d be nice to modernize the default settings for the final release. Disabling the (somewhat confusingly backwardly-named) “cache oblivious” setting by default so that the 16 KiB size-class isn’t bloated to 20 KiB would be a major improvement. This isn’t to disparage your (i.e. Jason’s) original choice here; IIRC when I last talked to Qi and David about this they made the point that at the time you chose this default, typical TLB associativity was much lower than it is now. On a similar note, increasing the default “page size” from 4 KiB to something larger (probably 16 KiB), which would correspondingly increase the large size-class cutoff (i.e. the point at which the allocator switches from placing multiple allocations onto a slab, to backing individual allocations with their own extent directly) from 16 KiB up to 64 KiB would be pretty impactful. One of the last things I looked at before leaving Meta was making this change internally for major services, as it was worth a several percent CPU improvement (at the cost of a minor increase in RAM usage due to increased fragmentation). There’s a few other things I’d tweak (e.g. switching the default setting of metadata_thp from “disabled” to “auto”, changing the extent-sizing for slabs from using the nearest exact multiple of the page size that fits the size-class to instead allowing ~1% guaranteed wasted space in exchange for reducing fragmentation), but the aforementioned settings are the biggest ones.
That was me that filed the Itanium test suite failure. :)
Ah, porting to HP Superdome servers. It’s like being handed a brochure describing the intricate details of the iceberg the ship you just boarded is about to hit in a few days.
A fellow traveler, ahoy!
I worked on the Superdome servers back in the day. What a weird product. I still can't believe it was a profitable division (at my time circa 2011).
HP was going through some turbulent waters in those days.
1 reply →
The Itanic was kind of great :). I'm convinced it helped sink SGI.
Itanium did its most most important job: it killed everything but ARM and POWER.
Sunk by the Great Itanic ?
Why was the sinking of SGI great?
3 replies →
SGI and HP! Intel should have a statue of Rick Belluzzo on they’r campus.
one of the best books on Linux architecture i've read was the one on the Itanium port
i think, because Itanic broke a ton of assumptions
This sounds interesting. Can you share the title of the book?
Stuff like this is what keeps me coming back here. Thanks for posting this!
What's hard about using TCMalloc if you're not using bazel? (Not asking to imply that it's not, but because I'm genuinely curious.)
It’s just a huge pain to build and link against. Before the bazel 7.4.0 change your options were basically:
1. Use it as a dynamically linked library. This is not great because you’re taking at a minimum the performance hit of going through the PLT for every call. The forfeited performance is even larger if you compare against statically linking with LTO (i.e. so that you can inline calls to malloc, get the benefit of FDO , etc.). Not to mention all the deployment headaches associated with shared libraries.
2. Painfully manually create a static library. I’ve done this, it’s awful; especially if you want to go the extra mile to capture as much performance as possible and at least get partial LTO (i.e. of TCMalloc independent of your application code, compiling all of TCMalloc’s compilation units together to create a single object file).
When I was at Meta I imported TCMalloc to benchmark against (to highlight areas where we could do better in Jemalloc) by pain-stakingly hand-translating its bazel BUILD files to buck2 because there was legitimately no better option.
As a consequence of being so hard to use outside of Google, TCMalloc has many more unexpected (sometimes problematic) behaviors than Jemalloc when used as a general purpose allocator in other environments (e.g. it basically assumes that you are using a certain set of Linux configuration options [1] and behaves rather poorly if you’re not)
[1] https://google.github.io/tcmalloc/tuning.html#system-level-o...
Thanks for sharing the insight!
As I observed when I was at Google: tcmalloc wasn't a dedicated team but a project driven by server performance optimization engineers aiming to improve performance of important internal servers. Extracting it to github.com/google/tcmalloc was complex due to intricate dependencies (https://abseil.io/blog/20200212-tcmalloc ). As internal performance priorities demanded more focus, less time was available for maintaining the CMake build system. Maintaining the repo could at best be described as a community contribution activity.
> Meta’s needs stopped aligning well with those of external uses some time ago, and they are better off doing their own thing.
I think Google's diverged from the external uses even long ago:) (For a long time google3 and gperftools's tcmalloc implementations were so different.)
Everything from Google is an absolute pain to work with unless you're in Google using their systems, FWIW. Anything from the Chromium project is deeply intangled with everything else from the Chromium project as part of one gigantic Chromium source tree with all dependencies and toolchains vendored. They do not care about ABI what so ever, to the point that a lot of Google libraries change their public ABI based on whether address sanitizer is enabled or not, meaning you can't enable ASAN for your code if you use pre-built (e.g package manager provided) versions of their code. Their libraries also tend to break if you link against them from a project with RTTI enabled, a compiler set to a slightly different compiler version, or any number of other minute differences that most other developers don't let affect their ABI.
And if you try to build their libraries from source, that involves downloading tens of gigabytes of sysroots and toolchains and vendored dependencies.
Oh and you probably don't want multiple versions of a library in your binary, so be prepared to use Google's (probably outdated) version of whatever libraries they vendor.
And they make no effort what so ever to distinguish between public header files and their source code, so if you wanna package up their libraries, be prepared to make scripts to extract the headers you need (including headers from vendored dependencies), you can't just copy all of some 'include/' folder.
And their public headers tend to do idiotic stuff like `#include "base/pc.h"`, where that `"base/pc.h"` path is not relative to the file doing the include. So you're gonna have to pollute the include namespace. Make sure not to step on their toes! There's a lot of them.
I have had the misfortune of working with Abseill, their WebRTC library, their gRPC library and their protobuf library, and it's all terrible. For personal projects where I don't have a very, very good reason to use Google code, I try to avoid it like the plague. For professional projects where I've had to use libwebrtc, the only reasonable approach is to silo off libwebrtc into its own binary which only deals with WebRTC, typically with a line-delimited JSON protocol on stdin/stdout. For things like protobuf/gRPC where that hasn't been possible, you just have to live with the suffering.
..This comment should probably have been a blog post.
24 replies →
Wow. That does sound quite unpleasant.
Thanks again. This is far outside my regular work, but it fascinates me.
I’ve successfully used LLMs to migrate Makefiles to bazel, more or less. I’ve not tried the reverse but suspect (2) isn’t so bad these days. YMMV, of course, but food for thought
6 replies →
I would love to see these changes - or even some sort of blog post or extended documentation explaining rational. As is the docs are somewhat barren. I feel that there’s a lot of knowledge that folks like you have right now from all of the work that was done internally at Meta that would be best shared now before it is lost.
Do you have any opinions on mimalloc?
> filed an issue because our test suite didn’t pass on Itanium lol
For the non low-level programmers in the bowels of memory allocators among us, why is this a "lol"?
The Itanium ISA was an infamous failure, never seeing widespread usage, hence people often referring to it as “The Itanic” (i.e. the much-touted ship that immediately sunk). The fact that anyone would be using it today at all is sort of hilariously niche, and is illustrative of how wide-ranging and obscure the issues filed to the GitHub repo could be. On a similar token I recall seeing an issue (or maybe it was a PR?) to fix our build on GNU Herd.
> we (i.e. the Jemalloc team) weren’t really in a great place to respond to all the random GitHub issues people would file
Why not? I mean this is complete drive-by comment, so please correct me, but there was a fully staffed team at Meta that maintained it, but was not in the best place to manage the issues?
Well, to be blunt, the company does not care about this, so it does not get done.
They said the team was not in a great place to do it, eg. they probably had competing priorities that overshadowed triaging issues.
> TCMalloc is great, but is an absolute nightmare to use if you’re not using bazel
custom-malloc-newbie question: Why is the choice of build system (generator) significant when evaluating the usability of a library?
Because you need to build it to use it, and you likely already have significant build related infrastructure, and you are going to need to integrate any new dependencies into that. I'm increasingly convinced that the various build systems are elaborate and wildly successful ploys intended only to sap developer time and energy.
Because you have to build it. If they don't use the same build system as you, you either want to invoke their system, or import it into yours. The former is unappealing if it's 'heavy' or doesn't play well as a subprocess; the latter can take a lot of time if the build process you're replicating is complex.
I've done both before, and seen libraries at various levels of complexity; there is definitely a point where you just want to give up and not use the thing when it's very complex.
This. When step one is "install our weird build system," I'll immediately look for something else that meets my needs. All build systems suck, so everyone thinks they can write a better one, and too many people try. Pretty soon you end up having to learn a majority of this (https://en.wikipedia.org/wiki/List_of_build_automation_softw...) to get your code to compile.
3 replies →
It's kind of wild that great software is hindered by a complicated build and integration process.