The Garbage Collection Handbook, 2nd Edition

3 years ago (routledge.com)

On a related note, I found ART's implementation of concurrent copying and compacting GC to be pretty novel. Here's a nice write-up detailing how handling page faults in userspace comes in handy to accomplish that: https://www.tdcommons.org/cgi/viewcontent.cgi?article=4751&c... (pdf) / https://web.archive.org/web/20230216233459/https://www.tdcom...

For context, here's a brief overview of the evolution of the Android Runtime Garbage Collector: https://archive.is/Ue6Pj

  • Certainly interesting, but there are no performance numbers in the white paper comparing userfaultfd to read barriers. So the actual benefit of switching to userfaultfd is unknown (at least in publicly accessible documents -- I'm sure Google has done internal performance evaluations).

    Using page faults (and/or page protection) to perform compaction instead of barriers is a pretty old technique (see the 1988 paper by Appel, Ellis, and Li [1]; see also the Compressor by Kermany and Petrank [2]). But handling page faults was very expensive (at least on Linux) until the recent addition of userfaultfd.

    [1]: https://dl.acm.org/doi/10.1145/960116.53992 [2]: https://dl.acm.org/doi/10.1145/1133255.1134023

    • What is the actual perf benefit of userfaultfd (e.g. for write protection)? It sounds interesting, but it's unclear to me why it would be any faster than a signal handler. Is it just a simpler code path within the kernel? Or is it that the hardware is configured to jump directly to a user-space handler without any kernel intervention?

20 years ago, in the programming languages course at university, we started learning about OOP using Java and Ada as examples. When the professor started describing Java, a fellow student interrupted him to inform the class that "Java has garbage collection" (and to boast of his special knowledge).

After that incident his nickname was "the garbage collector"!

Met Richard Jones, one of the authors of the book, at its original launch. Very nice guy. Bought the original book and did some heavy reading on it – if you get this book, be prepared to put time into it, it's readable but it's not one you can do a quick skim of – and I really profited from it. It should be required reading for all programmers.

(To be clear, I'm referring to the very first version, I haven't read the subsequent versions but given the quality of the first I'd be very surprised if they were any less good).

Edit: https://www.cs.kent.ac.uk/people/staff/rej/ - a jumpoff page for his stuff.

Edit2: https://www.cs.kent.ac.uk/people/staff/rej/gcbib/ - "[This] online bibliographic database includes nearly 3,000 garbage collection-related publications. It contains abstracts for some entries and URLs or DOIs for most of the electronically available ones, and is continually being updated. The database can be searched online or downloaded as BibTeX, PostScript, or PDF." Welcome to the ultimate rabbit hole I guess.

I don't know if it is included in the new edition of the book, but in case anyone is interested in a modern, highly efficient RC implementation that does not rely on deferring the reference count updates (which kills one of the advantages of RC in the first place), check out the Perseus paper. Koka (which uses Perseus) is quite competitive with OCaml and Haskell.

Just search for "Perseus reference counting" and you'll find it. It uses linear logic to insert explicit "dup/drop" operations and then merges and coalesces them.

If you need code to understand garbage collection, the walkthrough of a garbage collector with C code at http://maplant.com/gc.html is really helpful.

I tweaked it to work on amd64 and started adding register scanning based on what eatonphil's discord people told me to do.

https://github.com/samsquire/garbage-collector

It's not fit for any purpose but more of a learning exercise.

  • Bob Nystrom (of Game Programming Patterns, Crafting Interpreters, and dartfmt fame) also wrote a tutorial implementation[1] of a precise tracing GC, as opposed to a conservative one.

    Regarding register scanning in a conservative GC, Andreas Kling has made (or at least quoted) the amusing observation[2] that your C runtime already has a primitive to dump all callee-save registers to memory: setjmp(). So all you have to do to scan both registers and stack is to put a jmp_buf onto the stack, setjmp() to it, then scan the stack normally starting from its address.

    [1] https://journal.stuffwithstuff.com/2013/12/08/babys-first-ga...

    [2] https://youtu.be/IzB6iTeo8kk

    • Glibc mangles some pointer registers in setjmp(). It XORs a per-process value with the stack pointer, frame pointer, and instruction pointer stored in the jmp_buf, on all (or nearly all) architectures.

      Although losing the stack and instruction pointers is unlikely to be a problem for the GC context, the frame pointer register need not contain a frame pointer value. It can be an arbitrary program value depending on compile options. That's something to watch out for with this GC technique.

    • Implementations are unfortunately allowed to do whatever they want to that jmp_buf; they could XOR the contents for all you know. Hopefully no implementation does something silly like that.

  • Side topic, but the maplant website design is great.

    • Besides the code blocks, it has 0 CSS. So what you're actually complimenting is your browser's default styles for HTML.

What I really want out of a garbage collector is a "Collect" function with a deadline. Pick a max time it's allowed to run before stopping and returning to the program.

  • Real-time GCs exist, such as IBM's Metronome. Though I'll be honest and say I haven't heard of many real-time GCs other than that one. Certainly many new GCs have reduced pause times dramatically, but that's orthogonal to real-time GC (as you can make pause times infinitesimally small yet still not let the mutator actually progress).

  • I've achieved GC timing that is good enough for real-time competitive game hosting using .NET 6+. The worst case over 4 hours of load testing was an 8 ms spike; the average was 1-2 ms.

    The magic trick is to intentionally collect as often as reasonably possible (i.e. at batch/frame/tick processing boundaries) and avoid using sophisticated GC schemes that involve multiple threads or asynchrony.

    Oh, and obviously you need to minimize allocations throughout or it won't matter.

  • Nim has exactly that with the GC_step proc.

    https://nim-lang.org/1.4.0/gc.html

    However, recent and future versions (2.0) are moving towards a different approach that is also applicable to deterministic real-time systems: ARC, which is basically alloc and free calls inserted automatically by the compiler using static analysis (no "runtime").

  • That is necessary but often not sufficient. It still leaves the problem of moving time from where garbage is created to the bulk 'collect' step. I've run collect every frame to keep stalls down, but the time always creeps up and up and you end up doing forensic analysis of which code is making the garbage in the first place.

  • Not really possible unless you have a real-time OS. Virtual memory and concurrency mean there is really no way to bound how long anything will take.

Great that there's a new edition of this coming. This is the definitive reference if you work with garbage collectors, it's also very well written.

Note that (the first edition of) this book considers reference counting, including variations like deferred reference counting, to be garbage collection (though not tracing garbage collection), and reviews it as well.

  • Just like most relevant computer science sources do; only Joe and Jane on the street without a CS background think otherwise.

Anyone know what's new in this edition? Both the publisher's site and the author's site are light on details.

Great that it is getting an update; it is one of the most relevant sources of information on GC algorithms.

Wow, this really is a new edition that will be available in 2023. The old (2012) edition is excellent. I don't have the impression that a whole lot has changed since then, but an update with all the latest refinements has to be worthwhile to implementers.

Who would usually need this book?

  • The first edition is a definitive reference for anyone writing a garbage collector (and by extension, exploring writing a programming language). Also people who are interested in how GCs work.

    Probably less so for people trying to optimize a program to work with an existing GC (eg. tweaking the JVM), but I suppose knowing the basic principles can help.

  • I’ve written two garbage collectors in my time, neither of which were in a programming language.

    One was for an in-memory cache of data relationships. Another was to clean up soft references within an RDF graph. Neither was, nor needed to be, particularly sophisticated.

    The cache used a compacting collector; the RDF one was mostly a "connectedness" test, pruning nodes that had fallen out of the graph.

    Recall that malloc has a simple garbage collector for its free space, and arguably what ranks a modern malloc implementation's sophistication is how it manages that free space.

    In the end detritus must be identified and resources reclaimed. So you see how GC like systems can occur in divergent areas of work.

  • This book (the 1st edition) gave me exactly what I needed when writing the garbage collector for my programming language.

    Besides the great technical content, I found it to be a very enjoyable, readable book.

  • I read the mark-and-sweep GC parts to gain some deeper understanding about Go's garbage collector. Probably the best resource for that kind of stuff other than reading actual code.

I feel the need for garbage collection is a language-design mis-feature. That is to say, producing garbage is a language-design mis-feature. To quote Bjarne Stroustrup:

> I don't like garbage. I don't like littering. My ideal is to eliminate the need for a garbage collector by not producing any garbage. That is now possible.

and it's indeed possible. For example, it's become pretty much a non-issue in modern C++: https://stackoverflow.com/a/48046118/1593077 (and C++ is not the only example; it's just a prominent example of a language which almost standardized garbage collection but eventually did not go that way).

  • C++ can actually produce quite a bit of garbage unintentionally; it's why linters will often remind you to call std::move.

    That said I much prefer deterministic resource cleanup even in a janky language like C++ over a tracing GC.

    • It's true that C++ can trigger a lot of unnecessary copying if one writes carelessly; but that's not the same as garbage, in that both copies are used, and neither continues living indefinitely without special intervention. But point taken about pass-by-value deficiencies.

  • You can't avoid garbage if you deal with generic programs -- Rust and C++ can only implement a subset of all programs without RC (which is the most basic GC algorithm).

    The same goes for many other interesting CS properties -- most of them are undecidable at compile time, so you have to handle them at runtime.

  • More and more people nowadays are programming at high levels of abstraction. If you're designing the frontend of a website, or making a mobile game, or developing a stock trading algorithm, or whatever else, then you probably don't want to constantly worry about details of memory management...

    On top of this, GC is necessary for some algorithms. Any data structure with partial sharing (e.g. binary search tree with versioning via path-copying) needs GC to be space-efficient. You could either rely on a built-in GC, or write your own. If you write your own, I think you'll find that it is tedious and error-prone due to memory safety issues.

    • We could already be doing that during the 1980's had it not been for UNIX getting out of Bell Labs with source tapes, and bad management decisions.

      https://interlisp.org/

      http://toastytech.com/guis/cedar.html

      "Eric Bier Demonstrates Cedar"

      https://www.youtube.com/watch?v=z_dt7NG38V4&t=2s

      "Making Smalltalk"

      https://www.youtube.com/watch?v=PaOMiNku1_M

      http://www.edm2.com/index.php/VisualAge_Smalltalk

      Also of note: most BASIC implementations had support for automatic memory management, at least for strings and arrays, and the structured compiled dialects did even better.

      Also database programming with languages like Clipper and FoxPro.

      And even if we stay in the UNIX world, that is exactly what using stuff like Perl allowed for: C-like programming without the headaches of manual memory management.

      Or the brief fad of 4GL languages.

    • > then you probably don't want to constantly worry about details of memory management...

      Oh, you actually don't have to; that's the whole point... in the past, you (effectively) had a choice between careful manual memory management and garbage collection with its overheads. These days, you can use constructs which take care of that management for you, with very little or no overhead, and which don't involve garbage.

      It's true that sometimes GC is algorithmically necessary; but then the high-level-of-abstraction argument is irrelevant. And in those cases, a GC library does indeed come in useful. You don't need to write your own, others have likely done it already.

Cool stuff! The last 10 years have seen lots of great languages pushing the field so I'm excited to see what's in the book =)

Bizarre that you can't preorder it yet; you have to wait until June just to preorder.

Will there be a PDF / epub option?

  • If I'm not mistaken, Routledge uses that VitalSource dreck, so you can only read their ebooks via a dedicated app. I don't buy dead-tree books any more, but I'll make an exception for this one.

    • The previous edition of this book is available on e.g. Apple's bookstore and Kindle. Unfortunately, the new one does not show an eBook option yet. I very much hope that one will be available soon, instead of them holding back the eBook to goose hardcover sales or some other reason.

Rust has left the building

  • Rust also uses reference counting, probably the worst sort of garbage collection.

    • Only when used in a naïve way, which Rust does not do. For example, increments and decrements happen only when "clone" is called and at scope exit, respectively, and thanks to Rust's ownership/borrow checking they are rarely needed, combining the best of both worlds (implementations that aggressively increment/decrement in loops and on every function call can be very slow). Rust also separates Arc (atomic refs) from Rc (non-atomic refs) and enforces the usage scenarios in the type checker, giving you cheap Rc in single-threaded scenarios. Reference counting done in a smart way works pretty well, but you obviously have to be a little careful about cycles (which in my experience are pretty rare and fairly obvious when you have such a data type).

Do we need this book now that we have Rust??