Comment by vlovich123

3 years ago

Tracing garbage collectors don’t generally win against reference counting collectors, especially when those reference counts are automatically elided via ARC (e.g. Swift and Objective-C) or are rarely needed in the first place because of value composition (C++ and Rust). Additionally, different allocation strategies are better depending on the use case (e.g. a pool allocator that you bulk-drop at the end of some computation).

What papers are you referencing that show tracing GCs outperforming the alternatives? If it’s just the website, I think it’s an artifact of a micro-benchmark rather than something that holds true for nontrivial programs.
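
The bulk-drop pool strategy mentioned above can be sketched as a tiny arena. This is a hypothetical minimal example in Rust (real arena crates like bumpalo are far more sophisticated); the point is just that allocations are freed all at once when the arena is dropped, with no per-object bookkeeping:

```rust
// A minimal arena: values are pushed in during a computation and all
// freed in one shot when the arena goes out of scope.
struct Arena<T> {
    items: Vec<T>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { items: Vec::new() }
    }

    // Store a value and hand back its index; no per-value free needed.
    fn alloc(&mut self, value: T) -> usize {
        self.items.push(value);
        self.items.len() - 1
    }

    fn get(&self, idx: usize) -> &T {
        &self.items[idx]
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.alloc(String::from("temporary"));
    let b = arena.alloc(String::from("results"));
    assert_eq!(arena.get(a), "temporary");
    assert_eq!(arena.get(b), "results");
    // `arena` drops here: one bulk deallocation, not N individual frees.
}
```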

I've always heard the "Swift elides reference counts" claim, but I've never seen it substantiated. I don't claim to be a Swift GC expert by any means, but the impression I get from the two Swift GC papers I've read [1, 2] is that Swift has a very simple implementation of RC. The RC optimization document (though incomplete) [3] also doesn't give me the impression that Swift is doing much eliding of reference counts (I'm sure it does for simple cases).

Do you have any links which might explain what kind of eliding Swift is doing?

EDIT: The major RC optimizations I have seen which elide reference-count operations are deferral and coalescing, and I'm fairly certain that Swift does neither.

[1]: https://dl.acm.org/doi/abs/10.1145/3170472.3133843

[2]: https://doi.org/10.1145/3243176.3243195

[3]: https://github.com/apple/swift/blob/main/docs/ARCOptimizatio...

  • Swift's compiler elides obviously useless RC operations at compile time. It doesn't do anything at runtime though.

    • That is correct. Like Objective-C, the compiler statically removes ARC operations when it can prove they aren’t needed (or when you annotate the call because its source is manually managed and gives you ownership of the reference). So that works within a function, or across functions with LTO (although it looks like Swift doesn’t yet have LTO [1], so I’m not sure about cross-module optimizations).

      > When receiving a return result from such a function or method, ARC releases the value at the end of the full-expression it is contained within, subject to the usual optimizations for local values.

      Quote from [2] (section 6 has the full details about optimizations). I believe [3] might be the compiler pass.

      There actually is some elision that happens at runtime if you install an autoreleasepool, if I recall correctly.

      I did actually work at Apple, so that’s where my recollection comes from, although it’s been 8 years since then and I didn’t work on the compiler side of things, so my memory could be faulty.

      [1] https://github.com/apple/swift/pull/32233

      [2] https://opensource.apple.com/source/lldb/lldb-112/llvm/tools...

      [3] https://llvm.org/doxygen/group__ARCOpt.html
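
Rust has no ARC pass, but the effect of this kind of static elision is analogous to what a Rust programmer does by hand when replacing a refcount clone with a borrow. A hypothetical sketch (the Rust names here are illustrative, not Swift's mechanism):

```rust
use std::rc::Rc;

// Unoptimized shape: the callee takes ownership, so the caller must
// "retain" (clone) and the callee "releases" (drops) on exit --
// analogous to ARC emitting retain/release around a call.
fn len_with_retain(s: Rc<String>) -> usize {
    s.len() // the Rc is dropped (released) when `s` leaves scope
}

// Elided shape: the callee only borrows, so no refcount traffic
// happens at all. This is the effect ARC's optimizer aims for when
// it can prove the reference outlives the call.
fn len_borrowed(s: &Rc<String>) -> usize {
    s.len()
}

fn main() {
    let s = Rc::new(String::from("hello"));
    assert_eq!(len_with_retain(Rc::clone(&s)), 5); // one inc + one dec
    assert_eq!(len_borrowed(&s), 5);               // zero refcount ops
    assert_eq!(Rc::strong_count(&s), 1);           // count unchanged either way
}
```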

The conventional wisdom is that evacuating (copying) GCs win over malloc/free since 1) the GC touches only the live data and not the garbage, and 2) it periodically compacts the live memory, which improves cache and (when relevant) paging hit rates.

Obviously, though, this will be situation-dependent.

Then why does every performant managed language opt for a tracing GC when it can?

RC is used in lower-level languages because it doesn’t require runtime support and can be implemented as a library.

As I wrote in another comment, even with elision you are still trading constant writes on the working thread against the GC’s parallel work, and you even have to pay for synchronization in concurrent contexts.

  • Because tracing GCs can collect reference cycles, which RC can’t. So at the language level, where you have to handle all sorts of programs written by programmers of varying quality (plus mistakes), a tracing GC gives more predictable memory behavior across a broader range of programs.

    Seriously, a single-threaded reference counter is super cheap. Cross-thread reference counts shouldn’t be used and I think are an anti-pattern: it’s better to have the owning thread be responsible for maintaining the reference count and pass a borrow via IPC that the borrower has to hand back. There is also hybrid RC, where you use Arc across threads but plain RC within a thread. This gives you the best of both worlds at minimal cost. Which model you prefer is probably a matter of taste.
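
The hybrid scheme maps directly onto Rust's two refcounted types. A minimal sketch: pay the atomic `Arc` cost only once per thread boundary, then use non-atomic `Rc` for sharing inside the thread:

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn main() {
    // Share across threads with Arc: atomic counts, paid only at the boundary.
    let shared = Arc::new(vec![1, 2, 3]);

    let handles: Vec<_> = (0..2)
        .map(|_| {
            let shared = Arc::clone(&shared); // one atomic increment per thread
            thread::spawn(move || {
                // Inside the thread, local sharing uses plain Rc:
                // non-atomic increments, much cheaper.
                let local = Rc::new(shared.iter().sum::<i32>());
                let alias = Rc::clone(&local); // non-atomic bump
                *alias
            })
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 6);
    }
}
```

(Rc is deliberately not `Send`, so the compiler enforces that the cheap counter never crosses a thread boundary.)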

    CPUs are stupidly fast at incrementing and decrementing a counter. Additionally, most allocations should be on the stack, with a small number on the heap for data that is larger or needs to outlive the current scope. I’ve written all sorts of performance-critical programs (including games), and not once has shared_ptr in C++ (which is atomic) shown up in the profiler, because the vast majority of allocations are stack allocations, value composition, or unique_ptr (i.e. no GC of any kind needed).

    The fastest kind of GC is the one you don’t need at all (i.e. Box / unique_ptr). The second fastest is an inline increment of a counter that’s likely already in your CPU cache. I don’t think anyone can claim that pointer chasing is “fast,” and certainly not faster than ARC, again assuming you’re not carelessly throwing ARC around everywhere it isn’t needed in the first place. Value composition is much more powerful; leave RC / Arc for when you have a more complicated object graph with shared ownership (and even then, try to give ownership to the root, uniquely or through RC, and hand out plain references to children and RC only to peers).
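
That ownership hierarchy maps directly onto Rust's types. A sketch of the "no GC needed" cases described above (the struct names are illustrative):

```rust
fn main() {
    // Fastest: unique ownership, no refcount at all (Box / unique_ptr).
    let owned: Box<[u8; 64]> = Box::new([0u8; 64]);
    assert_eq!(owned.len(), 64);
    drop(owned); // freed deterministically, no bookkeeping beyond free()

    // Most data doesn't even need the heap: plain values on the stack,
    // composed by value rather than by pointer.
    #[derive(Clone, Copy)]
    struct Point { x: f64, y: f64 }
    struct Segment { a: Point, b: Point } // value composition, zero pointers

    let seg = Segment {
        a: Point { x: 0.0, y: 0.0 },
        b: Point { x: 3.0, y: 4.0 },
    };
    let dx = seg.b.x - seg.a.x;
    let dy = seg.b.y - seg.a.y;
    assert_eq!((dx * dx + dy * dy).sqrt(), 5.0); // 3-4-5 triangle
}
```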

    • Obviously the three objects you allocate with shared_ptr in C++ won’t be a performance bottleneck, but then we’re comparing apples to oranges.

      Your single-threaded RC will still have to write back to memory; no one thinks that incrementing an integer is the slow part, trashing the cache is.

  • Because they don’t care as much about working set size as Apple does.

    • Sure, it is a tradeoff, as is basically every other technical choice.

      But we were talking about performance here, and especially in throughput, tracing GCs are much better.

  • This isn't fully true. Java is in the process of getting LXR, which uses both tracing and reference counting.

    • Yes and no. LXR is a highly optimized deferred and coalesced reference-counting GC (among many other optimizations), and it looks nothing like the RC implementations you see in other languages like Python, Nim, Swift, etc. So yes, reference counting can be performant, _but_ only if you use a deferred and coalesced implementation, because otherwise the simple implementation requires atomic operations for every pointer read/write (though compilers can optimize to non-atomic operations when they can prove semantics are preserved).

      EDIT: The post you're responding to is referring to the simple standard implementation of RC.
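
A toy illustration of why coalescing matters. This is not LXR's actual mechanism (LXR buffers increments and decrements per mutation and processes them at collection boundaries); it just shows that applying the net count change instead of every individual retain/release can eliminate most of the atomic traffic:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Naive RC: one atomic read-modify-write per retain and per release.
fn naive(rc: &AtomicUsize, retains: usize, releases: usize) {
    for _ in 0..retains {
        rc.fetch_add(1, Ordering::Relaxed);
    }
    for _ in 0..releases {
        rc.fetch_sub(1, Ordering::Relaxed);
    }
}

// Coalesced RC: buffer the operations and apply only the net change.
fn coalesced(rc: &AtomicUsize, retains: usize, releases: usize) {
    let net = retains as isize - releases as isize;
    if net > 0 {
        rc.fetch_add(net as usize, Ordering::Relaxed);
    } else if net < 0 {
        rc.fetch_sub((-net) as usize, Ordering::Relaxed);
    }
    // net == 0 (a short-lived temporary reference): zero atomic ops.
}

fn main() {
    let a = AtomicUsize::new(1);
    let b = AtomicUsize::new(1);
    naive(&a, 1000, 1000);     // 2000 atomic operations
    coalesced(&b, 1000, 1000); // 0 atomic operations, same final count
    assert_eq!(a.load(Ordering::Relaxed), 1);
    assert_eq!(b.load(Ordering::Relaxed), 1);
}
```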

Swift and Objective-C ARC performance is quite poor.

  • Compared to what, though? And is that still the case if all OS components use whatever it is, as opposed to a few applications? Memory efficiency is crucial for overall system performance, and ARC is highly memory-efficient compared to every production GC I’m aware of.

  • iPhones ship with half the RAM of comparably fast Android phones. So I’d say there’s a real comparison of ARC vs. a tracing GC.

    • And Android rules the mobile world with 80% market share.

      Also, iOS applications tend to crash due to memory leaks or to not enough memory being available.

      So yeah, a real comparison.

    • iPhones are 3 generations ahead in single-core CPU performance, so that’s just a biased take.