Comment by kaba0

3 years ago

Then why does every performant managed language opt for tracing GCs when it can?

RC is used in lower level languages because it doesn’t require runtime support, and can be implemented as a library.

As I wrote in another comment, even with elisions, you are still trading off constant writes on the working thread for parallel work, and you even have to pay for synchronization in parallel contexts.

Because tracing GCs can collect referential loops, which RC can’t. So at the language level, where you have to handle all sorts of programs written by programmers of varying quality (plus mistakes), a tracing GC gives more predictable memory usage and performance across a broader range of programs.

Seriously. A single-threaded reference counter is super cheap. Cross-thread reference counts shouldn’t be used and I think are an anti-pattern: it’s better to have the owning thread be responsible for maintaining the reference count and to pass a borrow via IPC that the borrower has to hand back. There is also hybrid RC, where you use Arc across threads but plain RC within a thread. This gives you the best of both worlds with minimal cost. Which model you prefer is probably a matter of taste.

CPUs are stupid fast at incrementing and decrementing a counter. Additionally, most allocations should be done on the stack, with a small amount done on the heap for things that are larger or need to outlive the current scope. I’ve written all sorts of performance-critical programs (including games), and not once has shared_ptr in C++ (which is atomic) shown up in the profiler, because the vast majority of allocations are stack allocations, value composition, or unique_ptr (ie no GC of any kind needed).

The fastest kind of GC is the one you don’t need at all (ie Box / unique_ptr). The second fastest is a plain increment that’s likely already in your CPU cache. I don’t think anyone can claim that pointer chasing is “fast”, and certainly not faster than ARC. Again, assuming you’re not being careless and throwing ARC around everywhere it’s not needed in the first place. Value composition is much more powerful; leave RC / Arc for when you have a more complicated object graph with shared ownership (and even then, try to give ownership to the root, uniquely or through RC, and hand out references only to children and RC to peers).

  • Obviously the 3 objects you allocate with shared_ptr in C++ won’t be a performance bottleneck, but then we are comparing apples to oranges.

    Your single threaded RC will still have to write back to memory, no one thinks that incrementing an integer is the slow part — destroying cache is.

    • Even when I had this in the hot path 10 years ago, owning hundreds of objects in a particle filter, handing out ownership copies and creating new ones ended up taking ~5% of runtime (the fix was making it a contiguous vector without any shared_ptr). It can be expensive, but in those cases you probably shouldn’t be using shared_ptr.

      Oh, and the cost of incrementing an integer by itself (non-atomically) is stupid fast. You can do a billion of them per second. The CPU doesn’t actually write it to RAM immediately, and you’re not adding a huge amount of extra cache pressure vs all the other things your program is doing normally.

    • > Your single threaded RC will still have to write back to memory

      I think you mean memory or cache, and there's a good chance it will remain in cache and never be flushed to RAM for short-lived objects.

      > no one thinks that incrementing an integer is the slow part — destroying cache is.

      agreed


Because they don’t care as much about working set size as Apple does.

  • Sure, it is a tradeoff as basically every other technical choice.

    But we were talking about performance here, and especially in throughput, tracing GCs are much better.

    • Proof about throughput? Stating something as fact isn’t as strong as showing evidence. Also, performance has multiple axes: throughput, latency, memory usage, etc. Java eating all your memory and spending a non-trivial amount of overall CPU on a tracing collector, which can negatively impact latency, isn’t necessarily what people want efficiency-wise; smearing that cost over total execution time, as RC does, tends to give better results.

      The main advantage of a tracing GC is that you have an easier programming model (which isn’t a trivial trade-off), but the downside is that you don’t have deterministic destruction, which can make certain programming paradigms difficult. (Yes, Java introduced alternate mechanisms to get you there if you care, but they feel like a 100% bolt-on rather than something integrated neatly and elegantly, and most developers don’t actually care.)


This isn’t fully true. Java is in the process of getting LXR, which uses both tracing and reference counting.

  • Yes and no. LXR is a highly optimized deferred and coalesced reference-counting GC (amongst many other optimizations), and it looks nothing like the RC implementations you see in other languages like Python, Nim, Swift, etc. So yes, reference counting can be performant, _but_ only if you are using a deferred and coalesced implementation; otherwise the simple implementation requires atomic operations for every pointer read/write operation (though compilers could optimize it to use non-atomic operations when they’re sure semantics are preserved).

    EDIT: The post you're responding to is referring to the simple standard implementation of RC.