Comment by vlovich123
3 years ago
Because tracing GCs can collect reference cycles, which RC can't. So at the language level, where you have to handle all sorts of programs written by programmers of varying quality (+ mistakes), a tracing GC gives more predictable memory behavior across a broader range of programs.
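To make the cycle point concrete, here's a minimal Rust sketch (the `Node` type is illustrative): two `Rc`'d nodes point at each other, so their strong counts never reach zero and neither is freed:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A node that may point at another node, so a cycle is possible.
struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

fn main() {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(None) });

    // a -> b and b -> a: each node now holds a strong reference to the other.
    *a.next.borrow_mut() = Some(Rc::clone(&b));
    *b.next.borrow_mut() = Some(Rc::clone(&a));

    println!("strong counts: a={}, b={}", Rc::strong_count(&a), Rc::strong_count(&b)); // 2 and 2

    // When `a` and `b` go out of scope, each count drops to 1, never 0,
    // so neither allocation is ever freed. A tracing GC reclaims this;
    // plain RC needs the programmer to break the cycle (e.g. with Weak).
}
```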
Seriously. A single-threaded reference count is super cheap. Cross-thread reference counts shouldn't be used and I think are an anti-pattern: it's better to have the owning thread be responsible for maintaining the reference count and pass a borrow via IPC that the borrower has to hand back. There is also hybrid RC, where you use Arc across threads but plain Rc within a thread. This gives you the best of both worlds with minimal cost. Which model you prefer is probably a matter of taste.
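A minimal sketch of that hybrid model in Rust (the data is illustrative): pay the atomic increment once when handing the `Arc` to a thread, then do all intra-thread sharing through a non-atomic `Rc` wrapper:

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

fn main() {
    // One atomically counted handle that is allowed to cross threads.
    let shared = Arc::new(vec![1, 2, 3]);

    let worker = {
        let shared = Arc::clone(&shared); // atomic increment, once per thread
        thread::spawn(move || {
            // Inside the thread, wrap the Arc in an Rc: every further
            // clone is now a plain non-atomic increment.
            let local = Rc::new(shared);
            let a = Rc::clone(&local); // cheap
            let b = Rc::clone(&local); // cheap
            a.len() + b.len()
        })
    };

    println!("sum of lengths: {}", worker.join().unwrap());
}
```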
CPUs are stupid fast at incrementing and decrementing a counter. Additionally, most allocations should be done on the stack, with a small amount done on the heap for things that are larger or need to outlive the current scope. I've written all sorts of performance-critical programs (including games) and never once has shared_ptr in C++ (which is atomic) popped up in the profiler, because the vast majority of allocations are stack, value composition, or unique_ptr (i.e. no GC of any kind needed).
The fastest kind of GC is one where you don't need any at all (i.e. Box / unique_ptr). The second fastest is an inlined increment of a count that's likely already in your CPU cache. I don't think anyone can claim that pointer chasing is "fast", and certainly not faster than ARC. Again, that assumes you're not being careless and throwing Arc around everywhere when it's not needed in the first place. Value composition is much more powerful; leave Rc / Arc for when you have a more complicated object graph with shared ownership (and even then, try to give the root ownership, uniquely or via Rc, and hand out plain references to children and Rc to peers).
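A sketch of that ownership discipline in Rust (the `World`/`Entity` names are hypothetical): the root owns everything by value, children are only borrowed out, and no reference count exists at all:

```rust
// Root uniquely owns its children by value: contiguous, cache friendly,
// and freed deterministically when the root drops.
struct Entity {
    hp: u32,
}

struct World {
    entities: Vec<Entity>,
}

impl World {
    // Hand out a borrow, not shared ownership.
    fn entity(&self, i: usize) -> &Entity {
        &self.entities[i]
    }
}

fn main() {
    let world = World {
        entities: vec![Entity { hp: 100 }, Entity { hp: 50 }],
    };
    println!("entity 1 hp: {}", world.entity(1).hp);
    // `world` drops here and everything it owns is freed, no GC involved.
}
```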
Obviously the 3 objects you allocate with shared_ptr in C++ won't be a performance bottleneck, but then we're not comparing apples to apples.
Your single-threaded RC will still have to write back to memory. No one thinks that incrementing an integer is the slow part; destroying your cache is.
Even when I had this in the hot path 10 years ago, owning hundreds of objects in a particle filter, handing out ownership copies and creating new ones ended up taking ~5% of the runtime (the fix was making it a contiguous vector without any shared_ptr). It can be expensive, but in those cases you probably shouldn't be using shared_ptr anyway.
Oh, and incrementing an integer by itself (non-atomically) is stupid fast; you can do a billion of them per second. The CPU doesn't actually write that to RAM immediately, and you're not adding much extra cache pressure compared to everything else your program is doing normally.
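As a hedged illustration of the particle-filter point above (the `Particle` type and the resampling rule are made up): storing particles by value in one contiguous vector removes the per-object refcount traffic entirely:

```rust
#[derive(Clone)]
struct Particle {
    x: f64,
    weight: f64,
}

// Resampling copies particles by value: iteration walks memory linearly
// and there is no per-particle reference count to touch.
fn resample(particles: &[Particle]) -> Vec<Particle> {
    particles
        .iter()
        .filter(|p| p.weight > 0.5) // toy survival rule, purely illustrative
        .cloned()
        .collect()
}

fn main() {
    let particles = vec![
        Particle { x: 0.0, weight: 0.9 },
        Particle { x: 1.0, weight: 0.1 },
    ];
    println!("kept {} of {}", resample(&particles).len(), particles.len());
}
```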
> Your single-threaded RC will still have to write back to memory
I think you mean memory or cache, and there's a good chance it will remain in cache and not be flushed to RAM for short-lived objects.
> No one thinks that incrementing an integer is the slow part; destroying your cache is.
agreed
If you write to a cache line, then depending on the architecture the coherence protocol has to make that change visible to every other core, typically by invalidating their copies of the line. Reads are not subject to that constraint.
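A rough micro-benchmark sketch of that effect (numbers vary wildly by machine; this is illustrative, not rigorous): a private counter stays hot in one core's cache, while a counter contended by several threads bounces its cache line between cores:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    const N: u64 = 10_000_000;

    // Uncontended: a private counter stays hot in one core's cache.
    let start = Instant::now();
    let mut local: u64 = 0;
    for _ in 0..N {
        local = std::hint::black_box(local + 1); // keep the loop honest
    }
    println!("private counter: {:?} ({})", start.elapsed(), local);

    // Contended: four threads hammer one atomic, so the cache line holding
    // it ping-pongs between cores. The increment itself is still trivial;
    // the coherence traffic is the cost.
    let counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..N {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!(
        "contended atomic: {:?} ({})",
        start.elapsed(),
        counter.load(Ordering::Relaxed)
    );
}
```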
So fast that Apple Silicon introduced specific memory instructions to handle ARC counters.
It did not do this; standard atomics are relatively fast on it because it's a unified memory system.
ARM also offers atomics with weaker consistency ordering, which helps with speed (and ARC does take advantage of this, but as far as I know it's not something Apple specifically accelerated in silicon).
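For concreteness, a sketch of what those weaker orderings buy a refcount (this mirrors how Rust's Arc orders its operations, though the code itself is illustrative): increments can be fully relaxed, while the final decrement needs release/acquire so the object's writes are visible before it's freed:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Bumping the count doesn't need to synchronize with anything: a relaxed
// read-modify-write is enough (on ARM this can compile down to a plain
// atomic add with no acquire/release barrier).
fn retain(count: &AtomicUsize) {
    count.fetch_add(1, Ordering::Relaxed);
}

// Dropping a reference must publish all prior writes to the object
// (Release), and the thread that sees the count hit zero must observe
// them before destroying it (the Acquire fence).
fn release(count: &AtomicUsize) -> bool {
    if count.fetch_sub(1, Ordering::Release) == 1 {
        fence(Ordering::Acquire);
        return true; // caller is now responsible for freeing the object
    }
    false
}

fn main() {
    let count = AtomicUsize::new(1);
    retain(&count);
    assert!(!release(&count));
    assert!(release(&count)); // last reference gone; safe to free
}
```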