It's kind of like the small string optimization you see in C++[1] where all the string metadata to account for heap pointer, size and capacity is union'ed with char*. Getting the stack allocation doesn't costs extra memory, but does cost a bit check. Not sure if slices in go use the same method. 32 bytes is a lot so maybe they fattened slice representations a bit to get a bit more bang for your buck?
> It's kind of like the small string optimization you see in C++ ...
Agreed. These types of optimizations can yield significant benefits and are often employed in language standard libraries. For example, the Scala standard library employs an analogous optimization in their Set[0] collection type.
This article is about Go, but I wonder how many C/C++ developers realize that you've always had the ability to allocate on the stack using alloca() rather than malloc().
Of course use cases are limited (variable length buffers/strings, etc) since the lifetime of anything on the stack has to match the lifetime of the stack frame (i.e the calling function), but it's super fast since it's just bumping up the stack pointer.
alloca() is super useful, but it's also quite dangerous because you can easily overflow the stack.
The obvious issue is that you can't know how much space is left on the stack, so you basically have to guess and pick an arbitrary "safe" size limit. This gets even more tricky when functions may be called recursively.
The more subtle issue is that the stack memory returned by alloca() has function scope and therefore you must never call it directly in a loop.
I use alloca() on a regular basis, but I have to say there are safer and better alternatives, depending on the particular use case: arena/frame allocators, threadlocal pseudo-stacks, static vectors, small vector optimizations, etc.
> The obvious issue is that you can't know how much space is left on the stack [...]
Oh, huh. I've never actually tried it, but I always assumed it would be possible to calculate this, at least for a given OS / arch. You just need 3 quantities, right? `remaining_stack_space = $stack_address - $rsp - $system_stack_size`.
But I guess there's no API for a program to get its own stack address unless it has access to `/proc/$pid/maps` or similar?
If you have well defined boundaries, you can move the stack to an arbitrarily large chunk of memory before the recursive call and restore it to the system stack upon completion.
If you're not doing recursion, I prefer using an appropriately sized thread_local buffer in this scenario. Saves you the allocation and does the bookkeeping of having one per thread
Most C compilers let you use variable length arrays on the stack. However they're problematic and mature code bases usually disable this (-Werror -Wvla) because if the size is derived from user input then it's exploitable.
For purely historical reasons the C/C++ stack is "small" with exactly how small being outside of programmer control. So you have to avoid using the stack even if it would be the better solution. Otherwise you risk your program crashing/failing with stack overflow errors.
I wouldn't call it a hack, but it's not a general alternative for memory allocated on the heap since the lifetime is tied to that of the allocating function.
I think what you're referring to is an arena allocator where you allocate a big chunk of memory from the heap, then sequentially sub-allocate from that, then eventually free the entire heap chunk (arena) in one go. Arena allocators are therefore also special use case since they are for when all the sub-allocations have the same (but arbitrary) lifetime, or at least you're willing to defer deallocation of everything to the same time.
So, heap, arena and stack allocation all serve different purposes, although you can just use heap for everything if memory allocation isn't a performance issue for your program, which nowadays is typically the case.
Back in the day when memory was scarce and computers were much slower, another common technique was to keep a reuse "free list" of allocated items of a given type/size, which was faster than heap allocate and free/coalesce, and avoided the heap fragmentation of random malloc/frees.
alloca()'s availability and correctness/bugginess is platform dependent, so it probably sees only niche usage since it's not portable. Furthermore, even its man page discourages its use in the general case:
>The alloca() function is machine- and compiler-dependent. Because it allocates from the stack, it's faster than malloc(3) and free(3). In certain cases, it can also simplify memory deallocation in applications that use longjmp(3) or siglongjmp(3). Otherwise, its use is discouraged.
Furthermore:
>The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.
A TLB miss could happen when executing the next statement in your program. It's not something you have a lot of control over, and doesn't change the fact that allocating from the stack (when an option) is going to be faster than allocating from the heap.
Nice to see common and natural patterns to have their performance improved. Theoretically appending to a slice would be possible to handle with just stack growth, but that would require having large gaps between goroutine stacks and mapping them lazily upon access instead of moving goroutines to the new contiguous blocks as it's implemented right now. But given how many questionable changes it requires from runtime it's certainly not going to happen :)
Having big stack frames is bad for cache locality. Stack is not something magical, it's mapped to the same physical memory as heap and needs to be loaded. Pretty sure such optimization would reduce performance in most cases.
In the case where you're using the top of the stack as a, well, stack, I don't see the problem. It would only work if you're not interleaving processing of dynamically-sized objects and function codegen works out. It's similar to TCO in the sense of maintaining certain invariants across calls (e.g. no temporaries need be preserved), and actually in languages with TCO, like Lua, you can hack an application-level stack data structure using tail recursion (and coroutines/threads if you need more than one) that can sometimes be more performant or more convenient than using a native data structure.
There's been a least one experiment (posted a few years ago to HN) where someone benchmarked a stackful coroutine implementation with hundreds of thousands (millions?) of stacks that could grow contiguously on-demand up to, e.g., 2MB, but were initially minimally sized and didn't reserve the maximum stack size upfront. The bottleneck was the VMA bookkeeping--the syscalls, exploding the page table, TLB flushing, etc. In principle it could work well and be even more performant than existing solutions, and it might work better today since Linux 6.13's lightweight guard page feature, MADV_GUARD_INSTALL, but we probably still need more architectural support from the system (kernel, if not hardware) to make it performant and competitive with language-level solutions like goroutines, Rust async, etc.
Awesome stuff! Does Go have profile-guided optimization? I'm wondering whether a profile could hint to the compiler how large to make the pre-reserved stack space.
I never noticed much difference with using pgo even after taking a very long real life profile. All the machinery required to get it and put it to CI was never worth the speed-up. Of course YMMV.
Agreed. There's quite a bit of room for optimization if your language design allows for it. Plus you have flexibility to make different tradeoffs as computer architectures and the cost of various operations change over time.
> ...
> On the third loop iteration, the backing store of size 2 is full. append again has to allocate a new backing store, this time of size 4. The old backing store of size 2 is now garbage.
Correct me if I'm wrong, but isn't this a worst-case scenario? realloc can, iirc, extend in place. Your original pointer is still invalid then, but no copy is needed then.
Unless I'm missing something?
Equally, what happens to the ordering of variables on the stack? Is this new one pushed as the last one? Or is there space kept open?
The ability to grow without copying is already part of how slices work. Every slice is really a 3-word tuple of pointer, length, and capacity. If not explicitly set with make, the capacity property defaults to a value that fills out the size class of the allocation. It just so happens that, in this case, the size of the Task type doesn't allow for more than 1 value to fit in the smallest allocation. If you were to do this with a []byte or []int32 etc., you would see that the capacity doesn't necessarily start at 1: https://go.dev/play/p/G5cifdChGIZ
[]task is a pointer to a range of elements. TFA says if you initialize it to point to a new array of 10 elements, that array of 10 elements may be stack–allocated. If you allocate another array dynamically, that one won't be.
I want to like this, and it's directionally good work...
But it's hard to see this as very useful unless we also start to see some increases in legibility, and ways to make sure these optimizations are being used (and that textually minor changes don't cause non-obvious performance regressions).
I've written a lot of golang code that was benchmarked to shreds, and in which we absolutely cared about stack-vs-heap allocations because they were crucial to overall performance. I've spent a lot of time pouring over assembler dumps, because grepping those for indications of new object creation was sometimes clearer (and certainly more definitive) than trying to infer it from the source code level. The one thing I've learned from this?
It's very, very easy for all those efforts to come to naught if the rules change slightly.
And it's very, very, VERY easy for a co-maintainer on a project to stroll in and make seemingly textually trivial changes that have outsized impacts. (I'm looking at inliner thresholds, specifically. Hoo boy.)
The best balm we have for this right now is writing benchmarks and making sure they report zero allocs. (Or unit tests using the runtime memstats harness; potato potato.) But that is a very fragile balm, and relatively complex to maintain, and (if DX is considered) is not textually local to the code in question -- which means someone changing the code can easily miss the criticality of a section (until the tests yell at them, at least).
I really yearn for some markup that can say "I expect this code to contain zero heap allocations; please flunk the compile if that is not the case".
It is actually a bad design when compiler go this far into a micro optimization but assume it understands the context so it can make decisions for you.
"The reason is that the compiler decided to allocate the backing store on the stack. Because it knows what size it needs to be (10 times the size of a task) it can allocate storage for it in the stack frame of process2 instead of on the heap1. Note that this depends on the fact that the backing store does not escape to the heap inside of processAll."
This is definitionally small-object boxing optimizations.
It's kind of like the small string optimization you see in C++[1] where all the string metadata to account for heap pointer, size and capacity is union'ed with char*. Getting the stack allocation doesn't costs extra memory, but does cost a bit check. Not sure if slices in go use the same method. 32 bytes is a lot so maybe they fattened slice representations a bit to get a bit more bang for your buck?
[1] https://github.com/elliotgoodrich/SSO-23
> It's kind of like the small string optimization you see in C++ ...
Agreed. These types of optimizations can yield significant benefits and are often employed in language standard libraries. For example, the Scala standard library employs an analogous optimization in their Set[0] collection type.
0 - https://github.com/scala/scala3/blob/88438e2c6e6204e12666067...
Reminds me of zig's on stack fixedBufferAllocator
This article is about Go, but I wonder how many C/C++ developers realize that you've always had the ability to allocate on the stack using alloca() rather than malloc().
Of course use cases are limited (variable length buffers/strings, etc) since the lifetime of anything on the stack has to match the lifetime of the stack frame (i.e the calling function), but it's super fast since it's just bumping up the stack pointer.
alloca() is super useful, but it's also quite dangerous because you can easily overflow the stack.
The obvious issue is that you can't know how much space is left on the stack, so you basically have to guess and pick an arbitrary "safe" size limit. This gets even more tricky when functions may be called recursively.
The more subtle issue is that the stack memory returned by alloca() has function scope and therefore you must never call it directly in a loop.
I use alloca() on a regular basis, but I have to say there are safer and better alternatives, depending on the particular use case: arena/frame allocators, threadlocal pseudo-stacks, static vectors, small vector optimizations, etc.
> The obvious issue is that you can't know how much space is left on the stack [...]
Oh, huh. I've never actually tried it, but I always assumed it would be possible to calculate this, at least for a given OS / arch. You just need 3 quantities, right? `remaining_stack_space = $stack_address - $rsp - $system_stack_size`.
But I guess there's no API for a program to get its own stack address unless it has access to `/proc/$pid/maps` or similar?
8 replies →
If you have well defined boundaries, you can move the stack to an arbitrarily large chunk of memory before the recursive call and restore it to the system stack upon completion.
1 reply →
> alloca() is super useful, but it's also quite dangerous because you can easily overflow the stack.
This is not a problem for Go, because it has resizable stacks.
If you're not doing recursion, I prefer using an appropriately sized thread_local buffer in this scenario. Saves you the allocation and does the bookkeeping of having one per thread
Most C compilers let you use variable length arrays on the stack. However they're problematic and mature code bases usually disable this (-Werror -Wvla) because if the size is derived from user input then it's exploitable.
For purely historical reasons the C/C++ stack is "small" with exactly how small being outside of programmer control. So you have to avoid using the stack even if it would be the better solution. Otherwise you risk your program crashing/failing with stack overflow errors.
What do you mean outside of programmer control? What's stopping you from setting the stack size in the linker flags?
1 reply →
This is more of a patch/hack solution as far as I can understand.
You can just as well pass a heap allocated buffer + size around and allocate by incrementing/decrementing size.
Or even better use something like zig's FixedSizeAllocator.
Correct me if I am wrong please
I wouldn't call it a hack, but it's not a general alternative for memory allocated on the heap since the lifetime is tied to that of the allocating function.
I think what you're referring to is an arena allocator where you allocate a big chunk of memory from the heap, then sequentially sub-allocate from that, then eventually free the entire heap chunk (arena) in one go. Arena allocators are therefore also special use case since they are for when all the sub-allocations have the same (but arbitrary) lifetime, or at least you're willing to defer deallocation of everything to the same time.
So, heap, arena and stack allocation all serve different purposes, although you can just use heap for everything if memory allocation isn't a performance issue for your program, which nowadays is typically the case.
Back in the day when memory was scarce and computers were much slower, another common technique was to keep a reuse "free list" of allocated items of a given type/size, which was faster than heap allocate and free/coalesce, and avoided the heap fragmentation of random malloc/frees.
alloca()'s availability and correctness/bugginess is platform dependent, so it probably sees only niche usage since it's not portable. Furthermore, even its man page discourages its use in the general case:
>The alloca() function is machine- and compiler-dependent. Because it allocates from the stack, it's faster than malloc(3) and free(3). In certain cases, it can also simplify memory deallocation in applications that use longjmp(3) or siglongjmp(3). Otherwise, its use is discouraged.
Furthermore:
>The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.
https://man7.org/linux/man-pages/man3/alloca.3.html
It is a good thing many people do not know it. Since if you need this to squeeze that little performance window, you’d better know what you are doing.
It becames super slow when you bump that pointer into a page that's missing from the TLB.
A TLB miss could happen when executing the next statement in your program. It's not something you have a lot of control over, and doesn't change the fact that allocating from the stack (when an option) is going to be faster than allocating from the heap.
1 reply →
Nice to see common and natural patterns to have their performance improved. Theoretically appending to a slice would be possible to handle with just stack growth, but that would require having large gaps between goroutine stacks and mapping them lazily upon access instead of moving goroutines to the new contiguous blocks as it's implemented right now. But given how many questionable changes it requires from runtime it's certainly not going to happen :)
Having big stack frames is bad for cache locality. Stack is not something magical, it's mapped to the same physical memory as heap and needs to be loaded. Pretty sure such optimization would reduce performance in most cases.
In the case where you're using the top of the stack as a, well, stack, I don't see the problem. It would only work if you're not interleaving processing of dynamically-sized objects and function codegen works out. It's similar to TCO in the sense of maintaining certain invariants across calls (e.g. no temporaries need be preserved), and actually in languages with TCO, like Lua, you can hack an application-level stack data structure using tail recursion (and coroutines/threads if you need more than one) that can sometimes be more performant or more convenient than using a native data structure.
There's been a least one experiment (posted a few years ago to HN) where someone benchmarked a stackful coroutine implementation with hundreds of thousands (millions?) of stacks that could grow contiguously on-demand up to, e.g., 2MB, but were initially minimally sized and didn't reserve the maximum stack size upfront. The bottleneck was the VMA bookkeeping--the syscalls, exploding the page table, TLB flushing, etc. In principle it could work well and be even more performant than existing solutions, and it might work better today since Linux 6.13's lightweight guard page feature, MADV_GUARD_INSTALL, but we probably still need more architectural support from the system (kernel, if not hardware) to make it performant and competitive with language-level solutions like goroutines, Rust async, etc.
Awesome stuff! Does Go have profile-guided optimization? I'm wondering whether a profile could hint to the compiler how large to make the pre-reserved stack space.
Yep. `go build -pgo=foo.pprof`
https://go.dev/doc/pgo
I never noticed much difference with using pgo even after taking a very long real life profile. All the machinery required to get it and put it to CI was never worth the speed-up. Of course YMMV.
Optimizations like these are so cool. I love seeing higher level languages take advantage of their high level-ness
Agreed. There's quite a bit of room for optimization if your language design allows for it. Plus you have flexibility to make different tradeoffs as computer architectures and the cost of various operations change over time.
you can do this in C, you just need to let its low level-ness be at the same level as everything else you do, just a setjmp longerjmp
> ... > On the third loop iteration, the backing store of size 2 is full. append again has to allocate a new backing store, this time of size 4. The old backing store of size 2 is now garbage.
Correct me if I'm wrong, but isn't this a worst-case scenario? realloc can, iirc, extend in place. Your original pointer is still invalid then, but no copy is needed then.
Unless I'm missing something?
Equally, what happens to the ordering of variables on the stack? Is this new one pushed as the last one? Or is there space kept open?
E.g.:
The ability to grow without copying is already part of how slices work. Every slice is really a 3-word tuple of pointer, length, and capacity. If not explicitly set with make, the capacity property defaults to a value that fills out the size class of the allocation. It just so happens that, in this case, the size of the Task type doesn't allow for more than 1 value to fit in the smallest allocation. If you were to do this with a []byte or []int32 etc., you would see that the capacity doesn't necessarily start at 1: https://go.dev/play/p/G5cifdChGIZ
[]task is a pointer to a range of elements. TFA says if you initialize it to point to a new array of 10 elements, that array of 10 elements may be stack–allocated. If you allocate another array dynamically, that one won't be.
I want to like this, and it's directionally good work...
But it's hard to see this as very useful unless we also start to see some increases in legibility, and ways to make sure these optimizations are being used (and that textually minor changes don't cause non-obvious performance regressions).
I've written a lot of golang code that was benchmarked to shreds, and in which we absolutely cared about stack-vs-heap allocations because they were crucial to overall performance. I've spent a lot of time pouring over assembler dumps, because grepping those for indications of new object creation was sometimes clearer (and certainly more definitive) than trying to infer it from the source code level. The one thing I've learned from this?
It's very, very easy for all those efforts to come to naught if the rules change slightly.
And it's very, very, VERY easy for a co-maintainer on a project to stroll in and make seemingly textually trivial changes that have outsized impacts. (I'm looking at inliner thresholds, specifically. Hoo boy.)
The best balm we have for this right now is writing benchmarks and making sure they report zero allocs. (Or unit tests using the runtime memstats harness; potato potato.) But that is a very fragile balm, and relatively complex to maintain, and (if DX is considered) is not textually local to the code in question -- which means someone changing the code can easily miss the criticality of a section (until the tests yell at them, at least).
I really yearn for some markup that can say "I expect this code to contain zero heap allocations; please flunk the compile if that is not the case".
Nice! That's (seems) so simple yet also so very effective. Shouldn't other memory-managed languages be able to profit from this as well?
It’s a very well known pattern, as someone else mentioned it’s used in CPP in smallstring, Rust smallvec, C usually hand rolled etc.
C# has `stackalloc`
But that requires an explicit declaration and isn't done automatically under the hood, or am I missing something?
1 reply →
I read that as "Allocating on the Slack" and immediately came up with three ways how to do that.
[flagged]
It is actually a bad design when compiler go this far into a micro optimization but assume it understands the context so it can make decisions for you.
alloca() is not part of the C++ standard, and I can't imagine how it could used safely in a C++ environment
If I had a nickel for every article about avoiding implicit boxing in gc-heap languages...
...you would have the same balance as before, because this is not an article about implicit boxing. ;-)
"The reason is that the compiler decided to allocate the backing store on the stack. Because it knows what size it needs to be (10 times the size of a task) it can allocate storage for it in the stack frame of process2 instead of on the heap1. Note that this depends on the fact that the backing store does not escape to the heap inside of processAll."
This is definitionally small-object boxing optimizations.
1 reply →