Comment by matklad

1 day ago

There's some reshuffling of bugs for sure, but, in my experience, there's also a very noticeable reduction! It seems there's no law of conservation of bugs.

I would say the main effect here is that a global allocator often leads to ad-hoc, "shotgun" resource management all over the place, and that's hard to get right in a manually memory-managed language. Most Zig code that deals with allocators has resource management bugs (including TigerBeetle's own code at times! Shoutout to https://github.com/radarroark/xit as the only code base I've seen so far where finding such a bug wasn't trivial). E.g., in OP, memory is leaked on allocation failures.

But if you manage resources statically, you just can't do that: you are forced to centralize the codepaths that deal with resource acquisition and release, and that drastically reduces the amount of bug-prone code. You _could_ apply the same philosophy to allocating code, but static allocation _forces_ you to do that.
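As a minimal sketch of what that centralization can look like (hypothetical names, not TigerBeetle's actual code): all buffers come from a single init path, and after init the pool can only recycle buffers, never conjure new ones:

    const std = @import("std");

    // Hypothetical fixed pool: every buffer is allocated once, up front.
    const BufferPool = struct {
        buffers: [][]u8,
        free_count: usize,

        fn init(gpa: std.mem.Allocator, count: usize, size: usize) !BufferPool {
            const buffers = try gpa.alloc([]u8, count);
            errdefer gpa.free(buffers);
            var allocated: usize = 0;
            errdefer for (buffers[0..allocated]) |buffer| gpa.free(buffer);
            while (allocated < count) : (allocated += 1) {
                buffers[allocated] = try gpa.alloc(u8, size);
            }
            return .{ .buffers = buffers, .free_count = count };
        }

        // Post-init, acquisition can't fail with OOM: you either get a
        // pre-allocated buffer or an explicit "pool exhausted" answer.
        fn acquire(pool: *BufferPool) ?[]u8 {
            if (pool.free_count == 0) return null;
            pool.free_count -= 1;
            return pool.buffers[pool.free_count];
        }

        fn release(pool: *BufferPool, buffer: []u8) void {
            std.debug.assert(pool.free_count < pool.buffers.len);
            pool.buffers[pool.free_count] = buffer;
            pool.free_count += 1;
        }
    };

Note that all the error-path cleanup lives in that one init function; nothing outside it ever allocates.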

The secondary effect is that you tend to think more explicitly about resources, and to assert application-level invariants more proactively. A good example here is the compaction code, which juggles a bunch of blocks, and each block's lifetime is tracked both externally:

* https://github.com/tigerbeetle/tigerbeetle/blob/0baa07d3bee7...

and internally:

* https://github.com/tigerbeetle/tigerbeetle/blob/0baa07d3bee7...

with a bunch of assertions all over the place to triple-check that each block is accounted for and is where it is expected to be:

https://github.com/tigerbeetle/tigerbeetle/blob/0baa07d3bee7...

I see a weak connection with proofs here. When you are coding with static resources, you generally have to make informal "proofs" that you actually have the resource you are planning to use, and these proofs are materialized as a web of interlocking asserts; the web works only when it is correct as a whole. With global allocation, you can always materialize fresh resources out of thin air, so nothing forces you to build such a web of proofs.
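In sketch form (hypothetical names, much simpler than the real compaction code), such a "proof" is a location tag plus an assert on every hand-off:

    const std = @import("std");

    // Each block records where it is believed to be...
    const Block = struct {
        const Location = enum { free, in_compaction, in_cache };
        location: Location = .free,
    };

    // ...and every transfer asserts that belief before updating it.
    fn acquireForCompaction(block: *Block) void {
        std.debug.assert(block.location == .free);
        block.location = .in_compaction;
    }

    fn releaseFromCompaction(block: *Block) void {
        std.debug.assert(block.location == .in_compaction);
        block.location = .free;
    }

Any single assert here is trivial; the value comes from the whole web of them agreeing with each other.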

To set the context more explicitly: the fact that this works for TigerBeetle of course doesn't mean that it generalizes, _but_ the disproportionate number of bugs we've had in the small amount of gpa-using code we do have makes me think that there's something more here than just TB's house style.

Hey matklad! Thanks for hanging out here and commenting on the post. I was hoping you guys would see this and give some feedback based on your work in TigerBeetle.

You mentioned, "E.g., in OP, memory is leaked on allocation failures." - Can you clarify a bit more about what you mean there?

  • In

        const recv_buffers = try ByteArrayPool.init(gpa, config.connections_max, recv_size);
        const send_buffers = try ByteArrayPool.init(gpa, config.connections_max, send_size);
    

    if the second try fails, then the memory allocated by the first try is leaked. Possible fixes:

    A) clean up individual allocations on failure:

        const recv_buffers = try ByteArrayPool.init(gpa, config.connections_max, recv_size);
        errdefer recv_buffers.deinit(gpa);
    
        const send_buffers = try ByteArrayPool.init(gpa, config.connections_max, send_size);
        errdefer send_buffers.deinit(gpa);
    

    B) ask the caller to pass in an arena instead of a gpa to do bulk cleanup (types & code stay the same, but naming & contract change; a call-site sketch is at the end of this comment):

        const recv_buffers = try ByteArrayPool.init(arena, config.connections_max, recv_size);
        const send_buffers = try ByteArrayPool.init(arena, config.connections_max, send_size);
    

    C) declare OOMs to be fatal errors

        const recv_buffers = ByteArrayPool.init(gpa, config.connections_max, recv_size) catch |err| oom(err);
        const send_buffers = ByteArrayPool.init(gpa, config.connections_max, send_size) catch |err| oom(err);
    
        fn oom(_: error{OutOfMemory}) noreturn { @panic("oom"); }
    

    You might also be interested in https://news.ycombinator.com/item?id=46423691
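
    For completeness, a rough call-site sketch for option B (assuming std.heap.ArenaAllocator; the other names are from OP's snippet):

        const std = @import("std");

        // The caller owns the arena; one deinit reclaims everything both
        // pools allocated, even if the second init fails partway through.
        var arena_instance = std.heap.ArenaAllocator.init(std.heap.page_allocator);
        defer arena_instance.deinit();
        const arena = arena_instance.allocator();

        const recv_buffers = try ByteArrayPool.init(arena, config.connections_max, recv_size);
        const send_buffers = try ByteArrayPool.init(arena, config.connections_max, send_size);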

    • Gotcha. Thanks for clarifying! I guess I wasn't super concerned about the 'try' failing here, since this code is squarely in the initialization path and I want the OOM to bubble up to main() and crash. Although, to be fair: 1. It's not a great experience to be given a stack trace; we could definitely have a nicer message there. And 2. If the ConnectionPool init() is (re)used elsewhere, outside this overall initialization path, we could run into that leak.

      The allocation failure that could occur at runtime, post-init, would be here: https://github.com/nickmonad/kv/blob/53e953da752c7f49221c9c4... - and the OOM error kicks back an immediate close on the connection to the client.
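
      In sketch form (hypothetical names, not the actual code behind that link), the shape of that path is: contain the failure to the one connection and keep serving everyone else.

          fn onAccept(server: *Server, conn: *Connection) void {
              const buffer = server.recv_buffers.acquire(server.gpa) catch {
                  // error.OutOfMemory: reject just this client,
                  // the rest of the server keeps running.
                  conn.close();
                  return;
              };
              server.startRecv(conn, buffer);
          }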

That's an interesting observation. BTW, I've noticed that when I write in Assembly I tend to have fewer bugs than when I write in C++ (and they tend to be easier to find). That's partly because I'm more careful, but also because I only write much shorter and simpler things in Assembly.