Comment by chrsw

2 days ago

Something like this maybe:

https://whitebox.systems/

Doesn't seem to meet all your desired features though.

Yes, that’s a good example — thanks for the link. Tools like this seem very strong at visualizing and exploring state, but they still tend to stay fairly close to the traditional “pause and inspect” model. What I keep struggling with is understanding how a particular state came to be — especially with concurrency or events that happened much earlier. That gap between state visualization and causality feels hard to bridge, and I’m not sure what the right abstraction should be yet.

  • Sounds like you want a time travel debugger, e.g. rr.

    Sophisticated live debuggers are great when you can use them but you have to be able to reproduce the bug under the debugger. Particularly in distributed systems, the hardest bugs aren't reproducible at all and there are multiple levels of difficulty below that before you get to ones that can be reliably reproduced under a live debugger, which are usually relatively easy. Not being able to use your most powerful tools on your hardest problems rather reduces their value. (Time travel debuggers do record/replay, which expands the set of problems you can use them on, but you still need to get the behaviour to happen while it's being recorded.)

    • That’s a very fair point. The hardest bugs I’ve dealt with were almost always the least reproducible ones, which makes even very powerful debuggers hard to apply in practice. It makes me wonder whether the real challenge is not just having time-travel, but deciding when and how much history to capture before you even know something went wrong.

  • Sounds like you want time travel debugging [1]. Then you can just run forwards and backwards as needed and look at the full evolution of state and causality. You usually want to use an integrated history visualization tool to make the most of that, since the amount of state you are looking at is truly immense; identifying the single wrong store 17 billion instructions ago can be a pain without it.

    [1] https://en.wikipedia.org/wiki/Time_travel_debugging
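    A toy illustration of the core trick behind record/replay time travel: once execution is recorded deterministically, "stepping backwards" can be implemented as re-executing forwards from an earlier point (real tools like rr replay from periodic checkpoints rather than from the start). The mini-program and names here are invented for the sketch:

```python
# Toy sketch: "reverse step" on top of deterministic replay.
# Real time-travel debuggers replay from checkpoints; this toy
# replays from the beginning, which shows the same principle.

def run(n_steps):
    """Deterministic toy execution: returns the full state history."""
    state, history = 0, []
    for i in range(n_steps):
        state = state * 2 + i      # arbitrary deterministic update
        history.append(state)
    return history

def state_at(step):
    """'Travel back' to `step` by re-executing, not by undoing."""
    return run(step)[-1]

history = run(5)
assert state_at(3) == history[2]   # state "back at" step 3 recovered
```

    The point is that nothing is ever undone: determinism makes any past state reconstructible by replaying forward, which is why recording the behaviour in the first place is the hard requirement.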

  • > What I keep struggling with is understanding how a particular state came to be — especially with concurrency or events that happened much earlier.

    Yeah, I faced this problem. I have no general solution to it, but I wonder if a fuzzer could be crossed with a debugger to get a tool that, given two states of a program, finds inputs that transition the program from state A to state B. Maybe you would need to define state A and/or B with predicates, so they would really be classes of states. Or maybe the tool could fuzz state A to see which parts of it are important for eventually transitioning to state B.
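    A minimal sketch of that idea: treat "state A" and "state B" as predicates (so they name classes of states) and randomly fuzz input sequences looking for one that drives the program between them. The program under test here is a stand-in toy, and all names (`step`, `search`, the action strings) are invented for illustration:

```python
import random

def step(state, action):
    """One transition of the toy program under test."""
    if action == "inc":
        return state + 1
    if action == "dec":
        return state - 1
    if action == "double":
        return state * 2
    return state

def search(start_state, is_a, is_b, max_tries=10000, max_len=8):
    """Fuzz random action sequences from an A-state toward a B-state."""
    assert is_a(start_state)
    rng = random.Random(0)          # fixed seed for repeatability
    actions = ["inc", "dec", "double"]
    for _ in range(max_tries):
        state, trace = start_state, []
        for _ in range(max_len):
            act = rng.choice(actions)
            state = step(state, act)
            trace.append(act)
            if is_b(state):
                return trace        # an input sequence reaching B
    return None

# Predicates define *classes* of states, as suggested above.
trace = search(start_state=1,
               is_a=lambda s: s > 0,
               is_b=lambda s: s == 12)
print(trace)
```

    A real version would fuzz the program's actual inputs under a debugger or snapshotting harness rather than a pure function, but the predicate-defined start/goal classes carry over directly.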

  • This doesn't sound like a particularly difficult problem for some scenarios.

    It's definitely convoluted when it comes to memory obtained from the stack, but for heap allocations a debugger could trace the returns of the allocator APIs, use those as the beginning of a piece of data's lifetime, then trace every access to that address and gather high-level info on each reader/writer.

    Global variables should also be fairly trivial, as you'll just need to track memory accesses to their addresses.

    (Of course, further work is required to actually apply this.)

    For variables on the stack, or in registers, though, you'll probably need heuristics that account for reuse of memory/variables, and maybe maintain a strong association with the thread this is happening in (for both the thread's allocated stack and the thread context), etc.
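    The heap-tracking part of the scheme can be sketched in a few lines, here in Python rather than a real debugger: intercept the "allocator", remember where each address was born, and log every subsequent store so the history of a value can be replayed. All names (`Tracer`, `alloc`, `write`, the site strings) are invented for the sketch:

```python
class Tracer:
    """Toy shadow of the scheme above: allocation origin + access log."""

    def __init__(self):
        self.next_addr = 0x1000
        self.history = {}           # addr -> list of (event, site, value)

    def alloc(self, site):
        """Stand-in for hooking an allocator API's return value."""
        addr = self.next_addr
        self.next_addr += 0x10
        self.history[addr] = [("alloc", site, None)]
        return addr

    def write(self, addr, value, site):
        """Stand-in for a watchpoint firing on a store to addr."""
        self.history[addr].append(("write", site, value))

    def explain(self, addr):
        """Reconstruct how the data at addr came to be."""
        return self.history[addr]

t = Tracer()
p = t.alloc(site="parse_config")
t.write(p, 42, site="parse_config")
t.write(p, -1, site="worker_thread")   # the "mystery" store
print(t.explain(p))
```

    In a real debugger the `write` hook would be a hardware watchpoint or instrumented store, and the interesting part is exactly the stack/register heuristics noted above, which this sketch sidesteps.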

  • Here's another one

    https://scrutinydebugger.com/

    It's for embedded systems though, which is where I come from. In embedded we have a concept called instruction trace, where every instruction executed on the target gets sent over to the host. The host can then reconstruct part of what's been going on in the target system. But there's usually so much data that I've always assumed a live view is impractical, and I've only used it for offline debugging. Maybe that's not a correct assumption, though. I would love to see better observability in embedded systems.
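    The offline-reconstruction half of that is easy to sketch: given a recorded stream of executed instructions, the host replays it to recover machine state at any point in time. The three-op toy "ISA" and names here are invented; real trace formats (and the volume of data) are far messier:

```python
# Toy sketch of host-side instruction-trace reconstruction:
# replay a recorded instruction stream to recover register
# state at any point, entirely offline.

def replay(trace, upto):
    """Reconstruct the register file after the first `upto` instructions."""
    regs = {"r0": 0, "r1": 0}
    for op, dst, val in trace[:upto]:
        if op == "mov":
            regs[dst] = val
        elif op == "add":
            regs[dst] += val
    return regs

trace = [("mov", "r0", 5), ("add", "r0", 3), ("mov", "r1", 7)]
print(replay(trace, 2))   # → {'r0': 8, 'r1': 0}
```

    The "so much data" problem shows up because `replay` is linear in trace length; practical tools index the stream or keep periodic snapshots so a live or random-access view doesn't have to rescan from the start.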

    • For context, I’ve been experimenting with a small open-source prototype while thinking about these ideas: https://github.com/manux81/qddd It’s very early and incomplete — mostly a way for me to explore what breaks once you actually try to model time and causality in a debugger.