Comment by gmuslera
4 hours ago
Reminded me of a note I heard about backups. You don't want backups; they are a waste of time, bandwidth, and disk space, and most if not all of them will end up being discarded without ever being used. What you really want is something to restore from if anything breaks. That is the cost that should matter to you. What if you don't have anything meaningful to restore from?
With observability, what matters is not the volume of data, or the time and bandwidth spent on it; it is being able to understand your system and properly diagnose and solve problems when they happen. Can you do that with less? For the next problem, the one you don't know about yet? If you can't, because of information you didn't collect, then maybe spending so much was not enough.
Of course, some ways of doing it are more efficient (towards the end result) than others. But having the needed information available, even if it is never used, is the real goal here.
I agree with the framing. The goal isn't less data for its own sake. The goal is understanding your systems and being able to debug when things break.
But here's the thing: most teams aren't drowning in data because they're being thorough. They're drowning because no one knows what's valuable and what's not. Health checks firing every second aren't helping anyone debug anything. Debug logs left in production aren't insurance, they're noise.
The question isn't "can you do with less?" It's "do you even know what you have?" Most teams don't. They keep everything just in case, not because they made a deliberate choice, but because they can't answer the question.
Once you can answer it, you can make real tradeoffs. Keep the stuff that matters for debugging. Cut the stuff that doesn't.
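As a concrete illustration of "cut the stuff that doesn't matter" (a hypothetical sketch using Python's stdlib logging; the endpoint names are assumptions, not from any specific system), you could drop health-check chatter at the source rather than paying to ship and store it:

```python
import logging

# Assumed health-check endpoints; these fire constantly and
# rarely help when debugging an actual incident.
NOISY_PATHS = {"/healthz", "/readyz", "/ping"}

class DropHealthChecks(logging.Filter):
    """Drop access-log records that mention a health-check path."""
    def filter(self, record: logging.LogRecord) -> bool:
        # Return True to keep the record, False to drop it.
        message = record.getMessage()
        return not any(path in message for path in NOISY_PATHS)

logger = logging.getLogger("access")
logger.addFilter(DropHealthChecks())
```

The same idea applies at the collector or pipeline level in most observability stacks; the point is that the filtering is a deliberate choice, not an accident of what happened to be logged.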
The problem is that until I hit a specific bug, I don't know which logs might be useful. For every bug I've had to fix, 99% of the logs were useless, but I've had to fix many bugs over the years and each one needed a different set of logs. Sometimes I know in the code "this can't happen, but I'll log an error just in case"; when I see those in a bug report they are often a clue, but I often need a lot of info about things that happen normally all the time to figure out how my system got into that state.
"disk getting full" isn't useful unless you understand how/why it got full and that requires logging things that might or might matter to the problem.
There is a lot of crap that is, and always will be, useless when debugging a problem. But there is also a lot that you don't know if you will need, at least not yet, not when you are defining what information to collect, and that may become essential when something in particular (usually unexpected) breaks. And then you won't have the past data you didn't collect.
You can take a discovery path: can the data you collect explain how and why the system is running the way it is now? Are there things that are just not relevant when things are normal but become relevant when they are not? Understanding the system, and all its moving parts, is a good guide for tuning what you collect, what you should not, and what the missing pieces are. And keep cycling through that: your understanding and your system will keep changing.