Comment by stackskipton

1 month ago

As an Ops (DevOps/Sysadmin/SRE-ish) person here: excellent article.

However, as always, the problem is more political than technical, and those are the hardest problems to solve. Another service with more cost IMO won't solve it. That said, there is plenty of money to be made in attempting to solve it, so go get that bag. :)

At the end of the day, it comes back to the DevOps mentality, which has never caught on at most companies. Devs don't care, Project Managers want us to stop blocking feature velocity, and we are not properly staffed since we are a "massive wasteful cost center".


binarylogic 1 month ago

100% accurate. It is very much political. I'd also add that the problem is perpetuated by a disconnection between engineers who produce the data and those who are responsible for paying for it. This is somewhat intentional and exploited by vendors.

Tero doesn't just tell you how much is waste. It breaks down exactly what's wrong, attributes it to each service, and makes it possible for teams to finally own their data quality (and cost).

One thing I'm hoping catches on: now that we can put a number on waste, it can become an SLO, just like any other metric teams are responsible for. Data quality becomes something that heals itself.
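
To make the "waste as an SLO" idea concrete, here is a minimal sketch. It assumes you already have, per service, a count of total log bytes and of bytes flagged as waste. The class, the function names, and the 20% target are hypothetical and chosen only for illustration; they are not Tero's API or a recommended threshold.

```python
from dataclasses import dataclass


@dataclass
class ServiceLogStats:
    service: str
    total_bytes: int
    wasted_bytes: int  # bytes flagged as waste (hypothetical input)


def waste_ratio(stats: ServiceLogStats) -> float:
    """Fraction of a service's log volume that was flagged as waste."""
    if stats.total_bytes == 0:
        return 0.0
    return stats.wasted_bytes / stats.total_bytes


def slo_violations(all_stats, max_waste_ratio=0.20):
    """Return the services whose waste ratio exceeds the (assumed) SLO target."""
    return [s.service for s in all_stats if waste_ratio(s) > max_waste_ratio]
```

A team could then be alerted on `slo_violations(...)` the same way it is alerted on latency or error-rate SLOs.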

stackskipton 1 month ago
I'd be shocked if you can accurately identify waste, since you are not intimately familiar with the product.
Sure, I've kicked over what I thought was waste, only to be told it's not, or "It is, but deal with it, Ops."
- binarylogic 1 month ago
  
  You're right, it's not always binary. That's why we broke it down into categories:
  https://docs.usetero.com/data-quality/logs/malformed-data
  You'd be shocked how much of the total is obviously-safe waste (redundant attributes, health checks, debug logs left in production) before you even get to the nuanced stuff.
  But think about this: if you had a service that was too expensive and you wanted to optimize the data, who would you ask? Probably the engineer who wrote the code, added the instrumentation, or whoever understands the service best. There's reasoning going on in their mind: failure scenarios, critical observability points, where the service sits in the dependency graph, what actually helps debug a 3am incident.
  That reasoning can be captured. That's what I'm most excited about with Tero. Waste is just the most fundamental way to prove it. Each time someone tells us what's waste or not, the understanding gets stronger. Over time, Tero uses that same understanding to help engineers root cause, understand their systems, and more.
  

xmprt 1 month ago

The first step to solving this is correct cost attribution. Once you have that, it's easy to go to org leads and tell them that their logs are costing them $X and that you can save them 40% by applying these suggestions. They'll be happy to accept your help at that point. But if the costs all sit with the Ops team, why would the product teams care about cost optimizations that just take development time away from them?
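
A minimal sketch of that kind of attribution report, assuming you can tag each batch of ingested logs with an owning team and already know how many GB were flagged as waste. The record shape, the function name, and the flat per-GB price are all hypothetical; real vendor pricing is usually more complicated.

```python
from collections import defaultdict


def attribute_costs(records, price_per_gb):
    """Sum log cost and potential savings per team.

    records: iterable of (team, gb_ingested, gb_flagged_as_waste) tuples.
    price_per_gb: assumed flat ingestion price.
    """
    report = defaultdict(lambda: {"cost": 0.0, "savings": 0.0})
    for team, gb_ingested, gb_waste in records:
        report[team]["cost"] += gb_ingested * price_per_gb
        report[team]["savings"] += gb_waste * price_per_gb
    return dict(report)


def savings_fraction(entry):
    """Share of a team's log bill that removing the flagged waste would save."""
    return entry["savings"] / entry["cost"] if entry["cost"] else 0.0
```

The point of splitting it per team is that the number lands on the budget of whoever produces the logs, not on Ops.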