Comment by lanstin

3 days ago

And every stage of logging, from the API to the network to the ingestion pipeline, needs to be best effort: configure a capacity and ruthlessly drop messages as needed, at all stages. Actually a nice case for UDP :)
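
A rough sketch of what that could look like in Go; the collector address, port, and queue capacity are made up:

```go
// Sketch of best-effort shipping: a fixed-capacity queue that drops on
// overflow, draining to a fire-and-forget UDP socket. The collector
// address and the capacity are made up.
package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    conn, err := net.Dial("udp", "collector.internal:5140")
    if err != nil {
        return // best effort all the way down: if we can't log, carry on
    }
    defer conn.Close()

    queue := make(chan string, 1024) // configured capacity

    // Producer side: never block the application; drop when the queue is full.
    emit := func(msg string) {
        select {
        case queue <- msg:
        default: // at capacity: ruthlessly drop
        }
    }

    // Consumer side: write and ignore errors; UDP makes no delivery promise anyway.
    go func() {
        for msg := range queue {
            conn.Write([]byte(msg))
        }
    }()

    for i := 0; i < 5; i++ {
        emit(fmt.Sprintf("%s event %d", time.Now().Format(time.RFC3339), i))
    }
    time.Sleep(100 * time.Millisecond) // brief grace period before exit
}
```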

It depends. Some cases like auditing require full fidelity. Others don’t.

Plus, if you’re offering a logging service to a customer, the customer’s expectation is that once logs are successfully ingested, your service doesn’t drop them. If you’re violating that expectation, it needs to be clearly communicated to, and assented to by, the customer.

  • 1. Those ingested logs are not logs for you; they are customer payload, which is business critical. 2. I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed. Also, the alternative to best effort / shedding excess load isn't 100% availability; it's catastrophic failure when capacity is reached.

    Auditing requires that events mostly not be lost, but most importantly that they can't be deleted by people on the host. And for the capacity side, again the design question is "what happens when incoming events exceed our current capacity - do all the collectors/relays balloon their memory and become much, much slower, effectively unresponsive, or do they immediately close incoming sockets, lower downstream timeouts, and so on?" Hopefully, the audit traffic is consistent enough that you don't get spikes and can over-provision with confidence.
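
    A minimal sketch of the second option ("close sockets, shed load") assuming a TCP collector; the port, connection cap, and read deadline are invented numbers:

    ```go
    // Sketch of a collector that sheds load at a hard cap instead of
    // ballooning memory: once all slots are taken, new sockets are
    // closed immediately. Port, cap, and deadline are made up.
    package main

    import (
        "io"
        "net"
        "time"
    )

    func main() {
        ln, err := net.Listen("tcp", ":5141")
        if err != nil {
            panic(err)
        }
        slots := make(chan struct{}, 256) // hard cap on in-flight connections

        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            select {
            case slots <- struct{}{}: // capacity available: handle the connection
                go func(c net.Conn) {
                    defer func() { c.Close(); <-slots }()
                    // Short read deadline so a slow sender can't pin a slot forever.
                    c.SetReadDeadline(time.Now().Add(5 * time.Second))
                    io.Copy(io.Discard, c) // stand-in for real ingestion work
                }(conn)
            default:
                conn.Close() // at capacity: refuse immediately rather than queue in memory
            }
        }
    }
    ```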

    • > Those ingested logs are not logs for you; they are customer payload, which is business critical

      Why does that make any difference? Keep in mind that at large enough organizations, even though the company is the same, there will often be an internal observability service team (frequently, but not always, as part of a larger platform team). At a highly functioning org, this team is run very much like an external service provider.

      > I've yet to see a Logging as a Service provider not have outages where data was lost or severely delayed.

      You should take a look at CloudWatch Logs. I'm unaware of any time in its 17-year history that it has successfully ingested logs and subsequently lost or corrupted them. (Disclaimer: I work for AWS.) Also, I didn't say anything about delays, which we often accept as a tradeoff for durability.

      > And for the capacity side, again the design question is "what happens when incoming events exceed our current capacity - do all the collectors/relays balloon their memory and become much, much slower, effectively unresponsive, or do they immediately close incoming sockets, lower downstream timeouts, and so on?"

      This is one of the many reasons why buffering outgoing logs in memory is an anti-pattern, as I noted earlier in this thread. There should always -- always -- be some sort of non-volatile storage buffer in between a sender and remote receiver. It’s not just about resilience against backpressure; it also means you won’t lose logs if your application or machine crashes. Disk is cheap. Use it.
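
      A rough sketch of that pattern in Go; the spool filename is a placeholder, and the shipper that tails it and forwards upstream is assumed rather than shown:

      ```go
      // Sketch of the disk-buffer idea: append to a local spool file and
      // fsync, so logs survive an application or machine crash; a separate
      // shipper (not shown) tails the spool and forwards it upstream.
      // The spool filename is a placeholder.
      package main

      import (
          "fmt"
          "os"
          "time"
      )

      func appendLog(path, line string) error {
          f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
          if err != nil {
              return err
          }
          defer f.Close()
          if _, err := f.WriteString(line + "\n"); err != nil {
              return err
          }
          return f.Sync() // flush to disk before returning to the caller
      }

      func main() {
          line := fmt.Sprintf("%s service started", time.Now().Format(time.RFC3339))
          if err := appendLog("app.spool.log", line); err != nil {
              // Even here, failing to spool shouldn't take the application down.
              fmt.Fprintln(os.Stderr, "spool write failed:", err)
          }
          // Forwarding to the remote receiver, with retries and backpressure,
          // is the shipper's job, so the application never blocks on the network.
      }
      ```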