Comment by Veserv
3 days ago
I do not see how logging could bottleneck you in a degraded state unless your logging is terribly inefficient. A properly designed logging system can record on the order of 100 million logs per second per core.
Are you actually contemplating handling 10 million requests per second per core that are failing?
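To be concrete about what "properly designed" means here, a minimal sketch (in Go, not any particular library) of the usual approach: write small fixed-size binary records into a preallocated per-core buffer and defer all formatting and indexing to a separate consumer. The record layout and buffer sizes below are made up for illustration, and time.Now() itself dominates this toy loop; really fast loggers read the TSC or batch timestamps instead.

    package main

    import (
        "encoding/binary"
        "fmt"
        "time"
    )

    const recordSize = 24 // timestamp + event id + one payload word

    // ringLog writes fixed-size binary records into a preallocated buffer.
    // No allocation, no string formatting, no syscalls on the hot path;
    // decoding into text happens elsewhere.
    type ringLog struct {
        buf []byte
        pos int
    }

    func newRingLog(records int) *ringLog {
        return &ringLog{buf: make([]byte, records*recordSize)}
    }

    func (r *ringLog) log(event, value uint64) {
        off := r.pos
        binary.LittleEndian.PutUint64(r.buf[off:], uint64(time.Now().UnixNano()))
        binary.LittleEndian.PutUint64(r.buf[off+8:], event)
        binary.LittleEndian.PutUint64(r.buf[off+16:], value)
        r.pos += recordSize
        if r.pos == len(r.buf) {
            r.pos = 0 // wrap; a real logger hands the full buffer to a drain thread
        }
    }

    func main() {
        const n = 50_000_000
        rl := newRingLog(1 << 20)
        start := time.Now()
        for i := 0; i < n; i++ {
            rl.log(42, uint64(i))
        }
        elapsed := time.Since(start)
        fmt.Printf("%d records in %v (%.0f million records/sec)\n",
            n, elapsed, float64(n)/elapsed.Seconds()/1e6)
    }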
Generation and publication are just the beginning (never mind that the resources an application spends on logging are no longer available to do real work). You have to consider the scalability of each component in the logging architecture from end to end: ingestion, parsing, transformation, aggregation, derivation, indexing, and storage. Each one of those needs to scale to meet demand.
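To put rough numbers on that, a back-of-envelope sketch; every figure in it (fleet size, line rate, line size, replication factor) is a hypothetical input to substitute with your own:

    package main

    import "fmt"

    func main() {
        // Hypothetical inputs -- substitute your own fleet and volumes.
        const (
            hosts             = 200
            logsPerSecPerHost = 50_000
            bytesPerLine      = 300 // average structured line
            replication       = 3   // index/storage amplification, also hypothetical
        )
        logsPerSec := float64(hosts * logsPerSecPerHost)
        ingestBytes := logsPerSec * bytesPerLine

        fmt.Printf("ingest: %.0f M lines/sec, %.1f GB/sec\n", logsPerSec/1e6, ingestBytes/1e9)
        // Parsing, transformation, aggregation, derivation, and indexing each
        // see roughly this same stream again; storage sees it amplified.
        fmt.Printf("storage writes at RF=%d: %.1f GB/sec, %.0f TB/day\n",
            replication, ingestBytes*replication/1e9, ingestBytes*replication*86400/1e12)
    }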
I already accounted for consumed resources when I said 10 million instead of 100 million: I allocated 10% of the core to logging overhead. If your service is within 10% of overload you are already in for a bad time. And frankly, what systems are you running that handle 10 million requests per second per core (100 nanoseconds per request)? Hell, what services are you deploying where you even have 10 million requests per second per core to handle?
All of those other costs are, again, trivial with proper design. You can easily handle billions of events per second on the backend with even a modest server. This is done routinely by time-traveling debuggers, which actually need to handle these data rates. So again, what are we even deploying that produces billions of events per second?
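As a rough illustration of why raw event recording is the cheap part, a toy measurement (not how any real time-travel debugger is built, and the absolute numbers depend heavily on the machine):

    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "time"
    )

    // Toy measurement: each worker appends fixed-size 16-byte "events" into its
    // own small preallocated buffer -- no locks, no I/O, no formatting. The
    // point is only the order of magnitude.
    func main() {
        workers := runtime.NumCPU()
        const perWorker = 200_000_000
        var wg sync.WaitGroup
        start := time.Now()
        for w := 0; w < workers; w++ {
            wg.Add(1)
            go func(seed uint64) {
                defer wg.Done()
                buf := make([]uint64, 1<<16) // 512 KB per worker, stays in cache
                mask := uint64(len(buf)/2 - 1)
                for i := uint64(0); i < perWorker; i++ {
                    idx := (i & mask) * 2
                    buf[idx] = i          // "event id"
                    buf[idx+1] = seed ^ i // "payload"
                }
            }(uint64(w))
        }
        wg.Wait()
        total := float64(workers) * perWorker
        secs := time.Since(start).Seconds()
        fmt.Printf("%d workers, %.0f events in %.2fs (%.2f billion events/sec)\n",
            workers, total, secs, total/secs/1e9)
    }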
In my experience working at AWS and with customers, you don't need billions of TPS to make an end-to-end logging infrastructure keel over; it takes much less than that. As a working example, you can host your own end-to-end infra (the LGTM stack is pretty easy to deploy in a Kubernetes cluster) and see what it takes to grind yours to a halt with a given set of resources and a given TPS/volume.
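If you want to run that experiment, a minimal load-generator sketch; it assumes Loki's JSON push endpoint at /loki/api/v1/push on a local instance at port 3100, and the labels, batch size, and rate are placeholders to adjust for your setup:

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    // Pushes batches of synthetic log lines at a target rate so you can watch
    // where your stack starts to fall behind (ingester queues, CPU, storage).
    func main() {
        const (
            lokiURL    = "http://localhost:3100/loki/api/v1/push" // assumed local Loki
            batchSize  = 1000
            batchesSec = 50 // ~50k lines/sec; raise it until something gives
        )
        ticker := time.NewTicker(time.Second / batchesSec)
        defer ticker.Stop()
        for range ticker.C {
            var body bytes.Buffer
            body.WriteString(`{"streams":[{"stream":{"job":"loadtest"},"values":[`)
            now := time.Now().UnixNano()
            for i := 0; i < batchSize; i++ {
                if i > 0 {
                    body.WriteByte(',')
                }
                fmt.Fprintf(&body, `["%d","synthetic log line %d with some padding to look realistic"]`,
                    now+int64(i), i)
            }
            body.WriteString(`]}]}`)
            resp, err := http.Post(lokiURL, "application/json", &body)
            if err != nil {
                fmt.Println("push failed:", err)
                continue
            }
            resp.Body.Close()
            if resp.StatusCode >= 300 {
                fmt.Println("loki returned", resp.Status) // 429s here mean you found a limit
            }
        }
    }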
Damn that's fast! I'm gonna stick my business logic in there instead.