Comment by hruk

2 years ago

The p99.9 refers to latency. However, we also run a weekly test of how quickly we recover from a catastrophic crash, which takes roughly 6 minutes: the time it takes for the autoscaling group to spin up a new host, Litestream to restore the database from S3, and the server to start up again.

Honestly, 99.9% uptime is pretty generous - we can fit in quite a few catastrophes per year and still have 99.9% uptime. In the 2 years this service has been running, we've had 100% uptime via zero-downtime deployments, anyway.
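
For a back-of-the-envelope check of that budget, here is a minimal sketch. It assumes a 365-day year and that every catastrophe costs the full 6-minute recovery with no other downtime. The function names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(uptime_target: float) -> float:
    """Minutes of downtime per year allowed by an uptime target such as 0.999."""
    return MINUTES_PER_YEAR * (1 - uptime_target)


def outages_within_budget(uptime_target: float, recovery_minutes: float) -> int:
    """Whole number of full recoveries that fit inside the yearly downtime budget."""
    return int(downtime_budget_minutes(uptime_target) // recovery_minutes)


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.999)
    print(f"Yearly downtime budget at 99.9%: {budget:.1f} minutes")
    print(f"6-minute recoveries that fit: {outages_within_budget(0.999, 6)}")
```

Under those assumptions, 99.9% leaves about 525.6 minutes (roughly 8.8 hours) of downtime per year, which is room for 87 six-minute recoveries.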

In terms of monitoring, traces and error logs are shipped to our observability solution, yes.
