
Comment by ch33zer

3 months ago

I used to work on machine repair automation at a big tech company. IMO repairs are one of the most overlooked and hardest things to deal with. When you run on AWS you don't really think about broken hardware; it mostly just repairs itself. When you do it yourself you don't have that luxury. You need spare parts, technicians to do repairs, a process for draining/undraining jobs off hosts, testing suites, hardware monitoring tools, and 1001 more things to get this right. At smaller scales you can cut corners on some of these things, but they will eventually bite you. And this is just machines! Networking gear has its own fun set of problems, and when it fails it can take down your whole rack. How much do you trust your colos not to lose power during peak load? I hope you run disaster recovery drills to prep for these situations!

Wishing all the best to this team, seems like fun!