Comment by jcgrillo

2 years ago

That's about 13.4 min/mo of downtime, every month. That seems likely to cause all kinds of havoc at scale.

Maybe we're talking about different things then.

My laptop is a SPoF in exactly the same way.

If my laptop is closed, data collection still happens, because collection and processing are different systems; only my ability to mutate the data hands-on is affected.

Thus any downtime of my laptop is not really a problem.

See also: Jupyter notebooks, Excel, etc.

I will also point out that robustness in distributed systems is not as cut and dried as it seems, for two reasons:

1: These are not considered mission-critical, hot-path systems, so they will be neglected by SRE.

2: Distributed systems are more complex, so failures are more likely until a lot of effort has been put into them.

  • Yes, I believe we are talking about different things. In my experience the Hadoop (or MapR) cluster ended up being used for a bunch of heterogeneous workloads running simultaneously at different priorities. High-priority workloads were production-impacting batch jobs where downtime would be noticed by users. Lower-priority workloads were as you describe: analysts running ad-hoc jobs to support BI, data science, operations, etc.

    HBase also ran on that infrastructure, serving real-time workloads. Downtime on any of the HBase clusters would be a high-severity outage.

    So minutes/mo of downtime would certainly have unacceptable business impact. Another important thing is replication. Drives do fail, and if a single drive failure brings down prod how long would that take to fix?

    To be clear, in general my opinions are aligned with the article: I think using the whole machine at high utilization is the only environmentally (and financially) responsible way to operate. But I don't believe purely vertical scaling is realistic for most businesses.

    EDIT: there are also security and compliance concerns that rule out the scenario of copying data onto an employee laptop. I guess what I'm trying to get at is the scenario seems a little contrived.

    • > Drives do fail, and if a single drive failure brings down prod how long would that take to fix?

      You have already failed if that's happening.

      Have we really degenerated to a level of sysadmin competence where we have forgotten what RAID even is?

We are talking about data processing, not a publicly available service. When is 13 min/month of downtime for processing of data a problem?
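For anyone who wants to sanity-check the downtime figures being thrown around, here is a rough sketch of the arithmetic. It assumes a 30-day month, and the function names are made up for illustration. Under that assumption, 13.4 min/mo works out to roughly 99.97% availability.

```python
# Rough sketch: convert between monthly downtime and availability.
# Assumes a 30-day month by default; function names are illustrative only.

MINUTES_PER_DAY = 24 * 60


def availability_from_downtime(downtime_min, days_in_month=30):
    """Return availability as a fraction (e.g. 0.9997) for a given downtime."""
    total_min = days_in_month * MINUTES_PER_DAY
    return 1.0 - downtime_min / total_min


def downtime_from_availability(availability, days_in_month=30):
    """Return allowed downtime in minutes per month for a given availability."""
    total_min = days_in_month * MINUTES_PER_DAY
    return (1.0 - availability) * total_min


if __name__ == "__main__":
    a = availability_from_downtime(13.4)
    print(f"13.4 min/mo of downtime -> {a:.3%} availability")
    for target in (0.99, 0.999, 0.9999):
        budget = downtime_from_availability(target)
        print(f"{target:.2%} availability -> {budget:.1f} min/mo of downtime")
```

Whether that budget is acceptable depends on the workload: an ad-hoc analysis job can absorb it, while a real-time serving path probably cannot.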