Comment by dijit

2 years ago

Maybe we're talking about different things then.

My laptop is a SPoF in exactly the same way.

If my laptop is closed then data collection will still happen, as collection and processing are different systems; but my ability to work with the data hands-on is interrupted.

Thus any downtime of my laptop is not really a problem.

See also: Jupyter notebooks, Excel, etc;

I will also point out that robustness in distributed systems is not as cut and dried, for two reasons:

1: These are not considered hot-path, mission-critical systems, so they will be neglected by SRE.

2: Distributed systems add complexity, so failure is more likely until a lot of effort has been invested in them.

Yes, I believe we are talking about different things. In my experience the Hadoop (or MapR) cluster ended up getting used for a bunch of heterogeneous workloads running simultaneously at different priorities. High-priority workloads were production-impacting batch jobs where downtime would be noticed by users. Lower-priority workloads were as you describe: analysts running ad-hoc jobs to support BI, data science, operations, etc.

HBase also ran on that infrastructure, serving real-time workloads. Downtime on any of the HBase clusters would be a high-severity outage.

So minutes/mo of downtime would certainly have an unacceptable business impact. Another important consideration is replication: drives do fail, and if a single drive failure brings down prod, how long would that take to fix?
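For a sense of scale, the downtime budget a given availability target allows per month is simple arithmetic (targets here are illustrative, not from the thread):

```python
# Downtime budget per 30-day month for common availability targets.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for availability in (0.999, 0.9999, 0.99999):
    budget = MINUTES_PER_MONTH * (1 - availability)
    print(f"{availability:.3%} availability -> {budget:.2f} min/month of downtime")
```

Even a modest "four nines" target leaves only about 4.3 minutes of allowable downtime per month, which is why "minutes/mo" matters here.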

To be clear, in general my opinions are aligned with the article: I think using the whole machine at high utilization is the only environmentally (and financially) responsible way. But I don't believe it's true that purely vertical scaling is realistic for most businesses.

EDIT: there are also security and compliance concerns that rule out the scenario of copying data onto an employee laptop. I guess what I'm trying to get at is that the scenario seems a little contrived.

  • > Drives do fail, and if a single drive failure brings down prod how long would that take to fix?

    You already failed if that's happening.

    Are we really at the degenerated level of sysadmin competence that we forgot even what RAID is?

    • > degenerated level of sysadmin competence that we forgot even what RAID is

      At the risk of troll-feeding, what are you hoping to accomplish with this? Of course I haven't "forgot even what RAID is", and I'm confident my competence is not "degenerated".

      In this magical world where we can fit the entire "data lake" on one box, of course we can replicate with RAID, but you've still got a SPoF. So this only works if downtime is acceptable, which I'll concede maybe it could be iff this box is somehow, magically, detached from customer-facing systems.

      But they never really are. Even assuming there aren't ever customer-impacting reads from this system, downtime in the "data lake" means all the systems which write to it have to buffer (or shed) data during the outage. Random, frequent off-nominal behavior is a recipe for disaster IME. So this magic data box can't really be detached.
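      The buffer-or-shed choice a producer faces when the sink is down can be sketched roughly like this (all names hypothetical; a real pipeline would spool to disk, not memory):

```python
import collections

class BufferingProducer:
    """Hypothetical producer that spools records while the sink is unavailable."""

    def __init__(self, sink, max_buffered=10_000):
        self.sink = sink                      # object with send(record) that may raise ConnectionError
        self.buffer = collections.deque()
        self.max_buffered = max_buffered
        self.shed = 0                         # records dropped once the buffer is full

    def produce(self, record):
        # Drain any backlog first so ordering is preserved.
        while self.buffer:
            try:
                self.sink.send(self.buffer[0])
                self.buffer.popleft()
            except ConnectionError:
                break
        try:
            if not self.buffer:               # only send directly if there is no backlog
                self.sink.send(record)
                return
        except ConnectionError:
            pass
        # Sink is down: buffer if we can, shed if we can't.
        if len(self.buffer) < self.max_buffered:
            self.buffer.append(record)
        else:
            self.shed += 1
```

      The point of the sketch is the failure mode: once the buffer fills during a long outage, the producer must drop data, which is exactly the "off-nominal behavior" described above.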

      I've only ever worked at companies which are "always on" and have multi-petabyte data sets. I guess if you can tolerate regular outages, and/or your working data set is so small that copying it around willy-nilly is acceptably cheap, go for it! I wish my life was that simple.

      1 reply →