Comment by wood_spirit

1 year ago

This is a bit of a tangent but it’s a thought experiment that I recently heard:

Data pipeline A is written and maintained by a team in a type safe language with extensive unit tests.

Data pipeline B was written long ago, in SQL, in a day, by a scientist who has since left.

Both compute the same dataset, but B gets the answer correct.

Which is the better pipeline, and why?

Without more context or additional assumptions there's no good answer to that.

If you just care about today, then clearly B is better because it provides the correct result today.

Also, just because A is written in a type-safe language and has extensive unit tests doesn't in itself mean it's any less complex and undecipherable than B.

I can think of several takes, with different assumptions, leading to very different perspectives.

One take could be this:

Let's assume A has been written by a competent team, using good practices. Let's also assume the problem of incorrect answers in A has been known for some time and has been investigated a fair bit. That is, it's not just a trivial bug that's not been caught yet.

Since A doesn't work, one could reasonably assume B is complex and difficult to understand; otherwise A's team should have been able to find their error by studying the SQL in B. If B were simple and they still couldn't find it, that would indicate A's team is not competent, which goes against our previous assumption.

Given that, one could reasonably assume changing B will be very difficult.

Thus if one cares about maintaining and evolving the pipeline due to changing demands over many years, then it's likely A is better, as the bug in A producing the wrong answer should be fixable by a competent team.

Again, just one take of many possible...

An alternate, more trivial take could be that team A were given an incorrect specification. So while they implemented the specification correctly, B actually implements something slightly different.

We see this one with customers all the time: they think the old system does X, but it does in fact do something slightly different, so when we implement the new system as requested, the customer files a bug report because it doesn't do what the old system actually did.
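One way to catch that kind of mismatch before the customer does is to run both systems on the same inputs and list where they disagree. Here's a rough sketch in Python. Everything in it is made up for illustration: `old_system` stands in for the legacy behaviour (it truncates a 15% fee), `new_system` stands in for the rewrite built to a written spec that says "round to the nearest cent", and `find_divergences` is just a hypothetical helper name.

```python
from typing import Callable, Iterable, List, Tuple


def old_system(amount_cents: int) -> int:
    # Hypothetical legacy behaviour: 15% fee, truncated to whole cents.
    return amount_cents * 15 // 100


def new_system(amount_cents: int) -> int:
    # Hypothetical rewrite following the written spec: 15% fee,
    # rounded half-up to the nearest cent (integer arithmetic only).
    return (amount_cents * 15 + 50) // 100


def find_divergences(
    old: Callable[[int], int],
    new: Callable[[int], int],
    inputs: Iterable[int],
) -> List[Tuple[int, int, int]]:
    # Collect (input, old_output, new_output) wherever the two disagree.
    return [(x, old(x), new(x)) for x in inputs if old(x) != new(x)]


for x, old_out, new_out in find_divergences(old_system, new_system, range(10)):
    print(f"input={x}: old={old_out}, new={new_out}")
```

Note this only tells you the two differ, not which one is right. That still takes someone who knows what the answer is supposed to be.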

And how do you know B's answer is correct?

  • In the world of science you get to compare software's predictions against reality. It's a weird concept but it grows on you.

    I've seen systems with this structure. Part of the fun is B's code is likely to have a lot of errors in it that cancel each other out in exciting ways when running on the domain of interest, which makes using it to work out why A's code is failing to correspond to reality much harder than it could be.
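A toy illustration of errors cancelling on the domain of interest, in Python. This is entirely hypothetical and not taken from any real pipeline: `reference_change` is what the calculation is assumed to be meant to do, and `legacy_change` has two sign errors that cancel whenever the baseline is positive.

```python
def reference_change(before: float, after: float) -> float:
    # Assumed intended behaviour: change relative to the size of the baseline.
    return (after - before) / abs(before)


def legacy_change(before: float, after: float) -> float:
    # Error 1: the numerator is reversed (before - after).
    # Error 2: the denominator is -before instead of abs(before).
    # For before > 0 the two sign errors cancel; for before < 0 they don't.
    return (before - after) / (-before)


# On the domain of interest (positive baselines) the two agree.
print(legacy_change(4.0, 5.0), reference_change(4.0, 5.0))

# Outside that domain the sign flips.
print(legacy_change(-4.0, -5.0), reference_change(-4.0, -5.0))
```

Fix either error on its own and the legacy version gets worse on the data people actually care about, which is part of why nobody wants to touch it.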