Comment by few

2 months ago

I felt like one or two decades ago, all the rage was about rewriting programs into just two primitives: map and reduce.

For example filter can be expressed as:

  is_even = lambda x: x % 2 == 0
  mapped = map(lambda x: [x] if is_even(x) else [], data)
  filtered = reduce(lambda x, y: x + y, mapped, [])

But then the world moved on from it because it was too rigid

MapReduce is nice but it doesn't, by itself, help you reason about pushdowns for one. Parquet, for example, can pushdown select/project/filter, and that's lost if you have MapReduce. And a reduce is just a shuffle + map, not very different from a distributed join. MapReduce as an escape hatch over what is fundamentally still relational algebra may be a good intuition.

There might have been some misunderstanding there.

The point of map/reduce was that it could easily be parallelized across large numbers of machines, for processing very large amounts of data. Hadoop implemented the first open-source example of this.

The limitations on what it could do were well-known from the start. No-one who knew what they were doing proposed that programs should be rewritten that way unless you were processing enough data to need to run them distributed on a cluster, in which case that was often your best option.

Many of the limitations of pure map/reduce were overcome by adding steps to the basic map/reduce parallel pipelines. Apache Spark is one example. It still has map and reduce operations in its pipeline, but it has several other operations as well. Nothing better than map and reduce has been found for the purpose it serves in such pipelines.

Reductions are painful because they specify a sequence of ordered operations. Runtime is O(N), where N is the sequence length, regardless of amount of hardware. So you want to work at a higher level where you can exploit commutativity and independence of some (or even most) operations.

  • You can reduce in parallel. That was the whole point of MapReduce. For example, the sum abcdefgh can be found by first ab, cd, ef, gh; then those results (ab)(cd), (ef)(gh); then the final result by (abcd)(efgh). That's just three steps to compute seven sums.

    • No, you can not. Your example is correct only if addition is associative. And it is not always associative. Hence the need for higher abstractions, where you model commutativity and associativity of certain operations.

  • You're right it's primarily a runtime + compiler + language issue. I really don't understand why people tried to force functional programming in environments without decent algebraic reasoning mechanisms.

    Modern graph reducers have inherent confluence and aren't reliant on explicit commutation. They can do everything parallel and out of order (until they have to talk to some extrinsic thing like getting input or spitting out output), including arbitrary side-effectual mutation. We really live in the future.

Performance aside it seems you could do most maybe a the ops with those three. I say three because your sneaky plus is a union operation. So map, reduce and union.

But you are also allowing arbitrary code expressions. So it is less lego-like.