Comment by ragall

3 hours ago

Containers are too low-level. What we need is a high-level batch-job DSL: you specify the inputs, the computation graph to run over them, and some upper limits on the resources to use, and a scheduler evaluates the data size and decides how to scale it. In many cases that means running everything on a single node, but either way data devs shouldn't be tasked with making things run in parallel: the vast majority aren't any good at it and end up making very bad choices.
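To make that concrete, here's a rough sketch of what such a pipeline looks like in Apache Beam, which follows the same declarative model (the bucket paths are just placeholders, not a real setup). The dev only declares the graph; the runner picks the execution strategy and the parallelism:

    import apache_beam as beam

    # Declare the computation graph; the runner decides how to execute it
    # (single process, multiple threads, or a whole cluster).
    with beam.Pipeline() as p:
        (p
         | "Read"   >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
         | "Split"  >> beam.FlatMap(lambda line: line.split())
         | "Pair"   >> beam.Map(lambda word: (word, 1))
         | "Count"  >> beam.CombinePerKey(sum)
         | "Format" >> beam.MapTuple(lambda word, n: f"{word}\t{n}")
         | "Write"  >> beam.io.WriteToText("gs://my-bucket/output/counts"))

Nothing in there says how many workers to use or how to partition the data; that's exactly the point.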

And by the way, what I just described is a framework Google has internally, called Flume. 10+ years ago they had already noticed that most devs couldn't use MapReduce effectively, because tuning the parallelism was beyond most people's abilities, so they built something much more high-level on top of it. Hadoop is still a MapReduce clone, and thus destined to fail on usability.