Comment by iskander

11 years ago

>Spark takes a Scala AST tree and computes a plan on it in order to distribute it to a cluster of possibly thousands of nodes

Unless Spark has changed dramatically in the year since I used it, that's not really how it works. You lazily construct an expression graph of RDD operators, but the actual Scala code (the functional payload of e.g. flatMap) doesn't get analyzed. Are you talking specifically about Spark SQL's code generator?
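For context, a minimal sketch of what that lazy graph construction looks like (assuming an existing SparkContext named sc and a made-up input path); the closures handed to flatMap/map are just serialized and shipped to executors, never inspected or rewritten:

    // sc is an already-constructed SparkContext; the path is hypothetical.
    val lines = sc.textFile("hdfs:///some/path")        // records a source RDD, reads nothing yet
    val words = lines.flatMap(line => line.split(" "))  // adds a node to the lineage graph
    val pairs = words.map(word => (word, 1))            // still no execution, no closure analysis

    // Only an action forces Spark to walk the lineage graph, split it into
    // stages, and ship tasks (carrying the opaque closures) to executors.
    val counts = pairs.reduceByKey(_ + _).collect()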

>The cluster compute benchmarks on it are crazy good.

...and also carefully chosen to show off the niche cases where Spark shines. More commonly, you'll encounter crippling garbage collector stalls leading to mysterious and undebuggable job failures.

Spark has a very clean API, but the implementation (at least for Spark versions 0.8-1.0) is still prototype quality.

So basically you're talking about the design and limitations of https://databricks.com/blog/2015/04/13/deep-dive-into-spark-..., I take it?

I thought the implication of this work was that it could be extended to other languages and DSLs. However, with Spark's SQL generator still heavily JVM-dependent and this optimizer emitting bytecode, it remains pretty specific to JVM languages. That probably leaves out Haskell and Erlang/Elixir in the short term, which is where I'd expect to see a different perspective on the whole data analytics front. We have Datomic, sure (and I guess Clojure is enough for a lot of folks), but it'd be nice to see motivations other than "we want to make Hadoop... BUT BETTER".