
Comment by narrator

11 years ago

Scala is getting really interesting lately though with projects like Spark and SparkSQL. Spark takes a Scala AST tree and computes a plan on it in order to distribute it to a cluster of possibly thousands of nodes and then uses transformations on that plan, similar to how an SQL query optimizer works, to make the plan more efficient. The cluster compute benchmarks on it are crazy good.
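The plan-rewriting idea above (transform an operator tree to make it cheaper, the way an SQL optimizer does) can be sketched roughly like this. All names here are invented for illustration; this is not Spark's actual planner:

```python
# Toy logical plan: a tree of operators rewritten by rules before
# execution -- loosely analogous to how an SQL query optimizer (or
# Spark's planner) improves a plan. Names invented for illustration.
from dataclasses import dataclass


@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    predicate: str
    child: object


def fuse_filters(plan):
    """Rule: collapse Filter(Filter(x)) into one conjunctive Filter."""
    if isinstance(plan, Filter):
        child = fuse_filters(plan.child)
        if isinstance(child, Filter):
            return Filter(f"({plan.predicate}) AND ({child.predicate})",
                          child.child)
        return Filter(plan.predicate, child)
    return plan


plan = Filter("age > 30", Filter("country = 'US'", Scan("users")))
optimized = fuse_filters(plan)
print(optimized)  # one Filter over the Scan, predicates combined
```

A real optimizer applies many such rules (predicate pushdown, projection pruning, join reordering) repeatedly until the plan stops changing.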

The only problem I have with Scala is that it has so many features that it tends to have a steeper learning curve than most languages. This is mitigated somewhat by the fact that it lets you switch between object-oriented and functional styles depending on which is more convenient or performant, which makes it easier to pick up for programmers who are not highly skilled in functional programming.

Yeah, I did a presentation at the DC Area Apache Spark meetup.
Slides: http://www.slideshare.net/RichardSeymour3/2015-0224-washingt...
Associated blog post: https://www.endgame.com/blog/streaming-data-processing-pyspa...

I've done a bit of Scala Spark as well, and my initial approach was to prototype in PySpark and then rewrite in Scala if necessary. Just this week Databricks announced they are working on changing some of the data structures behind RDDs to avoid unnecessary Java object creation overhead: https://databricks.com/blog/2015/04/28/project-tungsten-brin...
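The per-record object overhead Tungsten targets can be illustrated in spirit (Tungsten itself uses off-heap binary formats on the JVM; this is just an analogy in Python) by comparing one heap object per row with a packed buffer:

```python
# Spirit-only illustration of avoiding per-record object overhead:
# one heap object per row vs. one packed binary buffer for the whole
# column. Project Tungsten does the real thing with off-heap memory
# on the JVM; nothing here is Spark code.
import array
import sys


class Row:
    def __init__(self, value):
        self.value = value


n = 10_000
boxed = [Row(i) for i in range(n)]       # one Python object per record
packed = array.array("q", range(n))      # 8 bytes per record, one buffer

boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(r) for r in boxed)
packed_bytes = sys.getsizeof(packed)
print(boxed_bytes, ">", packed_bytes)    # packed layout is much smaller
```

Beyond raw size, the packed layout also gives the garbage collector ten thousand fewer objects to trace, which is exactly the GC pressure the Tungsten post is about.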

That and the SQL compiler thing seem pretty darn awesome. Spark has the nice benefit of being plug-and-play (with a joyful time of compiling and deploying) with legacy HDFS/Hadoop systems. That alone will keep it in toolboxes for a long time to come.

>Spark takes a Scala AST tree and computes a plan on it in order to distribute it to a cluster of possibly thousands of nodes

Unless Spark has changed dramatically in the year since I used it, that's not really how it works. You lazily construct an expression graph of RDD operators but the actual Scala code (the functional payload of e.g. flatMap) doesn't get analyzed. Are you talking specifically about Spark SQL's code generator?
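The lazy construction described here is easy to sketch: transformations only append nodes to a lineage graph, and nothing executes until an action forces evaluation. A toy model, not Spark's actual API:

```python
# Toy model of lazy RDD-style evaluation: map/filter only record a
# node in the lineage graph; work happens when an action such as
# collect() is called. Invented for illustration -- not the Spark API.
class LazyRDD:
    def __init__(self, source, op=None, parent=None):
        self.source, self.op, self.parent = source, op, parent

    def map(self, f):
        return LazyRDD(None, ("map", f), self)

    def filter(self, f):
        return LazyRDD(None, ("filter", f), self)

    def collect(self):  # the action: walk the lineage and execute it
        if self.parent is None:
            return list(self.source)
        data = self.parent.collect()
        kind, f = self.op
        if kind == "map":
            return [f(x) for x in data]
        return [x for x in data if f(x)]


rdd = LazyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; `rdd` is just a three-node lineage graph.
print(rdd.collect())  # → [0, 4, 16, 36, 64]
```

Note that the lambdas are carried around as opaque payloads, which is the commenter's point: the operator graph is analyzed, but the Scala (or here, Python) code inside each operator is not.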

>The cluster compute benchmarks on it are crazy good.

...and also carefully chosen to show off the niche cases where Spark shines. More commonly, you'll encounter crippling garbage collector stalls leading to mysterious and undebuggable job failures.

Spark has a very clean API but the implementation (at least for Spark versions 0.8-1.0) is still prototype quality.

  • So basically you're talking about the design and limitations of https://databricks.com/blog/2015/04/13/deep-dive-into-spark-... I take it?

    The implication of this work, I thought, is that it could be further expanded to other languages and DSLs. However, Spark's SQL code generator is still very JVM-dependent, and the fact that the optimizer emits JVM bytecode makes it pretty specific to JVM-supporting languages. That would probably leave out Haskell and Erlang / Elixir in the short term, which is where I'd expect to see a different perspective on the whole data analytics front. We have Datomic, sure (and I guess Clojure is enough for a lot of folks), but it'd be nice to have motivations other than "we want to make Hadoop... BUT BETTER".
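The runtime-coupling point can be made concrete: once the optimizer compiles expressions down to code for one runtime, every frontend language effectively has to target that runtime. A toy version of expression codegen, emitting Python source instead of JVM bytecode (all names invented):

```python
# Toy expression code generator: turn an expression tree into
# runtime-native code (Python source compiled with eval, standing in
# for JVM bytecode) instead of interpreting the tree node by node.
# Emitting runtime-specific code is what ties such an optimizer to
# one platform. Names are invented for illustration.
def gen(expr):
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    left, right = gen(expr[1]), gen(expr[2])
    op = {"add": "+", "mul": "*"}[kind]
    return f"({left} {op} {right})"


# (price * quantity) + 5, as an expression tree
tree = ("add", ("mul", ("col", "price"), ("col", "quantity")), ("lit", 5))
source = gen(tree)
compiled = eval(f"lambda row: {source}")
print(compiled({"price": 3, "quantity": 4}))  # → 17
```

A Haskell or Erlang frontend would need either its own backend for this generator or a bridge into the JVM, which is the portability problem described above.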

Would there be a F# equivalent to that Scala/Spark thing?

  • So I googled and saw this.

    One could say they're... looking into it...

    https://careers.microsoft.com/jobdetails.aspx?jid=170944&pp=...

    • extract:

      > Inspired by Spark, we are building OneNet, a distributed functional programming platform based on F# that allows programmer to build distributed system at >3x productivity. OneNet offer similar programming model and extensibility of Spark, but goes much beyond in distributed functional programming to offer additional capability and performance (e.g., in-memory sharing of instantiated object with concurrent operation, use of both managed code & unmanaged code in functional programming, use of model and/or key value store, multi cluster with EDGE + cloud, private + public cloud compute, use of GPGPU/FPGA in cluster)