Comment by thecleaner

2 years ago

Sure, but this is only single-node performance. That makes it not very useful IMO, since quite a few data science folks work with Hadoop clusters, Snowflake clusters, or Databricks, where data is distributed and querying is handled by Spark executors.

3 comments

chaxor 2 years ago

The comparison is to pandas, so single-node performance is understood to be the scope. This is for people running small tasks that may only take a couple of days on a single node with a 32-core CPU or something, not tasks that take 3 months using thousands of cores. My understanding for the latter is that pyspark is a decent option, while ballista is the better option to look forward to. Perhaps using bastion-rs as a backend can be useful for an upcoming system as well. Databricks et al. are cloud trash IMO, as is anything that isn't meant to be run on a local single-node system and a local HPC cluster with zero code change and a single line of config change.

For most of my jobs I ended up being able to avoid HPC by simply being smarter and discovering better algorithms to process the information. Still, I recall liking pyspark decently, but preferring ballista for its simplicity: installing Rust is simpler than managing Java and JVM junk. The constant problems caused by anything using a JVM backend, and the environment config that came with it, were terrible to deal with on a new system every time I ran a new program.

In this regard, ballista is an enormous improvement. Anything that is a one-line install via pip on any new system, runs local-first without any cloud or telemetry, and requires no change in code to run on a laptop vs HPC is the only option worth even beginning to look into and use.

Kalanos 2 years ago

Hadoop hasn't been relevant for a long time, which is telling.

Unless I had thousands of files to work with, I would be loath to use cluster computing. There's so much overhead, cost, waiting for nodes to spin up, and cloud architecture nonsense.

My "single node" computer is a refurbished tower server with 256GB RAM and 50 threads.

Most of these distributed computing solutions arose before data processing tools started taking multi-threading seriously.

markhahn 2 years ago

understood: big facilities get shared; sharing requires arbitration and queueing.
an interesting angle on 50 threads and 256G: your data is probably pretty cool (cache-friendly). if your threads are merely HT, that's only 25 real cores, and might be only a single socket. implying probably <100 GB/s memory bandwidth. so a best-case touch-all-memory operation would take several seconds. for non-sequential patterns, effective rates would be much lower, and keep cores even less busy.
so cache-friendliness is really the determining feature in this context. I wonder how much these packages are oriented towards cache tuning. it affects basic strategy, such as how filtering is implemented in an expression graph...
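
a rough sketch of the arithmetic above, in case it helps. the 256 GB and 100 GB/s figures are the ones from this thread (the bandwidth is a guessed upper bound, not a measurement), and the 10 GB/s "effective" rate for non-sequential access is purely an illustrative assumption; `touch_all_seconds` is a made-up helper name.

```python
def touch_all_seconds(memory_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to stream through `memory_gb` of data once."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return memory_gb / bandwidth_gb_per_s


MEMORY_GB = 256            # RAM size quoted in the thread
PEAK_BANDWIDTH_GB_S = 100  # guessed upper bound for a single socket
RANDOM_BANDWIDTH_GB_S = 10 # assumption: effective rate for non-sequential access

best_case = touch_all_seconds(MEMORY_GB, PEAK_BANDWIDTH_GB_S)
random_case = touch_all_seconds(MEMORY_GB, RANDOM_BANDWIDTH_GB_S)

print(f"sequential, best case: {best_case:.2f} s")
print(f"non-sequential (assumed rate): {random_case:.2f} s")
```

the point is only that one full pass over memory costs seconds even in the best case, so anything that stays in cache avoids paying that repeatedly.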