Comment by chaxor

5 days ago

The comparison is to pandas, so single-node performance is understood to be the scope here. This is for people running small tasks that may only take a couple of days on a single node with a 32-core CPU or so, not tasks that take 3 months across thousands of cores. For the latter, my understanding is that PySpark is a decent option, while Ballista is the more promising one to watch. Perhaps using bastion-rs as a backend could be useful for an upcoming system as well. Databricks et al. are cloud trash IMO, as is anything that can't run on both a local single-node system and a local HPC cluster with zero code changes and a one-line config change.

While for most of my jobs I ended up avoiding HPC entirely by being smarter and finding better algorithms to process the data, I recall liking PySpark well enough, but preferring Ballista for its simplicity: installing Rust is far easier than managing Java and JVM junk. The constant problems caused by anything with a JVM backend, and the environment configuration that comes with it, made setup on every new system a pain each time I ran a new program.

In this regard, Ballista is an enormous improvement. Anything that installs in one line via pip on any new system, runs local-first without any cloud dependency or telemetry, and requires no code changes to go from a laptop to an HPC cluster is the only kind of option worth even beginning to look into and use (see the sketch below).
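
To make the "one line of config" point concrete, here is a minimal sketch of what that workflow could look like. It assumes a Python client shaped like Ballista's documented Rust `BallistaContext` (connect to a scheduler by host/port, then run SQL over registered files); the `ballista` import and the `BALLISTA_SCHEDULER` variable are illustrative assumptions, not the exact published bindings. The local branch uses DataFusion's actual Python API.

```python
# Minimal sketch: the same analysis code runs locally or on a cluster.
# The cluster branch assumes a hypothetical Ballista-style Python client.
import os

SCHEDULER = os.environ.get("BALLISTA_SCHEDULER")  # e.g. "hpc-head:50050"

if SCHEDULER:
    # The single config change: point at the cluster's scheduler.
    from ballista import BallistaContext  # hypothetical import, for illustration
    host, port = SCHEDULER.rsplit(":", 1)
    ctx = BallistaContext(host, int(port))
else:
    # Local-first default: in-process DataFusion, no cluster, no telemetry.
    from datafusion import SessionContext
    ctx = SessionContext()

# The analysis itself is identical either way.
ctx.register_csv("events", "events.csv")
df = ctx.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")
df.show()
```

In this sketch the entire laptop-vs-HPC difference is `BALLISTA_SCHEDULER=hpc-head:50050 python job.py` versus plain `python job.py`, which is exactly the property I'm arguing for.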