Comment by fodkodrasz
6 hours ago
So DuckDB was developed to finally allow queries over biggish data without needing a cluster, to simplify data analysis... and now we put it on a cluster?
I think there are already solutions for that scale of data, and simplicity is DuckDB's best feature (at least for me).
> "So DuckDB was developed to allow queries for bigish data finally without the need for a cluster to simplify data analysis... and we now put it to a cluster?"
This is a fair point, but I think there's a middle ground. DuckDB handles surprisingly large datasets on a single machine, but "surprisingly large" still has limits. If you're querying 10TB of parquet files across S3, even DuckDB needs help.
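For concreteness, this is the single-machine pattern in question: DuckDB's httpfs extension lets one process scan Parquet straight out of S3. A minimal sketch (the bucket path and region are made up, and credential setup is omitted):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3/HTTP support
con.execute("SET s3_region = 'us-east-1';")  # hypothetical region

# One process scans the whole glob. DuckDB parallelizes across local cores,
# but at multi-TB scale one machine's network and memory still bound you.
rows = con.execute("""
    SELECT event_date, count(*) AS n
    FROM read_parquet('s3://my-bucket/events/*.parquet')  -- hypothetical bucket
    GROUP BY event_date
    ORDER BY event_date
""").fetchall()
```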
The question is whether Ray is the right distributed layer for this. Curious what the alternative would be—Spark feels like overkill, but rolling your own coordination is painful.
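Not endorsing it as the answer, but the Ray version is basically scatter-gather: each task runs its own in-process DuckDB over one partition and ships back a partial aggregate. A minimal sketch, assuming an invented partitioned bucket layout and an aggregate (a count) that can be merged by summing:

```python
import duckdb
import pyarrow as pa
import ray

ray.init()  # or ray.init(address="auto") to join an existing cluster

@ray.remote
def partial_count(path: str) -> pa.Table:
    # Each worker gets its own in-process DuckDB; no shared state to coordinate.
    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    return con.execute(
        f"SELECT event_date, count(*) AS n FROM read_parquet('{path}') GROUP BY event_date"
    ).fetch_arrow_table()

# Hypothetical partition layout.
paths = [f"s3://my-bucket/events/part-{i:05d}.parquet" for i in range(1024)]
partials = ray.get([partial_count.remote(p) for p in paths])

# Fan-in: merge the partial counts with one more local DuckDB query.
# DuckDB's Python API can scan the in-scope Arrow table `merged` by name.
merged = pa.concat_tables(partials)
final = duckdb.connect().execute(
    "SELECT event_date, sum(n) AS n FROM merged GROUP BY event_date ORDER BY event_date"
).fetchall()
```

The painful part is everything this sketch ignores: retries, skewed partitions, memory spilling, and aggregates that don't decompose into a sum. That coordination work is presumably what Ray is being asked to carry here.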
Big fan of this pushback, because a lot of projects have that over-engineered-on-the-wrong-base smell (especially with vibecoding now). Though there are use cases where people have lots of medium-sized data divided up. For compliance, I have a lot of reporting data split such that DuckDB instances running in separate processes work amazingly well for us, especially with lower complexity than other compute engines in that environment. If I moved everything into somewhere a ClickHouse/Trino/Databricks/etc. would work well, the compliance complexity skyrockets: we'd need perfect configs and tons of extra time invested to get the same devex.
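To picture the setup: something like one database file per compliance domain, each opened read-only by its own process (paths and schema here are invented, not our actual layout):

```python
import duckdb
from concurrent.futures import ProcessPoolExecutor

# Invented layout: one DuckDB file per compliance domain, so a process
# handling one domain can never touch another domain's data.
DOMAINS = {
    "emea": "/data/emea/reporting.duckdb",
    "apac": "/data/apac/reporting.duckdb",
    "amer": "/data/amer/reporting.duckdb",
}

def run_report(db_path: str):
    # read_only keeps the reporting job from ever mutating regulated data.
    con = duckdb.connect(db_path, read_only=True)
    return con.execute(
        "SELECT report_id, sum(amount) AS total FROM transactions GROUP BY report_id"
    ).fetchall()

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = dict(zip(DOMAINS, pool.map(run_report, DOMAINS.values())))
```

The isolation boundary is just OS processes plus file permissions, which is a much shorter audit story than getting a shared cluster's ACLs exactly right.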