Comment by jitl
10 hours ago
On the other hand, now we have duckdb for all the “small big data”, and a slew of 10-100x faster than Java equivalent stuff in the data x rust ecosystem, like DataFusion, Feldera, ByteWax, RisingWave, Materialize etc
10 hours ago
On the other hand, now we have duckdb for all the “small big data”, and a slew of 10-100x faster than Java equivalent stuff in the data x rust ecosystem, like DataFusion, Feldera, ByteWax, RisingWave, Materialize etc
The point of the article is those don’t actually work that well.
I guarantee those rust projects have spent more time playing with rust and library design than the domain problem they are trying to solve.
None of the systems I mentioned existed at the time the article was published. I think the author would love duckdb which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats. It fits in great with other Unix CLI stuff.
Many of the projects I mentioned you could see as a response to OP and the 2015 “Scalability, but at what COST?” paper which benchmarked distributed systems to see how many cores they need to beat a single thread. (https://news.ycombinator.com/item?id=26925449)
> None of the systems I mentioned existed at the time the article was published
So Hadoop was doing distributed compute wrong but now they have it figured out?
The point is that there is enormous overhead and complexity in going it in any kind of system. And your computer has a lot of power you probably aren’t maxing out.
> which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats.
Do you know about SQLite?
1 reply →
I call BS on those Rust 10-100x claims. Rust and Java are roughly equal in performance. It is just that there are a lot of old NoSQL frameworks in Java which are trash. I also checked out those companies, some of which are doing interesting stuff. None claim things are 100x faster because of Rust. You just hurt your credibility when you say such clearly false things. That's how you end up with a Hadoop cluster which is 236x slower than a batch script.
PS None of the companies you linked seem to be using a datapath architecture which is the key to the highest level of performance
It wasn’t my intention to say “this stuff is 100x faster because rust”. DuckDB is C++. My intention was to draw distinction between the Java/Hadoop era of cluster and data systems, and the 2020s era of cluster and data systems, much of which has designs informed by stuff like this article / “Scalability but at what COST?”. I guess instead of “faster” I should say “more efficient”.