Comment by jitl

1 month ago

On the other hand, now we have duckdb for all the “small big data”, and a slew of 10-100x faster than Java equivalent stuff in the data x rust ecosystem, like DataFusion, Feldera, ByteWax, RisingWave, Materialize etc

6 comments

jitl

groundzeros2015 1 month ago

The point of the article is those don’t actually work that well.

I guarantee those rust projects have spent more time playing with rust and library design than the domain problem they are trying to solve.

jitl 1 month ago
None of the systems I mentioned existed at the time the article was published. I think the author would love duckdb which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats. It fits in great with other Unix CLI stuff.
Many of the projects I mentioned you could see as a response to OP and the 2015 “Scalability, but at what COST?” paper which benchmarked distributed systems to see how many cores they need to beat a single thread. (https://news.ycombinator.com/item?id=26925449)
- groundzeros2015 1 month ago
  
  > None of the systems I mentioned existed at the time the article was published
  So Hadoop was doing distributed compute wrong but now they have it figured out?
  The point is that there is enormous overhead and complexity in going it in any kind of system. And your computer has a lot of power you probably aren’t maxing out.
  > which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats.
  Do you know about SQLite?
  
  1 reply →

hunterpayne 1 month ago

I call BS on those Rust 10-100x claims. Rust and Java are roughly equal in performance. It is just that there are a lot of old NoSQL frameworks in Java which are trash. I also checked out those companies, some of which are doing interesting stuff. None claim things are 100x faster because of Rust. You just hurt your credibility when you say such clearly false things. That's how you end up with a Hadoop cluster which is 236x slower than a batch script.

PS None of the companies you linked seem to be using a datapath architecture which is the key to the highest level of performance

jitl 1 month ago

It wasn’t my intention to say “this stuff is 100x faster because rust”. DuckDB is C++. My intention was to draw distinction between the Java/Hadoop era of cluster and data systems, and the 2020s era of cluster and data systems, much of which has designs informed by stuff like this article / “Scalability but at what COST?”. I guess instead of “faster” I should say “more efficient”.
For example, the Kafka ecosystem tends to use Avro as the data transfer serialization, which needs a copy/deserialization step before it can be used in application logic. Newer stream systems like Timely tend to use zero-copy capable data transfer formats (timely’s is called Abomination) but it’s the same idea in CapnProto or Flatbuffers - it’s infinity faster to not copy the data as you decode! In my experience this kind of approach is more accessible in systems languages like C++ or Rust, and harder to do in GC languages where the default approach to memory layout and memory management is “don’t worry about it.”