
Comment by jitl

7 hours ago

None of the systems I mentioned existed at the time the article was published. I think the author would love duckdb which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats. It fits in great with other Unix CLI stuff.
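For example, here's roughly what a quick aggregation looks like from the Python API (just a sketch; the file and column names are made up):

    import duckdb  # pip install duckdb

    # Scan a CSV in place (no import step) and aggregate it.
    # 'sales.csv' and its columns are hypothetical.
    duckdb.sql("""
        SELECT category, count(*) AS n, avg(price) AS avg_price
        FROM read_csv_auto('sales.csv')
        GROUP BY category
        ORDER BY n DESC
    """).show()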

Many of the projects I mentioned you could see as a response to OP and the 2015 “Scalability, but at what COST?” paper which benchmarked distributed systems to see how many cores they need to beat a single thread. (https://news.ycombinator.com/item?id=26925449)

> None of the systems I mentioned existed at the time the article was published

So Hadoop was doing distributed compute wrong but now they have it figured out?

The point is that there is enormous overhead and complexity in doing it in any kind of distributed system. And your computer has a lot of power you probably aren’t maxing out.

> which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats.

Do you know about SQLite?

  • Yeah I’m a big fan of SQLite :). But for analytical workloads like aggregating every row, DuckDB will outperform SQLite by a wide margin. SQLite is great stuff but it’s not a very good data Swiss Army knife because it’s very focused on a single core competency: embeddable OLTP with a simple codebase. DuckDB can read/write many more formats from local disk or via a variety of network protocols. DuckDB also embeds SQLite so you can use it with SQLite DBs as inputs or outputs.
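    Rough idea of what I mean, via the Python API (the SQLite file and table names here are made up):

        import duckdb  # pip install duckdb

        con = duckdb.connect()
        # Load the SQLite extension and attach an existing SQLite database file.
        con.execute("INSTALL sqlite;")
        con.execute("LOAD sqlite;")
        con.execute("ATTACH 'app.db' AS app (TYPE sqlite);")
        # Run an analytical query over the SQLite table and write the result to Parquet.
        con.execute("""
            COPY (SELECT customer_id, sum(total) AS lifetime_value
                  FROM app.orders
                  GROUP BY customer_id)
            TO 'lifetime_value.parquet' (FORMAT parquet);
        """)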

    > they were doing distributed compute wrong but now they have it figured out?

    Like anything, the future is here but it’s unevenly distributed. Frank McSherry, the first author of “Scalability, but at what COST?”, wrote Timely Dataflow as his answer to that question. ByteWax is built on Timely, as is Materialize. Stuff is still complex, but these more modern systems with performance as their goal are orders of magnitude better than the Hadoop-era Java stuff.