Comment by riku_iki

2 days ago

It's not clear why someone would need to give up the native DuckDB format if it is much faster.

Because it means you need to keep another copy of your data in a special format just for DuckDB. The point of Parquet is that it’s an open format queryable by multiple tools. You don’t need to wait to load every table into a new format, you don’t need to retain multiple copies, and you don’t need to keep them in sync.

If DuckDB is the only query engine in your analytics stack, then it makes sense to use its specialized format. But that’s not the typical Lakehouse use case.

  • > But that’s not the typical Lakehouse use case.

    That benchmark is also not a typical lakehouse use case: all the data is hosted locally, so it doesn't test a significant component of the stack.

    • Yeah, that’s one of many issues with ClickBench. It also uses a single table, so it can’t test joins.

      TPC-H is okay but not Lakehouse-specific. I’m not aware of any benchmarks that specifically test the performance of engines under common setups like external storage or scalable compute. It would be hard to design one that’s easily reproducible. (And in fairness to ClickBench, it’s intentionally simple for that exact reason: to generate a baseline score for any query engine that can query tabular data.)