Comment by twoodfin

10 months ago

Indeed, given the massive interest Parquet has generated over the past 5 years, and its critical role in modern data infrastructure, I’ve been disappointed, for one reason or another, every time I’ve dug into the open source ecosystem around it.

I think it’s revealing and unfortunate that everyone serious about Parquet, from DuckDB to Databricks, has written their own “codec”.

Some recent frustrations on this front from the DuckDB folks:

https://duckdb.org/2025/01/22/parquet-encodings.html


dev_l1x_be 10 months ago

Unfortunately, many of the big data libraries are like that, and there is no motivation to fix these things. One example is the ORC Java libraries, which had hundreds of unnecessary dependencies while at the same time importing the filesystem into the format itself.