← Back to context

Comment by twoodfin

14 days ago

Indeed, given the massive interest Parquet has generated over the past 5 years, and its critical role in modern data infrastructure, I’ve been disappointed every time I’ve dug into the open source ecosystem around it for one reason or another.

I think it’s revealing and unfortunate that everyone serious about Parquet, from DuckDB to Databricks, has written their own “codec”.

Some recent frustrations on this front from the DuckDB folks:

https://duckdb.org/2025/01/22/parquet-encodings.html

Unfortunately many of the big data libraries are like that and there is no motivation to fix these things. One example is the ORC Java libraries that had 100s of unnecessary dependencies while at the same time importing the filesystem into the format itself.