
Comment by tikhonj

13 days ago

Last time I tried to use the official Apache Parquet Java library, parsing "normal" Parquet files depended on parquet-avro because the library used Avro's GenericRecord class to represent rows from Parquet files with arbitrary schemas. So this problem would presumably affect any kind of Parquet parsing, even when no Avro is actually involved.

(Yes, this doesn't make sense; the official Parquet Java library had some of the worst code design I've had the misfortune to depend on.)
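For context, this is roughly what reading an arbitrary Parquet file through the official library looks like: even with no Avro anywhere in your data, rows come back as Avro GenericRecords. A minimal sketch (the file path is made up, and the exact builder methods vary by library version):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;

    public class ReadParquet {
        public static void main(String[] args) throws Exception {
            // Plain Parquet file with no Avro data involved; every row
            // is still materialized as an Avro GenericRecord.
            try (ParquetReader<GenericRecord> reader =
                    AvroParquetReader.<GenericRecord>builder(new Path("data.parquet")).build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    System.out.println(record);
                }
            }
        }
    }

Note that this path also pulls in Hadoop (for Path and its filesystem abstraction) on top of Avro, which is exactly the kind of dependency sprawl being complained about here.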

Indeed, given the massive interest Parquet has generated over the past five years and its critical role in modern data infrastructure, I've been disappointed every time I've had reason to dig into the open source ecosystem around it.

I think it’s revealing and unfortunate that everyone serious about Parquet, from DuckDB to Databricks, has written their own “codec”.

Some recent frustrations on this front from the DuckDB folks:

https://duckdb.org/2025/01/22/parquet-encodings.html

  • Unfortunately many of the big data libraries are like that, and there is no motivation to fix these things. One example is the ORC Java libraries, which had hundreds of unnecessary dependencies while at the same time baking the filesystem into the format itself.

The Apache Arrow libraries are a good alternative for reading Parquet files in Java. They provide a column-oriented interface, rather than the ugly Avro stuff in the Apache Parquet library.
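As a rough sketch of what that column-oriented approach looks like with Arrow's Java Dataset module (the file URI and batch size here are arbitrary, and API details have shifted between Arrow versions):

    import org.apache.arrow.dataset.file.FileFormat;
    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
    import org.apache.arrow.dataset.jni.NativeMemoryPool;
    import org.apache.arrow.dataset.scanner.ScanOptions;
    import org.apache.arrow.dataset.scanner.Scanner;
    import org.apache.arrow.dataset.source.Dataset;
    import org.apache.arrow.dataset.source.DatasetFactory;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowReader;

    public class ReadParquetArrow {
        public static void main(String[] args) throws Exception {
            try (BufferAllocator allocator = new RootAllocator();
                 DatasetFactory factory = new FileSystemDatasetFactory(
                         allocator, NativeMemoryPool.getDefault(),
                         FileFormat.PARQUET, "file:///tmp/data.parquet");
                 Dataset dataset = factory.finish();
                 Scanner scanner = dataset.newScan(new ScanOptions(/*batchSize=*/ 32768));
                 ArrowReader reader = scanner.scanBatches()) {
                // Data arrives as batches of Arrow column vectors,
                // not as per-row record objects.
                VectorSchemaRoot root = reader.getVectorSchemaRoot();
                while (reader.loadNextBatch()) {
                    System.out.println(root.rowCount() + " rows in this batch");
                }
            }
        }
    }

Each batch exposes the columns directly, which is a much more natural fit for how Parquet is actually laid out on disk than forcing everything through row-shaped GenericRecords.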