Comment by amluto

7 months ago

The documentation for all of this is atrocious.

But if Avro-in-Parquet is a weird optional feature, it should be off by default! Parquet’s metadata is primarily in Thrift, not Avro, and it seems to me that no Avro should be involved in decoding Parquet files unless explicitly requested.

To the sibling comment’s point, I suppose it’s not weird in the Java ecosystem. The parquet-java project has a design where it deserializes Parquet fields into Java representations grabbed from _other_ projects rather than either having some kind of canonical self-representation in memory or acting as just an abstract codec. So one of the most common things to do is apparently to use the “Avro”-flavored serdes to get generic records in memory (note that the actual Avro serialization format is not involved in doing that; parquet-java just uses the classes from Avro as the in-memory representations and deserializes Parquet into them). The whole approach seems a bit goofy; I’d expect the library either to work as some kind of abstracted codec interface (requiring the in-memory representations to host Parquet, rather than the other way around - like how pandas hosts fastparquet in Python land) or to provide a canonical object representation. Instead, it’s this in-between design where it has a grab bag of converters that transform Parquet to and from random object types pulled from elsewhere in the Java ecosystem.
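To make the "abstracted codec interface" idea concrete, here is a rough sketch of what I mean. Everything in it is hypothetical: `RecordSink`, `decode`, and `MapSink` are names I made up for illustration, none of this is parquet-java's actual API, and the "file" is just a list of maps standing in for decoded rows. The point is only that the decoder knows nothing about the in-memory type; the host supplies it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CodecSketch {
    /** Hypothetical callback interface that the host application implements. */
    public interface RecordSink<T> {
        void startRecord();
        void field(String name, Object value);
        T endRecord();
    }

    /**
     * Stand-in for a decoder. It walks the rows and pushes each field into
     * the sink; it never chooses the in-memory representation itself.
     */
    public static <T> List<T> decode(List<Map<String, Object>> rows, RecordSink<T> sink) {
        List<T> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            sink.startRecord();
            for (Map.Entry<String, Object> e : row.entrySet()) {
                sink.field(e.getKey(), e.getValue());
            }
            out.add(sink.endRecord());
        }
        return out;
    }

    /** One possible host-side representation: a plain ordered map per record. */
    public static class MapSink implements RecordSink<Map<String, Object>> {
        private Map<String, Object> current;

        @Override
        public void startRecord() {
            current = new LinkedHashMap<>();
        }

        @Override
        public void field(String name, Object value) {
            current.put(name, value);
        }

        @Override
        public Map<String, Object> endRecord() {
            return current;
        }
    }

    public static void main(String[] args) {
        // Fake "decoded file" with a single row.
        List<Map<String, Object>> rows = List.of(Map.of("id", 1));
        System.out.println(decode(rows, new MapSink()));
    }
}
```

With a shape like this, an Avro-backed sink would be just one more host-side implementation living outside the decoder, rather than something the decoder reaches for on its own.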

I’d still like to see a clear explanation of where one can stick a Java class name in a Parquet file such that it ends up interpreted by the Avro codec. And I’m curious why it was fixed by making a list of allowed class names instead of disabling the entire mechanism.