← Back to context

Comment by thebuilderjr

6 days ago

I agree. The main reason I shared it is because I find it interesting as a library. I actually use it behind the scenes to build https://telemetry.sh. Essentially, I ingest JSON, infer a Parquet schema, store the data in S3 with a lookaside cache on disk, and then use DataFusion for querying.

How do you infer your Parquet schemas?

  • You infer the types of the source data.

    For example you can go through say 1% of your data and for each column see if you can coerce all of the values to a float, int, date, string etc. And then from there you can set the Parquet schema with proper types.