Comment by bigshik
13 hours ago
Nice work—this hits a real pain point with Parquet. My main use case is debugging partitioned datasets on S3 with schema drift and skew, where I care about: which files/partitions have schema mismatches, weird row-group stats (all-null, out-of-range, huge skew), and doing that via metadata only.
Right now parqeye looks mainly single-file focused. Do you have plans for a “dataset mode” that takes a dir/S3 prefix and surfaces per-file/row-group summaries (row counts, min/max, null %, schema diffs vs a reference file) using just Parquet stats so it scales to tens of GB? Or do you see parqeye intentionally staying a single-file inspector?
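To make the ask concrete, here is a rough sketch of the kind of metadata-only pass I have in mind. It is not how parqeye works and not a proposed design; it assumes pyarrow is available, and the helper names (`schema_diff`, `null_fraction`, `summarize_file`) are made up for illustration. It only reads each file's footer, so no data pages are touched.

```python
from typing import Dict, Optional


def schema_diff(reference: Dict[str, str], other: Dict[str, str]) -> Dict[str, list]:
    """Compare two {column: type} mappings and report drift vs the reference."""
    return {
        "missing": sorted(set(reference) - set(other)),
        "added": sorted(set(other) - set(reference)),
        "retyped": sorted(
            (c, reference[c], other[c])
            for c in set(reference) & set(other)
            if reference[c] != other[c]
        ),
    }


def null_fraction(null_count: Optional[int], num_rows: int) -> Optional[float]:
    """Null share of a column chunk; None when no null count was recorded."""
    if null_count is None or num_rows == 0:
        return None
    return null_count / num_rows


def summarize_file(path: str) -> dict:
    """Hypothetical helper: read only the Parquet footer and collect row-group stats."""
    import pyarrow.parquet as pq  # lazy import so the helpers above stay stdlib-only

    meta = pq.read_metadata(path)
    # Top-level columns only; nested fields show up as leaf paths in the stats below.
    schema = {f.name: str(f.type) for f in meta.schema.to_arrow_schema()}
    row_groups = []
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        cols = {}
        for j in range(rg.num_columns):
            col = rg.column(j)
            st = col.statistics  # None if the writer skipped statistics
            has_nulls = st is not None and st.has_null_count
            has_minmax = st is not None and st.has_min_max
            cols[col.path_in_schema] = {
                "null_frac": null_fraction(
                    st.null_count if has_nulls else None, rg.num_rows
                ),
                "min": st.min if has_minmax else None,
                "max": st.max if has_minmax else None,
            }
        row_groups.append({"num_rows": rg.num_rows, "columns": cols})
    return {"path": path, "schema": schema, "row_groups": row_groups}
```

A dataset mode would then be roughly: run `summarize_file` over every file under the prefix, pick one file as the reference, and show `schema_diff(reference["schema"], f["schema"])` plus any row group whose `null_frac` is 1.0 or whose min/max looks out of range. Listing files on S3 is left out here on purpose.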