Comment by seanhunter
13 days ago
I get the point you’re making but I’m gonna push back a little on this (as someone who has written a fair few ETL processes in their time). When are you ever ETLing a parquet file? You are always ETLing some raw format (css, json, raw text, structured text, etc) and writing into parquet files, never reading parquet files themselves. It seems a pretty bad practise to write your etl to just pick up whatever file in whatever format from a slop bucket you don’t control. I would always pull files in specific formats from such a common staging area and everything else would go into a random “unstructured data” dump where you just make a copy of it and record the metadata. I mean it’s a bad bug and I’m happy they’re fixing it, but it feels like you have to go out of your way to encounter it in practice.
No comments yet
Contribute on Hacker News ↗