
Comment by vovavili

3 hours ago

Replacing an 11.6GB Parquet file every 5 minutes strikes me as a bit wasteful. I would probably use Apache Iceberg here.

It's not doing that. If you look at the repository, it's adding a new commit with tiny Parquet files every 5 minutes. A recent commit added only a 20.9 KB Parquet file: https://huggingface.co/datasets/open-index/hacker-news/commi... and the ones before it had a median size of 5 KB: https://huggingface.co/datasets/open-index/hacker-news/tree/...

The bigger concern is how large the git history is going to get on the repository.

"The dataset is organized as one Parquet file per calendar month, plus 5-minute live files for today's activity. Every 5 minutes, new items are fetched from the source and committed directly as a single Parquet block. At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory."

So it's not really one big file getting replaced all the time. Though a less extreme variation of that is happening day to day.

  • Parquet is a very efficient storage format, and data interfaces tend to treat paths as partitions when the directory layout is logical.

Was thinking the same thing. Once a day would probably be more than enough, and if you really want minute-by-minute granularity, a delta file against the previous day should suffice.
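A daily delta like this is essentially an anti-join of today's snapshot against yesterday's: keep only the rows that are new or whose values changed. A rough sketch with made-up column names:

```python
# Compute a "delta from the previous day": rows in today's snapshot
# that are absent from or different in yesterday's. Columns are
# hypothetical stand-ins for the real item schema.
import pandas as pd

yesterday = pd.DataFrame({"id": [1, 2, 3], "score": [10, 20, 30]})
today = pd.DataFrame({"id": [2, 3, 4], "score": [20, 35, 5]})

# indicator=True marks rows found only on the left (today's) side,
# i.e. new items and items whose score changed.
merged = today.merge(yesterday, on=["id", "score"], how="left", indicator=True)
delta = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(sorted(delta["id"]))  # [3, 4]: id 3 changed, id 4 is new
```

The delta is what you'd commit each day; the full snapshot only needs rewriting occasionally, which would keep both the per-commit size and the git history growth down.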