
Comment by vovavili

3 hours ago

Replacing an 11.6GB Parquet file every 5 minutes strikes me as a bit wasteful. I would probably use Apache Iceberg here.

It's not doing that. If you look at the repository, it's adding a new commit with tiny Parquet files every 5 minutes. A recent commit added only a 20.9 KB Parquet file: https://huggingface.co/datasets/open-index/hacker-news/commi... and the ones before it had a median size of 5 KB: https://huggingface.co/datasets/open-index/hacker-news/tree/...

The bigger concern is how large the git history is going to get on the repository.

"The dataset is organized as one Parquet file per calendar month, plus 5-minute live files for today's activity. Every 5 minutes, new items are fetched from the source and committed directly as a single Parquet block. At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory."

So it's not really one big file getting replaced all the time. Though a less extreme variation of that is happening day to day.

  • Parquet is a very efficient storage format, and data interfaces tend to treat paths as partitions when the directory layout is logical.

Was thinking the same thing. Once a day would probably be more than enough, and if you really want minute-by-minute granularity, a delta file against the previous day should suffice.
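A daily delta like this is essentially an anti-join of today's snapshot against yesterday's: keep only the rows that are new or whose values changed. A rough sketch with made-up column names:

```python
# Compute a "delta from the previous day": rows in today's snapshot
# that are absent from or different in yesterday's. Columns are
# hypothetical stand-ins for the real item schema.
import pandas as pd

yesterday = pd.DataFrame({"id": [1, 2, 3], "score": [10, 20, 30]})
today = pd.DataFrame({"id": [2, 3, 4], "score": [20, 35, 5]})

# indicator=True marks rows found only on the left (today's) side,
# i.e. new items and items whose score changed.
merged = today.merge(yesterday, on=["id", "score"], how="left", indicator=True)
delta = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(sorted(delta["id"]))  # [3, 4]: id 3 changed, id 4 is new
```

The delta is what you'd commit each day; the full snapshot only needs rewriting occasionally, which would keep both the per-commit size and the git history growth down.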