Comment by moffkalast
3 months ago
L3 has open pretraining data, it's just not official for obvious legal reasons: https://huggingface.co/datasets/HuggingFaceFW/fineweb
Wait, the whole (English-speaking) web content dataset is ~50TB?
Yes, if we take the filtered and deduplicated HTML from CommonCrawl. I recently made a video on this topic: https://www.youtube.com/watch?v=8yH3rY1fZEA
Fun presentation, thanks! 72 min of ingestion time for ~81TB of data works out to ~1.1TB/min, or ~19GB/s. Distributed or single-node? Shards? I see 50 jobs are used for parallel ingestion, and I wonder how ~19GB/s was achieved, since ingestion rates were far below that figure the last time I played around with CH performance. Granted, that was some years ago.
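A quick back-of-envelope sketch of those numbers, assuming the 50 jobs from the presentation run fully in parallel and split the load evenly (the per-job figure below is my own inference, not something stated in the talk):

```python
# Back-of-envelope check of the ingestion rate quoted above.
# Inputs are the figures from the presentation: ~81 TB in ~72 minutes, 50 parallel jobs.
data_tb = 81    # total ingested data, terabytes
minutes = 72    # wall-clock ingestion time
jobs = 50       # parallel ingestion jobs mentioned in the talk

aggregate_gb_per_s = data_tb * 1000 / (minutes * 60)  # ~18.8 GB/s aggregate
per_job_gb_per_s = aggregate_gb_per_s / jobs           # ~0.38 GB/s per job (assumes even split)

print(f"aggregate: {aggregate_gb_per_s:.1f} GB/s")
print(f"per job:   {per_job_gb_per_s:.2f} GB/s")
```

Under that assumption the per-stream rate is well under 1 GB/s, so the headline ~19GB/s would come from parallelism rather than any single ingestion stream being that fast.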