Comment by albertzeyer

5 months ago

Why not store the data directly as Arrow files, to allow for mmapping? I see F3 also supports such zero-copy mmap, and skimming through the paper, it actually seems to use Arrow buffers, so I wonder what the difference is compared to using Arrow files directly? (Arrow files are what HuggingFace datasets uses.)

The main reason is that Arrow files, at least in the form that can be mmapped zero-copy, are not compressed at all. So storing everything as Arrow would increase storage size by 10-100x (depending on the data).
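
To make the trade-off concrete, here is a rough sketch using only the Python standard library. It is not Arrow or F3 code: a raw int64 buffer stands in for an uncompressed Arrow buffer, zlib stands in for whatever codec a compressed format would use, and the names `read_value_mmap`, `read_value_compressed`, and `demo` are made up for illustration. The actual size ratio depends entirely on the data; this example uses highly repetitive values, which compress unusually well.

```python
import mmap
import os
import struct
import tempfile
import zlib
from array import array

ITEM_SIZE = 8  # bytes per int64 value (array type code "q")


def read_value_mmap(path, index):
    """Read one int64 from an uncompressed file without loading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The fixed-width layout lets us jump straight to the value's offset.
            return struct.unpack_from("q", mm, index * ITEM_SIZE)[0]


def read_value_compressed(path, index):
    """Read one int64 from a zlib-compressed file; requires decompressing first."""
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    return struct.unpack_from("q", raw, index * ITEM_SIZE)[0]


def demo(n=100_000):
    """Write the same column uncompressed and compressed, then compare."""
    # Assumption: low-cardinality data, chosen so the compression gain is obvious.
    values = [i % 16 for i in range(n)]
    raw = array("q", values).tobytes()

    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, "column.raw")
        z_path = os.path.join(tmp, "column.z")
        with open(raw_path, "wb") as f:
            f.write(raw)
        with open(z_path, "wb") as f:
            f.write(zlib.compress(raw))

        return {
            "raw_bytes": os.path.getsize(raw_path),
            "compressed_bytes": os.path.getsize(z_path),
            "mmap_value": read_value_mmap(raw_path, 12_345),
            "decompressed_value": read_value_compressed(z_path, 12_345),
        }


if __name__ == "__main__":
    print(demo())
```

Both readers return the same value, but the mmap path only touches the pages it needs, while the compressed path has to decompress before it can index into the data. That is the tension in the answer above: the mmap-friendly layout costs storage, and the compact layout costs a decode step.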