
Comment by mkl

3 years ago

On Windows, in my experience it's at least a factor of 10. I recently worked on a script that reads ~20,000 files of a few KB each and extracts some info to generate a web page. I sped it up enormously just by putting the file contents into SQLite tables.
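The one-time import described above can be sketched roughly like this (paths and table name are hypothetical, not from the original script). Batching all inserts into a single transaction is what makes the load fast:

```python
import sqlite3
from pathlib import Path

def import_files(src_dir: str, db_path: str) -> None:
    """Copy every regular file in src_dir into a SQLite table, keyed by name."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, body BLOB)")
    with conn:  # one transaction for the whole batch -- crucial for speed
        conn.executemany(
            "INSERT OR REPLACE INTO files VALUES (?, ?)",
            ((p.name, p.read_bytes()) for p in Path(src_dir).iterdir() if p.is_file()),
        )
    conn.close()
```

After the import, extracting info becomes one query over a single file instead of thousands of individual opens.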

Storing lots of small files (particularly in a single directory) is a known failure mode of many OS filesystems. I remember putting a million files into MogileFS and finding that filesystem operations basically did not complete in any reasonable length of time.

  • It also seems that some “do one thing well” tools want to implement their logic using files. For example: each entry in a password database is a file (pass(1)).

    The overhead of all those small files might not matter, though, if the problem space is kept “human sized” (whatever one human can be bothered to manage).

    I used to have tens of thousands of files in git annex. I had to tarball (chunk) some of the things that I never really use in order to speed up `git annex fsck`.

  • Yes, I did an experiment years ago with 6.5 million files and it was a disaster (https://breckyunits.com/building-a-treebase-with-6.5-million...).

    However, things have totally changed with the M1 generation of MacBooks. Things that were once near impossible now run in an instant. I need to redo this experiment.

    • When I worked at NetApp, this was a problem there too.

      IIRC, the fix was to store directory entries in sorted buckets (i.e., a hash map) and lock only the bucket corresponding to the file name. This reduced the scope of locking for atomic operations like rename, and also allowed faster lookups instead of an O(n) scan.

  • Yeah, I got absolutely crushed by this when trying to migrate a Windows Server 2016 machine to unRAID, whose filesystem is absolutely horrible at dealing with thousands of small files. It wiped out a month of work on NAS-related activities; we're back on Windows again.
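The hash-bucket fix described in the NetApp comment above can be illustrated with a toy sketch (this is not NetApp's actual implementation): names hash into buckets, each bucket keeps its entries sorted, and each bucket has its own lock, so an operation only locks the bucket(s) it touches rather than the whole directory.

```python
import threading
from bisect import insort, bisect_left

class BucketedDirectory:
    """Toy directory index: per-bucket locks and sorted entries for fast lookup."""

    def __init__(self, nbuckets: int = 64):
        self.buckets = [[] for _ in range(nbuckets)]           # each kept sorted
        self.locks = [threading.Lock() for _ in range(nbuckets)]

    def _idx(self, name: str) -> int:
        return hash(name) % len(self.buckets)

    def add(self, name: str) -> None:
        i = self._idx(name)
        with self.locks[i]:            # only this bucket is locked
            insort(self.buckets[i], name)

    def contains(self, name: str) -> bool:
        i = self._idx(name)
        with self.locks[i]:
            b = self.buckets[i]
            j = bisect_left(b, name)   # binary search, not an O(n) scan
            return j < len(b) and b[j] == name
```

A rename would lock at most two buckets (source and destination names), leaving the rest of the directory available to other threads.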

Windows struggles horrendously with lots of small files. If you tried the same task on Linux you’d see a large jump in performance. I don’t know if Linux small file access could match SQLite but it would be a lot closer.

  • That's because Windows' filesystem offers quite a lot of features that many other filesystems don't -- and that SQLite certainly doesn't. You might not need those features, of course.

    One core feature is that Windows offers hooks ('filters') that allow other components to put themselves between the client program and the files. This is how virtual filesystems work (like OneDrive, etc.), and how anti-malware works.

    When you read from SQLite, those reads can't come from another server, the objects can't be scanned automatically, etc. Again -- you might not need these features, but it's not that SQLite or ext4 is somehow magically faster; they just made different design choices.

    • The design choices make a bit of a difference but most of the overhead is Defender. When you try to read thousands of files your computer spends most of its time running Defender. Turn it off and the problem goes away.

  • Yes, Linux does a lot better (I haven't tried that specific script, but I have done a lot of similar things), but I've gotten speed improvements from similar use of SQLite there too, especially when dealing with a lot of files in the same directory.

Whenever I have lots of files that need to go into a directory, I'll split into subdirs by prefix (or other partitioning scheme)--look in .git/objects for example.
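The .git/objects scheme mentioned above can be sketched like this: hash the content, use the first two hex digits as the subdirectory name, and put the rest in the file name, so no single directory ever grows very large. (The function names here are mine, not Git's.)

```python
import hashlib
from pathlib import Path

def sharded_path(root: str, data: bytes) -> Path:
    """Map content to root/<first 2 hex digits>/<remaining digits>, Git-style."""
    digest = hashlib.sha1(data).hexdigest()
    return Path(root) / digest[:2] / digest[2:]

def store(root: str, data: bytes) -> Path:
    """Write data to its sharded location, creating the shard dir if needed."""
    p = sharded_path(root, data)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_bytes(data)
    return p
```

With 256 two-digit shards, a million objects average under 4,000 entries per directory instead of a million in one.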

Also be careful if the function that returns directory contents sorts its results when you don't need them sorted.

I think an in-process database will always have a large advantage in this scenario, as going through the filesystem means at least three system calls (open, read, close) per file.
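Those three-plus system calls are visible even from Python when you drop to the raw os-level API (a minimal sketch; real code would normally just use `open()`):

```python
import os

def read_file(path: str) -> bytes:
    """Read a whole file using the raw syscall-level API."""
    fd = os.open(path, os.O_RDONLY)      # syscall 1: open
    chunks = []
    while chunk := os.read(fd, 65536):   # syscall 2..n: read until EOF
        chunks.append(chunk)
    os.close(fd)                         # final syscall: close
    return b"".join(chunks)
```

Multiply that per-file cost by tens of thousands of files and the fixed syscall overhead dominates, whereas an in-process database can satisfy many lookups from one open file and its page cache.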

That's not to say filesystems couldn't be improved.