Comment by dataflow
7 years ago
Path-based I/O seems quite dangerous to me. If everything was path-based, you'd easily have inherent race conditions. You want to delete a directory? You stat() all the files, they look empty, so you delete them... but in between, another process writes to some file (or maybe the user forgets the file is being deleted and saves to something there), and suddenly you've deleted data you didn't expect. When you do things in a handle-based fashion, you know you're always referring to the same file (and can lock it to prevent updates, etc.), even if files are being moved around.
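The check-then-delete race described above can be made concrete. This is a contrived sketch in Python: a real race involves two independent processes, so the "other process" write is interleaved explicitly here just to make the window visible.

```python
import os
import tempfile

# Contrived sketch of the check-then-act window in a path-based delete.
d = tempfile.mkdtemp()
path = os.path.join(d, "notes.txt")
open(path, "w").close()                 # file exists, empty

# 1. Path-based check: the file looks empty, so it seems safe to delete.
assert os.stat(path).st_size == 0

# 2. ...but between the check and the delete, another process (simulated
#    inline here) saves data to the same path.
with open(path, "w") as f:
    f.write("unsaved work")

# 3. The delete still proceeds on the stale decision, destroying the
#    newly written data along with the file.
os.remove(path)
print(os.path.exists(path))             # False: the new data is gone too
```

With a handle opened before the check (and sharing restricted), the writer in step 2 would have been refused instead.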
However, to answer your question of why removing a directory is slow: if you mean it's slow inside Explorer, a lot of it is shell-level processing (shell hooks etc.), not the file I/O itself -- especially if the hooks are slow. TortoiseGit with its cache enabled, for example, can easily slow down deletions by a factor of 100x.

But regarding the file I/O part, if it's really that slow at all, I think it's partly because the user-level API is path-based (because that's what people find easier to work with), whereas the system calls are mostly handle-based (because that's the more robust thing, as explained above... though it can also be faster, since a lot of work is already done for the handle and doesn't need to be re-performed on every access). So merely traversing the directory a/b/c requires opening and closing a, then opening and closing a/b, then opening and closing a/b/c -- but even opening a/b/c requires internally processing a and b again, since they may no longer be the same things as before. This is O(n^2) in the directory depth. If you reduce it to O(n) by using the system calls and providing parent directory handles directly (NtOpenFile() with OBJECT_ATTRIBUTES->RootDirectory), then I think it should be faster and more robust.
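The handle-relative traversal this describes (NtOpenFile() with OBJECT_ATTRIBUTES->RootDirectory on Windows) has a POSIX analogue in openat(), exposed in Python through the dir_fd parameter. A rough sketch, assuming Linux: each step resolves exactly one name against an already-open parent handle, so walking a/b/c does O(n) total component lookups, and no step silently re-resolves an ancestor that may have been renamed out from under you.

```python
import os
import tempfile

# Build a sample a/b/c tree to walk.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b", "c"))

# Walk one component at a time, each open relative to the parent handle
# (openat() underneath) -- the POSIX analogue of NtOpenFile with
# OBJECT_ATTRIBUTES->RootDirectory.
fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
for name in ("a", "b", "c"):
    child = os.open(name, os.O_RDONLY | os.O_DIRECTORY, dir_fd=fd)
    os.close(fd)        # done with the parent handle
    fd = child

print(os.listdir(fd))   # the handle now refers to .../a/b/c
os.close(fd)
```

The same dir_fd parameter on os.unlink() and os.rmdir() (unlinkat() underneath) is what lets a recursive delete operate handle-relative rather than re-resolving full paths.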
> You stat() all the files, they look empty, so you delete them... but in between, another process writes to some file (or maybe the user forgets the file is being deleted and saves to something there), and suddenly you've deleted data you didn't expect
This is fundamentally no different between the systems; race conditions can happen either way. The user could write data to a file right before the deletion recurses to the same directory and the handle-based deletion happens. Similarly, the newly written data would be wiped out unintentionally.
For files whose access from different processes must be controlled, there is explicit locking. No filesystem or VFS is going to protect you from accidentally deleting stuff you're still using in another context.
> [...] The user could write data to file right before the deletion recurses to the same directory and the handle-based deletion happens. Similarly the newly written data would be wiped out unintentionally. [...] No filesystem or VFS is going to protect you from accidentally deleting stuff you're still using in another context.
...what? No file system is going to protect you from accidentally deleting in-use files? But that's exactly what Windows does: it prevents you from deleting in-use files. That's what everyone here has been complaining about. File sharing modes let you lock files to make sure they're not written to (and/or read from) before being deleted; it very much need not be the case that the user could write to a file before it's deleted.
Read my comment again.
There is an inherent race condition if one program is using a file and another program is deleting it without caring about whether the file is being accessed by other programs.
At that point, all bets are off, regardless of whether the files are accessed by paths or handles.
Windows protects a file from deletion at the exact moment it is being accessed, but it does not protect the file from being deleted after it has been accessed. In wall-clock terms, the latter is by far the more likely scenario.
So if an editor saves a document to disk, and another program then deletes the document, the editor will happily exit without saving it again, thinking that it hasn't been changed.
It doesn't particularly matter whether the two programs clash exactly at the moment of saving/deletion or not. The problem lies in the lack of coordination between the programs, and indeed no file system is going to protect you from that.
> This is fundamentally not any different between the systems, race conditions can happen either way. The user could write data to file right before the deletion recurses to the same directory and the handle-based deletion happens.
When you hold a handle to a file or directory, you get to decide the degree of shared access granted to any other users for as long as that handle is held (FILE_SHARE_*). So this does solve the concurrency problem, by allowing you to effectively hold a lock on the file until you're done.
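FILE_SHARE_* is Windows-specific, but the shape of the idea can be illustrated with the loose POSIX analogue of flock() -- with the important caveat that flock() is advisory (it only stops parties that also try to take the lock), whereas Windows share modes are enforced against every opener. A sketch, assuming Linux:

```python
import fcntl
import tempfile

# Loose POSIX illustration of "no sharing for the handle's lifetime".
f = tempfile.NamedTemporaryFile(delete=False)
f.close()

holder = open(f.name, "r+b")
fcntl.flock(holder, fcntl.LOCK_EX)      # exclusive for as long as we hold it

# A second opener trying to take the same exclusive access is refused
# immediately (LOCK_NB = don't block).
other = open(f.name, "r+b")
try:
    fcntl.flock(other, fcntl.LOCK_EX | fcntl.LOCK_NB)
    locked_out = False
except BlockingIOError:
    locked_out = True

print(locked_out)                       # True: access stays exclusive
fcntl.flock(holder, fcntl.LOCK_UN)
other.close()
holder.close()
```

On Windows, the equivalent exclusion would come from opening the file with CreateFile and a dwShareMode of 0, at which point any other open attempt fails with a sharing violation whether or not the other party cooperates.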