Comment by Animats
10 years ago
This problem can be fixed. We need to rethink file system semantics. Here's an approach:
Each file is one of the following types:
Unit files
For a unit file, the unit of consistency is the entire file. Unit files can be created or replaced, but not modified. Opening a unit file for writing means creating a new file. When the new file is closed successfully, the new version atomically replaces the old one. If anything goes wrong between create and successful close - a program abort, a system crash, anything - the old version remains and the new version is deleted. File systems are required to maintain that guarantee.
Opens for read while an update is in progress reference the old version, so all readers always see a consistent version.
Because unit files are never modified in place once written, unit file semantics are easy for a file system to implement. The file system can cache or defer writes; there's no need to journal. The main requirement is that close must block until all writes have committed to disk, and return a success status only if nothing went wrong.
In practice, most files are unit files. Much effort already goes into approximating unit file semantics - ".part" files, elaborate renaming rituals to get an atomic rename (different for each OS and file system), and such. It would be easier to just provide unit file semantics; that's usually what you want.
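For comparison, here's roughly what the rename ritual looks like today on a POSIX system - a minimal sketch, not hardened code (a robust version would also loop on short writes and fsync the containing directory):

    /* Approximating "unit file" replacement on POSIX today:
       write a temp file, flush it, then atomically rename it over the old one. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_atomically(const char *path, const char *data, size_t len) {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.part", path);   /* the ".part" convention */

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;

        if (write(fd, data, len) != (ssize_t)len ||   /* write the new contents */
            fsync(fd) != 0 ||                         /* force the data to disk */
            close(fd) != 0) {
            unlink(tmp);                              /* abandon the new version */
            return -1;
        }
        /* rename() is atomic within one file system: readers see either the
           old file or the complete new one, never a partial file. */
        if (rename(tmp, path) != 0) {
            unlink(tmp);
            return -1;
        }
        return 0;
    }

With unit file semantics, all of that collapses to open-for-write, write, close.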
Log files
Log files are append-only. The unit of consistency is one write. The file system is required to guarantee that, after a program abort or crash, the file ends cleanly on a write boundary. A "fsync"-type operation adds the guarantee that the file is durable up through the last write. A log file can be read while being written if opened read-only; readers can seek, but writers cannot. Appends always go to the end of the file, even when multiple processes are writing the same file.
This, of course, is what you want for log files.
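Here's a minimal sketch of the append pattern with today's POSIX flags; the "ends cleanly on a write boundary after a crash" guarantee is the part the file system would have to add:

    /* Append-only logging with O_APPEND. Each write goes to the current end
       of file, even with multiple writers. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int append_record(const char *path, const char *record) {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return -1;

        ssize_t n = write(fd, record, strlen(record)); /* one write = one unit of consistency */
        if (n < 0 || fsync(fd) != 0) {                 /* fsync: durable up through this write */
            close(fd);
            return -1;
        }
        return close(fd);
    }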
Temporary files
Temporary files disappear in a crash: there's no journaling or recovery, and you're guaranteed that after a crash they're gone. Random read/write access is allowed.
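The closest thing today is creating a temp file and immediately unlinking it, which makes the name disappear but still leaves the file system doing its normal durability work for data nobody will ever need after a crash - a rough sketch:

    /* An anonymous temporary file today: mkstemp + unlink.
       The name is gone immediately; the data vanishes when the fd is closed
       or the system crashes. No durability is needed or wanted. */
    #include <stdlib.h>
    #include <unistd.h>

    int make_temp_file(void) {
        char name[] = "/tmp/scratch-XXXXXX";  /* template required by mkstemp */
        int fd = mkstemp(name);               /* creates and opens a unique file */
        if (fd < 0) return -1;
        unlink(name);                         /* file lives only as long as the fd */
        return fd;                            /* caller can read/write/seek freely */
    }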
Managed files
Managed files are for databases and programs that care about exactly when data is committed. A "write" API is provided which returns a status when the write is accepted, and then makes an asynchronous callback when the write is committed and safely on disk. This allows the database program to know which operations the file system has completed, but doesn't impose an ordering restriction on the file system.
This is what a database implementer needs - solid information about whether and when a write has committed. If writes have to be ordered, the database can wait for the first write to commit before starting the second. If something goes wrong after a write request was submitted, the caller gets the status in the callback.
This would be much better than the present situation of trying to decide when a call to "fsync" is necessary. It's also less restrictive: "fsync" waits for everything to commit, which is often more than is needed to finish one operation.
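A sketch of what a managed-file interface might look like in C - the names here (managed_write, commit_cb) are hypothetical, just to show the shape of the API described above:

    /* Hypothetical managed-file write API. None of these names exist in POSIX;
       this only illustrates the accept-now, commit-later model. */
    #include <stddef.h>
    #include <stdint.h>

    /* Called by the file system once the write is safely on disk (or has failed). */
    typedef void (*commit_cb)(uint64_t write_id, int status, void *user_data);

    /* Returns 0 if the write was accepted; the callback later reports whether it
       committed. No ordering between outstanding writes is implied. */
    int managed_write(int fd, const void *buf, size_t len, int64_t offset,
                      commit_cb on_commit, void *user_data);

A database that needs write B ordered after write A simply waits for A's commit callback before submitting B; unordered writes can all be in flight at once.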
This could be retrofitted to POSIX-type systems. Files created with "creat" or "O_CREAT" are unit files by default, unless "O_APPEND" is specified, in which case they're log files. Files in /tmp are temporary files. Managed files would be created with some new flag. Only a small number of programs need managed files, and they usually know who they are.
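In code, the mapping might look like this (O_MANAGED is made up; the other flags exist today):

    /* Hypothetical mapping of the four file types onto open() flags. */
    #include <fcntl.h>

    void open_examples(void) {
        int unit_fd = open("config.json", O_WRONLY | O_CREAT | O_TRUNC, 0644);  /* unit file: the default */
        int log_fd  = open("events.log", O_WRONLY | O_CREAT | O_APPEND, 0644);  /* log file: O_APPEND */
        int tmp_fd  = open("/tmp/scratch", O_RDWR | O_CREAT, 0600);             /* temporary: lives in /tmp */
        /* int db_fd = open("data.db", O_RDWR | O_CREAT | O_MANAGED, 0644); */  /* managed: the new flag */
        (void)unit_fd; (void)log_fd; (void)tmp_fd;
    }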
This would solve most of the problems mentioned in the original post.