Comment by vog
10 years ago
I wonder how this approach (single file + log) compares to the other usual approach (write second file, move over first):
1. Write the changed data into a temporary file in the same directory (don't touch the original file)
2. Move the new file over the old file
Does this lead to a simpler strategy that is easier to reason about, where it is less likely for the programmer to get it wrong? At least I see this strategy applied more often than the "single file + log" approach.
The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place.
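A minimal sketch of those two steps in Python (file and function names are made up; as the replies below point out, this naive form still needs explicit fsyncs to be crash-safe):

```python
import os

def naive_atomic_save(path, data):
    """Write the new contents to a temp file in the same directory,
    then rename it over the original (naive version, no fsyncs)."""
    tmp_path = path + ".tmp"        # hypothetical temp-file naming scheme
    with open(tmp_path, "wb") as f:
        f.write(data)               # step 1: the original file is never touched
    os.replace(tmp_path, path)      # step 2: atomic replace (rename(2) on POSIX)
```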
That's not guaranteed to work in the face of crashes. The problem is that the directory update could get flushed to disk before the file data.
This is the fundamental problem: when you allow the OS (or the compiler, or the CPU) to re-order operations in the name of efficiency, you lose control over intermediate states, and so you cannot guarantee that these intermediate states are consistent with respect to the semantics that you care about. And this is the even more fundamental problem: our entire programming methodology has revolved around describing what we want the computer to do rather than what we want it to achieve. Optimizations then have to reverse-engineer our instructions and make their best guesses as to what we really meant (e.g. "This is dead code. It cannot possibly affect the end result. Therefore it can be safely eliminated."). Sometimes (often?) those guesses are wrong. When they are, we typically only find out about it after the mismatch between our expectations and the computer's behavior has manifested itself in some undesirable (and often unrecoverable) way.
"That's not guaranteed to work in the face of crashes. The problem is that the directory update could get flushed to disk before the file data."
No, it can work, provided that the temporary file is fsynced before being renamed, the parent directory is fsynced after renaming the file, and that the application only considers the rename to have taken place after the parent directory is fsynced (not after the rename call itself).
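A sketch of that ordering in Python, assuming a POSIX system (names are illustrative, error handling kept minimal):

```python
import os

def atomic_save(path, data):
    """Crash-safe replace: fsync the temp file, rename it over the
    original, then fsync the parent directory."""
    dir_path = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"                      # hypothetical temp-file name
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                              # 1. file data reaches disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp_path, path)                     # 2. atomic replace (rename(2))
    dir_fd = os.open(dir_path, os.O_DIRECTORY)    # 3. fsync the parent directory so the
    try:                                          #    rename itself is durable; only now
        os.fsync(dir_fd)                          #    consider the save to have happened
    finally:
        os.close(dir_fd)
```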
FSYNC is not enough. You also have to make sure that NCQ is disabled:
https://en.wikipedia.org/wiki/Native_Command_Queuing
Good summary of the situation. It's why I fought out-of-order execution at the hardware and OS levels as much as I could. I even went out of my way to use processors that didn't do it. The market pushed the opposite stuff into dominance. Then came the inevitable software re-writes to get predictability and integrity out of systems that shouldn't have had these problems in the first place. It's all ridiculous.
It bugged me that Sublime Text used to do these so-called atomic saves by default, since it screwed with basic unix expectations like fstat and fseek working meaningfully (the sort of thing a tail -f implementation boils down to[0]). A concurrently running process using those calls would be off in la-la land as soon as the text file was modified and saved: it would never pick up any modifications, because it and the editor weren't even dealing with the same file any more.
[0] Here's follow(1) in my homemade PL:
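The homemade-PL snippet aside, here is a rough Python illustration of the same fstat/seek-based follow loop, and of why a rename-based save breaks it (purely a sketch, not the original code):

```python
import os, sys, time

def follow(path, interval=1.0):
    """Follow a file by descriptor, tail -f style: poll its size with
    fstat, seek to where we left off, and print whatever was appended."""
    fd = os.open(path, os.O_RDONLY)
    offset = os.fstat(fd).st_size          # start at the current end of file
    while True:
        size = os.fstat(fd).st_size
        if size > offset:
            os.lseek(fd, offset, os.SEEK_SET)
            chunk = os.read(fd, size - offset)
            sys.stdout.buffer.write(chunk)
            sys.stdout.flush()
            offset = size
        time.sleep(interval)
        # After a rename-style "atomic save", fd still refers to the old
        # file, so no new output ever appears -- the behaviour the comment
        # above complains about.
```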
GNU tail lets you say "--follow=descriptor" to follow the content of the file no matter how it gets renamed, or "--follow=name" to re-open the same filename when a different file gets renamed onto it.
That's the difference between 'tail -f' and 'tail -F' and is implemented in every tail I know of.
Functional versus imperative concurrent shared data approaches provide a good analogy:
* Single file + log: fine grained locking in a shared C data structure. Yuck!
* Write new then move: transactional reference to a shared storage location, something like Clojure's refs. Easy enough.
The latter clearly provides the properties we'd like; the former may, but it's a lot more complicated to verify and there are tons of corner cases. So I think "move new file over old file" is the simpler strategy and way easier to reason about.
"The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place."
Clojure's approach again provides an interesting solution to saving space. Taking the idea of splitting data into multiple files to the logical conclusion, you end up with the structure sharing used in efficient immutable data structures.
Your solution is slower; also, you need to fsync()/fdatasync() the new file before moving, at least on some systems (http://lwn.net/Articles/322823/ is the best reference I can find right now), and you need to fsync() the directory if you wish the rename to be durable (as opposed to just all-or-nothing).
In general this approach should work fine, but the devil is in the details:
1. You have to flush the change to the temporary file before the move, because otherwise you may get an empty file: the OS may reorder the move and write operations.
2. After the move you have to flush the parent directory of the destination file on POSIX. Windows has a special flag for MoveFileEx() to ensure the operation is done, or you have to call FlushFileBuffers() on the destination file.
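For the Windows half of that, a hedged sketch of the MoveFileEx() call with the write-through flag, via ctypes in Python (paths and the function name are made up; Windows only):

```python
import ctypes

# Flag values from the Win32 API (winbase.h)
MOVEFILE_REPLACE_EXISTING = 0x1
MOVEFILE_WRITE_THROUGH    = 0x8   # don't return until the move is flushed to disk

def replace_durably(tmp_path, dest_path):
    """Replace dest_path with tmp_path and ask Windows to write the move through."""
    ok = ctypes.windll.kernel32.MoveFileExW(
        tmp_path, dest_path,
        MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH)
    if not ok:
        raise ctypes.WinError()
```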
The linked paper mentions that many popular programs forget about (1).