Comment by kentonv
10 years ago
The article's table of filesystem semantics is missing at least one X: Appends on ext4-ordered are not atomic. When you append to a file, the file size (metadata) and content (data) must both be updated. Metadata is flushed every 5 seconds or so, while data can sit in cache for more like 30. So the file size update may hit disk 25s before the data does, and if you crash during that window, then on recovery you'll find the file has a bunch of zero bytes appended instead of the data you expected.
(I like to test my storage code by running it on top of a network block device and then simulating power failure by cutting the connection randomly. This is one problem I found while doing that.)
Wow, 5 and 30 seconds before metadata and data flush? That sounds unbelievably long. If that's true, doesn't almost every power loss result in losing whatever was written in the last ~15 seconds, on average? Is it really that bad?
I'd expect more "smartness" from Linux, like flushing earlier as soon as there is no "write pressure".
> If that's true, doesn't almost every power loss result in losing whatever was written in the last ~15 seconds, on average? Is it really that bad?
No, because correct programs use sync() and/or fsync() to force timely flushes.
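In C, the basic pattern looks roughly like this (a minimal sketch, not production code: the function name is made up and error handling is reduced to early returns):

    #include <unistd.h>

    /* Append a record and fsync() before acknowledging it, so that both the
       data and the size update reach disk before we claim success. */
    int durable_append(int fd, const void *buf, size_t len) {
        ssize_t n = write(fd, buf, len);
        if (n < 0 || (size_t)n != len) return -1;  /* error or short write */
        if (fsync(fd) != 0) return -1;             /* force data + metadata to disk */
        return 0;  /* only now is it safe to report "written" */
    }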
A good database should not reply successfully to a write request until the write has been fully flushed to disk, so that an "acknowledged write" can never be lost. Also, it should perform write and sync operations in such a sequence that it cannot be left in a state where it is unable to recover -- that is, if a power outage happens during the transaction, then on recovery the database is either able to complete the transaction or undo it.
The basic way to accomplish this is to use a journal: each transaction is first appended to the journal and the journal synced to disk. Once the transaction is fully on-disk in the journal, then the database knows that it cannot "forget" the transaction, so it can reply successfully and work on updating the "real" data at its leisure.
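A rough sketch of that ordering in C (names are hypothetical, error handling trimmed):

    #include <unistd.h>

    int commit_transaction(int journal_fd, const void *record, size_t record_len) {
        /* 1. Append the serialized transaction to the journal. */
        if (write(journal_fd, record, record_len) != (ssize_t)record_len)
            return -1;
        /* 2. Force the journal entry to disk; after this point the
              transaction cannot be "forgotten" across a crash. */
        if (fsync(journal_fd) != 0)
            return -1;
        /* 3. Safe to acknowledge the client now. Updating the real data
              files can happen later; crash recovery replays the journal. */
        return 0;
    }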
Of course, if you're working with something that is not a database, then who knows whether it syncs correctly. (For that matter, even if you're working with a database, many have been known to get it wrong, sometimes intentionally in the name of performance. Be sure to read the docs.)
For traditional desktop apps that load and save whole individual files at a time, the "write to temporary then rename" approach should generally get the job done (technically you're supposed to fsync() between writing and renaming, but many filesystems now do this implicitly). For anything more complicated, use sqlite or a full database.
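For the temp-then-rename approach, a minimal sketch in C (paths and names are made up for illustration; the explicit fsync() before rename() is the step mentioned in the parenthetical above):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int save_file_atomically(const char *path, const char *tmp_path,
                             const void *buf, size_t len) {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        /* Write the new contents to the temporary file and flush them. */
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        close(fd);
        /* rename() atomically replaces the old file, so readers see either
           the old contents or the new, never a half-written file. */
        if (rename(tmp_path, path) != 0) {
            unlink(tmp_path);
            return -1;
        }
        return 0;
    }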
> I'd expect more "smartness" from Linux, like flushing earlier as soon as there is no "write pressure".
Well, this would only mask bugs, not fix them -- it would narrow the window during which a failure causes loss. Meanwhile it would really harm performance in a few ways.
When writing a large file to disk sequentially, the filesystem often doesn't know in advance how much you're going to write, but it cannot make a good decision about where to put the file until it knows how big it will be. So filesystems implement "delayed allocation", where they don't actually decide where to put the file until they are forced to flush it. The longer the flush can be delayed, the better the allocation decision. If we're talking about a large file transfer, the file is probably useless until it is fully downloaded anyway, so flushing it proactively would be pointless.
Also, flushing small writes individually rather than batching them might mean continuously rewriting the same sector (terrible for SSDs!) or consuming bandwidth to a network drive that is shared with other clients. Etc.
> this would only mask bugs, not fix them
If I get corruption once in 100 outages instead of in every one, I'm satisfied. That it merely "masks" anything is not an argument at all.
Writes happen in bursts, and the behavior of a burst wouldn't change if, say, the flush happened just a second after the burst is over instead of waiting 30.
The "delayed allocation" argument is a red herring: in the optimal case, the software can instruct the filesystem to preallocate the whole file size without having to actually fill in the content. If specific applications on Linux don't commonly do that, then that's where the fix belongs.