
Comment by acqq

10 years ago

Wow, 5 and 30 seconds before metadata and data are flushed? That sounds unbelievably long. If it's true, almost every power loss results in the loss of whatever was written in the last 15 seconds, on average? Is it really that bad?

I'd expect more "smartness" from Linux, like flushing earlier as soon as there is no "write pressure".

> If it's true, almost every power loss results in the loss of whatever was written in the last 15 seconds, on average? Is it really that bad?

No, because correct programs use sync() and/or fsync() to force timely flushes.
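For illustration, here is a minimal C sketch of what "use fsync() to force a timely flush" looks like for a single file; the function name write_durably and the abbreviated error handling are mine, not anything from the comment:

    #include <fcntl.h>
    #include <unistd.h>

    /* Write a buffer and force it to stable storage before reporting success. */
    int write_durably(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len) {  /* real code must also handle short writes */
            close(fd);
            return -1;
        }
        if (fsync(fd) != 0) {                       /* the data reaches the disk here, not at write() */
            close(fd);
            return -1;
        }
        return close(fd);
    }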

A good database should not reply successfully to a write request until the write has been fully flushed to disk, so that an "acknowledged write" can never be lost. Also, it should perform write and sync operations in such a sequence that it cannot be left in a state where it is unable to recover -- that is, if a power outage happens during the transaction, then on recovery the database is either able to complete the transaction or undo it.

The basic way to accomplish this is to use a journal: each transaction is first appended to the journal and the journal synced to disk. Once the transaction is fully on-disk in the journal, then the database knows that it cannot "forget" the transaction, so it can reply successfully and work on updating the "real" data at its leisure.
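As a rough sketch of that ordering (the names and the record format are assumptions, not any particular database's API):

    #include <unistd.h>

    /* Append one transaction record to the journal and make it durable.
     * Only after this returns 0 may the database acknowledge the write;
     * updating the "real" data files can then happen lazily. */
    int journal_commit(int journal_fd, const void *record, size_t len)
    {
        if (write(journal_fd, record, len) != (ssize_t)len)  /* journal_fd is opened with O_APPEND */
            return -1;
        if (fsync(journal_fd) != 0)   /* past this point the transaction cannot be "forgotten" */
            return -1;
        return 0;
    }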

Of course, if you're working with something that is not a database, then who knows whether it syncs correctly. (For that matter, even if you're working with a database, many have been known to get it wrong, sometimes intentionally in the name of performance. Be sure to read the docs.)

For traditional desktop apps that load and save whole individual files at a time, the "write to temporary, then rename" approach should generally get the job done (technically you're supposed to fsync() between writing and renaming, but many filesystems now do this implicitly). For anything more complicated, use SQLite or a full database.
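A sketch of that save pattern, assuming the caller supplies a temporary path on the same filesystem (so the rename stays atomic); the name save_atomically is illustrative:

    #include <fcntl.h>
    #include <stdio.h>    /* rename() */
    #include <unistd.h>   /* write(), fsync(), close(), unlink() */

    int save_atomically(const char *path, const char *tmp_path,
                        const void *buf, size_t len)
    {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        close(fd);
        /* rename() replaces the old file atomically, so a crash leaves either
         * the old version or the complete new one, never a half-written mix. */
        return rename(tmp_path, path);
    }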

> I'd expect more "smartness" from Linux, like flushing earlier as soon as there is no "write pressure".

Well, this would only mask bugs, not fix them -- it would narrow the window during which a failure causes loss. Meanwhile it would really harm performance in a few ways.

When writing a large file to disk sequentially, the filesystem often doesn't know in advance how much you're going to write, but it cannot make a good decision on where to put the file until it knows how big it will be. So filesystems implement "delayed allocation", where they don't actually decide where to put the file until they are forced to flush it. The longer the flush time, the better. If we're talking about a large file transfer, the file is probably useless if it isn't fully downloaded yet, so flushing it proactively would be pointless.

Also, flushing small writes rather than batching them might mean continuously rewriting the same sector (terrible for SSDs!) or consuming bandwidth to a network drive that is shared with other clients. Etc.

  • > this would only mask bugs, not fix them

    If I get a corruption once in 100 outages instead of in every one, I'm satisfied. That it "masks" anything is not an argument at all.

    The writes happen in bursts. The behavior of the bursts won't change if, for example, the last write is flushed a second after the burst is over instead of after waiting 30.

    The "delayed allocation" is a red herring: in the optimal case, the software can instruct the filesystem to preallocate the whole file size without having to actually fill the content. If it's not common by some specific applications on Linux, that's the place to fix it.