
Comment by corysama

5 years ago

“Atomic transactions” is a feature that needs formal support in random file formats way more often than people realize. Simply writing to a file at all in a guaranteed-atomic way is much harder than it looks. That guarantee becomes important when your app gets widely distributed. If you have a million users of your free mobile app, 1-in-a-million events happen every day. For example: random hardware shutdown midway through a file write operation. How is your app going to react when it reads back that garbled file?

I’ve used SQLite on mobile apps to mitigate this problem. I’ve used LMDB on a cloud app where the server was recording a lot of data, but also rebooting unexpectedly. Would recommend. I’ve also gone through the process of crafting an atomic file write routine in C. https://danluu.com/file-consistency/ It was “fun” if your idea of fun is responding to the error code of fclose(), but I would not recommend...

POSIX (and I think Windows) guarantees that this sequence atomically and durably overwrites a file:

    tempfile = mkstemp(filename-XXXX)
    write(tempfile)
    fsync(tempfile)
    close(tempfile)
    rename(tempfile, filename)
    sync()

Assume the entire write failed if any of the above return an error.

In some systems (nfs and ext3 come to mind), you can skip the fsync and/or sync, but don’t do that. It doesn’t make things significantly faster on the systems where it’s safe, but it definitely will lose data on other systems.

The only loophole I know of is that the final sync can fail internally but still return as if it succeeded. If that happens, the file system is probably hosed anyway.
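
Here's a minimal C sketch of that sequence (the atomic_replace name and the error-handling details are illustrative, not part of the original recipe), assuming the whole new file contents fit in a memory buffer:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Atomically replace `path` with `len` bytes from `buf`.
       Returns 0 on success, -1 on failure (assume nothing was replaced). */
    static int atomic_replace(const char *path, const void *buf, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s-XXXXXX", path);

        int fd = mkstemp(tmp);                    /* tempfile = mkstemp(filename-XXXX) */
        if (fd < 0)
            return -1;

        const char *p = buf;
        size_t left = len;
        while (left > 0) {                        /* write(tempfile), handling short writes */
            ssize_t n = write(fd, p, left);
            if (n < 0) goto fail;
            p += n;
            left -= (size_t)n;
        }

        if (fsync(fd) != 0) goto fail;            /* fsync(tempfile) */
        if (close(fd) != 0) { fd = -1; goto fail; }
        fd = -1;

        if (rename(tmp, path) != 0) goto fail;    /* rename(tempfile, filename) */
        sync();                                   /* sync(); returns void on POSIX */
        return 0;

    fail:
        if (fd >= 0) close(fd);
        unlink(tmp);
        return -1;
    }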

  • You need a recovery step on startup to retry the rename if tempfile is complete, or delete it if it isn't.

    That means you need a way to verify that tempfile is complete. I do that by removing filename after completing tempfile. And that requires a placeholder for filename if it didn't already exist (e.g. a symlink to nowhere).

    On crash, rename may leave both files in place.

    This technique doesn't work if you have hardlinks to filename which should refer to the new file.

    • Regardless of whether the tempfile is complete, you can just ignore (or delete) it on startup. From the caller's perspective, the save operation doesn't succeed until the rename is done and written to disk.
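
      A minimal sketch of that simpler recovery (cleanup_stale_tempfiles is a made-up name), assuming any leftover temp files match the mkstemp pattern from the recipe above:

          #include <glob.h>
          #include <stdio.h>
          #include <unistd.h>

          /* On startup, delete any leftover temp files from interrupted saves.
             The rename() is the commit point, so an orphaned tempfile can always
             be discarded: either the save never finished, or it finished and the
             data now lives under the real filename. */
          static void cleanup_stale_tempfiles(const char *path)
          {
              char pattern[4096];
              glob_t g;

              snprintf(pattern, sizeof pattern, "%s-??????", path);  /* mkstemp's 6-char suffix */
              if (glob(pattern, 0, NULL, &g) == 0) {
                  for (size_t i = 0; i < g.gl_pathc; i++)
                      unlink(g.gl_pathv[i]);
                  globfree(&g);
              }
          }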

I second the recommendation of LMDB. With one important caveat: under heavy write load it is a perfect demonstration of the brokenness of the semaphore implementation on FreeBSD and macOS.

  • In what way is LMDB better than eg SQLite or Redis? For what kinds of use cases would you recommend it?

    • LMDB and SQLite are not directly comparable. LMDB is a transactional B+tree-based key/value store. SQLite is an implementation of a transactional SQL data model on top of a B+tree-based key/value store, so it is logically at least one abstraction layer higher than LMDB. (Key/value stores underlie pretty much all of the other data models you'll ever use.)

      That aside - LMDB is not just smaller, faster, and more reliable than SQLite, it is also smaller/faster/more reliable than SQLite's own B+tree implementation, and SQLite can be patched to use LMDB instead of its own B+tree code, resulting in a smaller/faster footprint for SQLite itself.

      Proof of concept was done here https://github.com/LMDB/sqlightning

      A new team has picked this up and carried it forward https://github.com/LumoSQL/LumoSQL

      Generally, unless your application has fairly simple data storage needs, it's better to use some other data model built on top of LMDB than to use it (or any K/V store) directly. (But if building data storage servers and implementing higher level data models is your thing, then you'd most likely be building directly on top of LMDB.)

    • It's very simple. Single C file implementation. Binary blob keys : binary blob values. The end. If all you need is a bunch of blobs written and read back reliably while in a statistically unreliable situation, LMDB is great. (A minimal usage sketch follows below.)

      3 replies →
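
      To make that concrete, here's a minimal LMDB write-and-read sketch in C; the path, key, and value below are made up, and error handling is reduced to a single CHECK macro:

          #include <lmdb.h>
          #include <stdio.h>
          #include <string.h>

          #define CHECK(call) do { int rc_ = (call); if (rc_) { fprintf(stderr, "lmdb: %s\n", mdb_strerror(rc_)); return 1; } } while (0)

          int main(void)
          {
              MDB_env *env;
              MDB_dbi dbi;
              MDB_txn *txn;
              MDB_val key, val;

              CHECK(mdb_env_create(&env));
              CHECK(mdb_env_open(env, "./db", 0, 0664));   /* "./db" must be an existing directory */

              /* Write transaction: everything below lands durably, or none of it does. */
              CHECK(mdb_txn_begin(env, NULL, 0, &txn));
              CHECK(mdb_dbi_open(txn, NULL, 0, &dbi));
              key.mv_data = "some-key";   key.mv_size = strlen("some-key");
              val.mv_data = "some-value"; val.mv_size = strlen("some-value");
              CHECK(mdb_put(txn, dbi, &key, &val, 0));
              CHECK(mdb_txn_commit(txn));

              /* Read it back in a read-only transaction. */
              CHECK(mdb_txn_begin(env, NULL, MDB_RDONLY, &txn));
              CHECK(mdb_get(txn, dbi, &key, &val));
              printf("%.*s\n", (int)val.mv_size, (char *)val.mv_data);
              mdb_txn_abort(txn);

              mdb_env_close(env);
              return 0;
          }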

> For example: random hardware shutdown midway through a file write operation. How is your app going to react when it reads back that garbled file?

Don't filesystem journals ensure that you can't get a garbled file from sudden shutdowns?

  • They ensure you don't get a garbled filesystem.

    They also expose an API that allows you, if you're very careful and really know what you're doing (like danluu or the SQLite author), to write performant code that won't garble files on random shutdowns. But most programmers at most times would rather just let the OS make smart decisions about performance at the risk of garbling the file, or if they really need Durability, just use a library that provides a higher level API that takes care of it, like LMDB or an RDBMS like SQLite.

    To not get your file garbled, you need to use an API that knows about legal vs. illegal states of the file. So either the API gets a complete memory image of the file content at a legal point in time and rewrites it, or it has to know more about the file format than "it's a stream of bytes you can read or write with random access".

    Popular APIs to write files are either cursor based (usually with buffering at the programming language standard library level, I think, which takes control of Durability away from the programmer) or memory mapped (which realllly takes control of Durability from the programmer).

    SQLite uses the cursor API and is very careful about buffer flushing (a small sketch of what that involves follows this comment), enabling it to promise Durability. Also, to not need to rewrite the whole file for each change, it does its own Journaling inside the file* - like most RDBMSs do.

    * Well, it has a mode where it uses a more advanced technique instead to achieve the same guarantees with better performance
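
    To make "careful about buffer flushing" concrete, here is a small sketch (not from this comment, and only about the Durability of an append, not about atomic replacement): with the stdio cursor API, fflush() only moves data out of the library's buffer, and fsync() is still needed to push it out of the OS cache.

        #include <stdio.h>
        #include <unistd.h>

        /* Sketch: durably append a line using the stdio cursor API.
           fprintf() fills a user-space buffer; fflush() hands that buffer to the
           kernel; only fsync() asks the kernel to put it on stable storage.
           Every step can fail, so every return value matters - fclose() included. */
        static int append_line_durably(const char *path, const char *line)
        {
            FILE *fp = fopen(path, "a");
            if (!fp)
                return -1;

            if (fprintf(fp, "%s\n", line) < 0) goto fail;
            if (fflush(fp) != 0) goto fail;          /* library buffer -> kernel */
            if (fsync(fileno(fp)) != 0) goto fail;   /* kernel cache -> stable storage */
            return fclose(fp) == 0 ? 0 : -1;

        fail:
            fclose(fp);
            return -1;
        }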

    • > They ensure you don't get a garbled filesystem.

      Well, they do that, but they also protect data to reasonable degrees. For example, ext3/4's default journaling mode "ordered" protects against corruption when appending data or creating new files. It admittedly doesn't protect when doing direct overwrites (journaling mode "journal" does, however), but I'm pretty sure people generally avoid doing direct overwrites anyway, and instead write to a new file and rename over the old one.

      I'm not sure if it would protect files that are clobbered with O_TRUNC on opening (like when using > in the shell). I would imagine that using O_TRUNC causes new blocks to be used, so the old data isn't overwritten, and that the old data isn't discarded either, because the old file metadata identifying its blocks would be backed up in the journal.

      > They also expose an API that allows you, if you're very careful and really know what you're doing (like danluu or the SQLite author), to write performant code that won't garble files on random shutdowns.

      As far as I see, for the general case, being "very careful and really knowing what you're doing" consists of just avoiding direct overwrites. Of course, a single file that persists data for software with needs like a web server's (small updates to a big file in a long-running process) is going to want the performance benefits of direct overwrites. I can totally see SQLite needing special care. However, I don't think those needs apply to all applications.

      2 replies →

  • Not really. If you have a file format that requires, e.g., changes to be made in two places, then it's entirely possible to write to one place, have the system shut down before the second place is ever written, and end up with a corrupt file.

    The journal ensures (helps ensure?) that individual file operations either happen or don't and can improve write performance, but it can't possibly know that you need to write, e.g., two separate 20TB streams to have a non-corrupt file.

    • For a single file, I thought that write operations were committed when e.g. closing the file or doing fsync, but now I'm not sure. I wonder if the system is free to commit immediately after a write() ends.

      Based on your scenario, if an application-level "change" involves updating 2 files, interpreting the update of only one file and not the other as a corruption, you're right that filesystem journaling wouldn't suffice. However, in that case it wouldn't be that a single file was corrupted.

      Still, I wonder about the other case, about when the filesystem decides to commit.

      1 reply →

Windows had a feature that provided snapshot-based transactions across the entire file system (NTFS-only, though). Unfortunately, it has been deprecated... it's such a shame that we apparently can't move on from the 40-year-old notion of a filesystem.