
Comment by jolmg

5 years ago

> For example: random hardware shutdown midway through a file write operation. How is your app going to react when it reads back that garbled file?

Don't filesystem journals ensure that you can't get a garbled file from sudden shutdowns?

They ensure you don't get a garbled filesystem.

They also expose an API that allows you, if you're very careful and really know what you're doing (like danluu or the SQLite author), to write performant code that won't garble files on random shutdowns. But most programmers, most of the time, would rather just let the OS make smart decisions about performance at the risk of garbling the file, or, if they really need Durability, use a library that provides a higher-level API that takes care of it, like LMDB or an RDBMS like SQLite.

To avoid getting your file garbled, you need to use an API that knows about legal vs. illegal states of the file. So either the API gets a complete memory image of the file content at a legal point in time and rewrites it, or it has to know more about the file format than "it's a stream of bytes you can read or write with random access".
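
As a rough illustration of the first approach (rewriting a complete image of the file at a legal point in time), here's a minimal, untested C sketch of the usual write-a-temp-file-then-rename pattern; the paths are made up and error handling is abbreviated:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Atomically replace `path` with `len` bytes from `buf`: write a complete
     * new image to a temp file, flush it to disk, then rename it over the old
     * file. A reader sees either the old or the new content, never a mix. */
    int save_snapshot(const char *path, const void *buf, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        if (write(fd, buf, len) != (ssize_t)len || /* one complete, legal image */
            fsync(fd) != 0) {                      /* force the data to disk    */
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);

        return rename(tmp, path); /* atomically swap the old image for the new */
    }

A fully careful version also fsyncs the containing directory afterwards, so the rename itself is durable.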

Popular APIs to write files are either cursor-based (usually with buffering at the programming-language standard library level, I think, which takes control of Durability away from the programmer) or memory-mapped (which really takes control of Durability away from the programmer).
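
To make the buffering point concrete, an untested sketch with a hypothetical file name: fputs() only copies into stdio's user-space buffer, fflush() hands the bytes to the kernel, and only fsync() on the underlying descriptor asks for them to reach stable storage.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f = fopen("example.txt", "w"); /* hypothetical file */
        if (!f)
            return 1;

        fputs("hello\n", f); /* sits in stdio's user-space buffer     */
        fflush(f);           /* pushed to the kernel's page cache     */
        fsync(fileno(f));    /* asked to be written to stable storage */

        return fclose(f);
    }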

SQLite uses the cursor API and is very careful about buffer flushing, enabling it to promise Durability. Also, to avoid rewriting the whole file for each change, it does its own journaling inside the file* - like most RDBMSs do.

* Well, it has a mode where it uses a more advanced technique instead to achieve the same guarantees with better performance
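
For reference, that more advanced technique is SQLite's write-ahead log (WAL) mode, which you opt into per database. A minimal, untested sketch using the C API (the database file name is made up):

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;
        if (sqlite3_open("app.db", &db) != SQLITE_OK) /* hypothetical file */
            return 1;

        /* Switch from the default rollback journal to the write-ahead log:
         * commits append to app.db-wal instead of rewriting pages in the
         * main file right away, which usually means fewer syncs per commit. */
        char *err = NULL;
        if (sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, &err) != SQLITE_OK) {
            fprintf(stderr, "%s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }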

  • > They ensure you don't get a garbled filesystem.

    Well, they do that, but they also protect data to reasonable degrees. For example, ext3/4's default journaling mode "ordered" protects against corruption when appending data or creating new files. It admittedly doesn't protect when doing direct overwrites (journaling mode "journal" does, however), but I'm pretty sure people generally avoid doing direct overwrites anyway, and instead write to a new file and rename over the old one.

    I'm not sure if it would protect files that are clobbered with O_TRUNC on opening (like when using > in the shell). I would imagine that O_TRUNC causes new blocks to be used, so the old data isn't overwritten, and it isn't discarded either, because the old file metadata identifying the old blocks would be backed up in the journal.

    > They also expose an API that allows you, if you're very careful and really know what you're doing (like danluu or the SQLite author), to write performant code that won't garble files on random shutdowns.

    As far as I see, for the general case, being "very careful and really knowing what you're doing" consists of just avoiding direct overwrites. Of course, a single file that persists data for software with needs like a web server's (small updates to a big file in a long-running process) is going to want the performance benefits of direct overwrites. I can totally see SQLite needing special care. However, I don't think those needs apply to all applications.

    • When I said "allows you, if you're very careful and really know what you're doing, to write performant code that won't garble files", by "performant" I was alluding to direct overwrites. If you don't need direct overwrites (perhaps because your save files are tiny), then no problem. If you do, you should use SQLite or LMDB or something (see the sketch at the end of this comment), unless you work at Oracle or somewhere else where your job is to compete with them.

      The example I had in mind was Word, which gave up on direct overwrites (essentially managing its own filesystem-in-a-file) in favor of zipped XML. That's really good enough when writing a three-page letter, but terrible when writing a book like my mother is. Had they used SQLite as a file format, we would've gotten orders-of-magnitude faster saves in software billions of people use every day.
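
      For anyone wondering what "use LMDB or something" looks like, a rough, untested sketch with the LMDB C API (the ./mydb directory and the key/value are made up, and the directory must exist before mdb_env_open): you get small, durable, transactional updates without managing the on-disk format yourself.

          #include <lmdb.h>

          int main(void)
          {
              MDB_env *env;
              MDB_txn *txn;
              MDB_dbi dbi;

              /* ./mydb is a hypothetical, pre-existing directory. */
              if (mdb_env_create(&env) != 0 ||
                  mdb_env_open(env, "./mydb", 0, 0664) != 0)
                  return 1;

              /* A small in-place update wrapped in a transaction: LMDB, not the
               * application, worries about power loss mid-write. */
              if (mdb_txn_begin(env, NULL, 0, &txn) != 0)
                  return 1;
              mdb_dbi_open(txn, NULL, 0, &dbi);

              MDB_val key = { .mv_size = 5, .mv_data = "score" };
              MDB_val val = { .mv_size = 2, .mv_data = "42" };
              mdb_put(txn, dbi, &key, &val, 0);

              mdb_txn_commit(txn); /* durable once this returns, with default flags */
              mdb_env_close(env);
              return 0;
          }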

    • > As far as I see for the general case, being "very careful and really knowing what you're doing" consists of just avoiding direct overwrites.

      That's a dangerous thing to say. There are many ways to mess up your data, without directly overwriting old data.

      If you write a new file, close it, then rename, on a typical Linux filesystem mounted with reasonable options, on compliant hardware, I think you should end up with either the old or the new version of the file on power loss, even if you don't sync the proper things in the proper order, but that's only because of special handling of this common pattern. See e.g. the XFS zero-size-file debacle. (A rough sketch of the careful ordering is at the end of this comment.)

      Not an expert.
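
      To spell out "syncing the proper things in the proper order" without relying on that special handling, the pattern looks roughly like this (untested sketch; paths are hypothetical and errors are ignored for brevity):

          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>

          void careful_replace(const void *buf, size_t len)
          {
              int fd = open("data.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
              write(fd, buf, len);
              fsync(fd);                  /* 1. new contents are on disk          */
              close(fd);

              rename("data.tmp", "data"); /* 2. atomically point the name at them */

              int dir = open(".", O_RDONLY);
              fsync(dir);                 /* 3. make the rename itself durable    */
              close(dir);
          }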

Not really. If you have a file format that requires changes to be made in, e.g., two places, then it's entirely possible to write to one place, have the system shut down before the second place is ever written, and end up with a corrupt file.

The journal ensures (helps ensure?) that individual file operations either happen or don't, and it can improve write performance, but it can't possibly know that you need to write, e.g., two separate 20TB streams to have a non-corrupt file.

  • For a single file, I thought that write operations were committed when e.g. closing the file or doing fsync, but now I'm not sure. I wonder if the system is free to commit immediately after a write() ends.

    Based on your scenario, if an application-level "change" involves updating two files, and updating only one of them counts as corruption, you're right that filesystem journaling wouldn't suffice. However, in that case it wouldn't be a single file that was corrupted.

    Still, I wonder about the other case, about when the filesystem decides to commit.

    • The system is free to commit at any point after a write() -- fsync() simply guarantees that the commit has happened (note that close() alone doesn't force data to disk; see the sketch at the end of this comment).

      >Based on your scenario, if an application-level "change" involves updating 2 files

      It could be two parts of the same file too. E.g. if you're using something like recutils with a single file to implement double-entry accounting and only one entry gets committed. You'll at least be able to detect the corruption in that case (not that you can in general), but you won't be able to fix it using only the contents of the file.
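
      To make the write()/fsync() point concrete, a tiny untested sketch (file name made up): the kernel may flush the page cache at any moment after write() returns, and only fsync() guarantees the data has reached the disk; close() makes no such promise.

          #include <fcntl.h>
          #include <unistd.h>

          int main(void)
          {
              int fd = open("ledger.rec", O_WRONLY | O_CREAT | O_APPEND, 0644);
              if (fd < 0)
                  return 1;

              write(fd, "entry\n", 6); /* in the page cache; may hit disk anytime */
              fsync(fd);               /* only now guaranteed to be on disk       */

              return close(fd);        /* close() alone would not guarantee it    */
          }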