Pretty much spot on. Local in-kernel file systems are hard, partly because of their essential nature and partly because of their history. A lot of the codebases involved still show their origins on single-core systems and pre-NCQ SATA disks, and the development/testing methods are from the same era. The developers always have time to improve the numbers on some ancient micro-benchmark, but new features often get pushed to the LVM layer (snapshots), languish for ages (unions/overlays), or are simply ignored (anything alternative to the fsync sledgehammer).
The only way a distributed file system such as the one I work on can provide sane behavior and decent performance to our users is to use local file systems only for coarse-grained space allocation and caching. Sometimes those magic incantations from ten-year-old LKML posts don't really work, because they were never really tested for more than a couple of simple cases. Other times they have unexpected impacts on performance or space consumption. Usually it's easier and/or safer just to do as much as possible ourselves. Databases - both local and distributed - are in pretty much the same boat.
Some day, I hope, all of this scattered and repeated effort will be combined into a common library that Does All The Right Things (which change over time) and adds features with a common API. It's not quite as good as if the stuff in the kernel had been done right, but I think it's the best we can hope for at this point.
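What distributed filesystem do you work on?
I'm on the Gluster team.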
AIUI, ZFS was explicitly designed to deal with this sort of data corruption - one of the descriptions of the design I've heard is "read() will return either the contents of a previous successful write() or an error". That would (in principle) prevent the file containing "a boo" or "a far" at any point.
It looks like one of the authors cited in this article has written a paper analysing ZFS - though they admittedly don't test its behaviour on crashes. Citation here, in PDF form: http://pages.cs.wisc.edu/~kadav/zfs/zfsrel.pdf
(edited to add: This only deals with the second part of this article. The first part would still be important even on ZFS)
Right, copy-on-write filesystems (ZFS, Btrfs) are explicitly designed to prevent that kind of corruption by never editing blocks in place, but rather copying the contents to a new block and using a journaled metadata update to point the file at its new block.
ZFS also includes checksumming of metadata and data. "Silent" write errors become loud the next time the data is accessed and the checksums don't match. This can't prevent all errors, but it has some very nice data integrity properties - combined with its RAID format, you can likely recover from most failures, and with RAIDZ2 you can recover from scattered failures across all drives even if one drive has completely died. This is actually fairly common - modern drives are very large, and rust is more susceptible to 'cosmic rays' than one might think.
There is an easy way to write data without corruption. First copy your file-to-be-changed to a temporary file, or create a new temporary file. Then modify the temporary file and write whatever you want into it. Finally, use rename() to atomically replace the old file with the temporary one.
The same logic also applies to directories, although you will have to use links or symlinks to get something truly atomic.
It may not work on strangely configured systems, for example if your files are spread over different devices over the network (or maybe with NFS). But in those cases you will be able to detect it if you catch the errors from rename() and co (and you should catch them, of course). So no silver bullet here, but still a good shot.
I'm surprised rename() wasn't mentioned in the article; it's a well-known technique to atomically update a file, and it's very practical for small-ish files.
Note that in the general case, you should fsync() the temporary file before you rename() it over the original - but ext3 and ext4 in writeback mode added a heuristic to do that automatically, because ext3 in the default ordered mode would effectively do that and many applications came to assume it.
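To make that concrete, here is a minimal C sketch of the full sequence being discussed (write the temp file, fsync it, rename it over the original, then fsync the containing directory so the rename itself is durable). The function name is made up and error handling is abbreviated:

    /* Minimal sketch of atomic-replace-by-rename; illustrative names,
       abbreviated error handling. */
    #include <fcntl.h>
    #include <unistd.h>

    int atomic_replace(const char *path, const char *tmp_path,
                       const char *dir_path, const void *buf, size_t len)
    {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) != 0) { close(fd); return -1; }   /* data durable before rename */
        if (close(fd) != 0) return -1;

        if (rename(tmp_path, path) != 0) return -1;     /* atomic replacement */

        /* For durability of the rename itself, fsync the parent directory. */
        int dirfd = open(dir_path, O_RDONLY);
        if (dirfd < 0) return -1;
        int rc = fsync(dirfd);
        close(dirfd);
        return rc;
    }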
rename is atomic, but it is not guaranteed to be durable. In order for rename to be durable, I've learned that you have to fsync the parent directory.
I was saddened when I learned this. I used to use this old trick for a lot of my systems. I learned it from reading the code of old well-written unix systems programs.
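I guess that's not viable if your files are huge monolithic databases.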
It also doesn't work if you want to support a non-trivial update rate, or if there's any possibility of contention with another process trying to do the same thing. It's the sort of thing that app writers get addicted to, because it does work in the super-easy case, but it doesn't help at all when you need to solve harder storage problems than updating a single config file once in a while.
"How is it that desktop mail clients are less reliable than gmail...?"
Made me chuckle. I've been told off by a former Googler colleague enough times now to have learned that Gmail is more complex than anyone imagines on a first guess, in order to be "reliable".
It is certainly the google service that I use the most. In a decade of quite heavy usage I remember one outage of 1-2 hours (with no data loss). To me this is the gold standard that the rest of us should aspire to. :)
Lately (the last year or so) I've started to notice substantial data loss: either old mails completely missing, or large mails being truncated (destroying inline images, for example).
So to anyone relying on gmail for safe keeping: Don't.
Interesting that none of the cited software uses maildir.
Breaking an mbox is an extremely simple thing, as the format leaves no possibility of error checking, parallel writing, rewriting lost data, or anything else.
Outlook's mail folders are marginally better, allowing for error detection, but really, that's a lame first paragraph for introducing a great article.
I've recently been playing with nbdkit, which is basically FUSE but for block devices rather than file systems.
I was shocked to discover that mke2fs doesn't check the return value of its final fsync call. This is compounded by the fact that pwrite calls don't fail across NBD (the writes are cached, so the caller's stack is long gone by the time they get flushed across the network and fail...)
As a test, I created an nbdkit plugin which simply throws away every write. Guess what? mke2fs will happily create a file system on such a block device and not report failure. You only discover a problem when you try to mount the file system.
The article's table of filesystem semantics is missing at least one X: Appends on ext4-ordered are not atomic. When you append to a file, the file size (metadata) and content (data) must both be updated. Metadata is flushed every 5 seconds or so, data can sit in cache for more like 30. So the file size update may hit disk 25s before the data does, and if you crash during that time, then on recovery you'll find the data has a bunch of zero bytes appended instead of what you expected.
(I like to test my storage code by running it on top of a network block device and then simulating power failure by cutting the connection randomly. This is one problem I found while doing that.)
Wow, 5 and 30 seconds before metadata and data flush? It sounds unbelievably long. If it's true, almost every power loss results in a data loss of whatever was written in the last 15 seconds, on average? Is it so bad?
I'd expect more "smartness" of Linux, like, as soon as there is no "write pressure" to flush earlier.
> If it's true, almost every power loss results in a data loss of whatever was written in the last 15 seconds, on average? Is it so bad?
No, because correct programs use sync() and/or fsync() to force timely flushes.
A good database should not reply successfully to a write request until the write has been fully flushed to disk, so that an "acknowledged write" can never be lost. Also, it should perform write and sync operations in such a sequence that it cannot be left in a state where it is unable to recover -- that is, if a power outage happens during the transaction, then on recovery the database is either able to complete the transaction or undo it.
The basic way to accomplish this is to use a journal: each transaction is first appended to the journal and the journal synced to disk. Once the transaction is fully on-disk in the journal, then the database knows that it cannot "forget" the transaction, so it can reply successfully and work on updating the "real" data at its leisure.
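As a rough sketch of that commit path (the record format and names here are illustrative, not any particular database's):

    /* Rough sketch of the journal-then-acknowledge commit path described
       above; the record format and names are illustrative only.
       journal_fd is assumed to have been opened with O_APPEND. */
    #include <stdint.h>
    #include <unistd.h>

    int journal_commit(int journal_fd, const void *rec, uint32_t len)
    {
        /* A length prefix lets recovery detect a torn final record. */
        if (write(journal_fd, &len, sizeof len) != (ssize_t)sizeof len) return -1;
        if (write(journal_fd, rec, len) != (ssize_t)len) return -1;

        /* Only after fsync succeeds may the write be acknowledged; the
           "real" data structures can then be updated at leisure. */
        return fsync(journal_fd);
    }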
Of course, if you're working with something that is not a database, then who knows whether it syncs correctly. (For that matter, even if you're working with a database, many have been known to get it wrong, sometimes intentionally in the name of performance. Be sure to read the docs.)
For traditional desktop apps that load and save whole individual files at a time, the "write to temporary then rename" approach should generally get the job done (technically you're supposed to fsync() between writing and renaming, but many filesystems now do this implicitly). For anything more complicated, use sqlite or a full database.
> I'd expect more "smartness" of Linux, like, as soon as there is no "write pressure" to flush earlier.
Well, this would only mask bugs, not fix them -- it would narrow the window during which a failure causes loss. Meanwhile it would really harm performance in a few ways.
When writing a large file to disk sequentially, the filesystem often doesn't know in advance how much you're going to write, but it cannot make a good decision on where to put the file until it knows how big it will be. So filesystems implement "delayed allocation", where they don't actually decide where to put the file until they are forced to flush it. The longer the flush time, the better. If we're talking about a large file transfer, the file is probably useless if it isn't fully downloaded yet, so flushing it proactively would be pointless.
Also flushing small writes rather than batching might mean continuously rewriting the same sector (terrible for SSDs!) or consuming bandwidth to a network drive that is shared with other clients. Etc.
This problem can be fixed. We need to rethink file system semantics. Here's an approach:
Files are of one of the following types:
Unit files
For a unit file, the unit of consistency is the entire file. Unit files can be created or replaced, but not modified. Opening a unit file for writing means creating a new file. When the new file is closed successfully, the new version replaces the old version atomically. If anything goes wrong between create and successful close, including a program abort or a system crash, the old version remains and the new version is deleted. File systems are required to maintain that guarantee.
Opens for read while updating is in progress reference the old version. Thus, all readers always see a consistent version.
They're never modified in place once written. It's easy for a file system to implement unit file semantics. The file system can cache or defer writes. There's no need to journal. The main requirement is that the close operation must block until all writes have committed to disk, then return a success status only if nothing went wrong.
In practice, most files are unit files. Much effort goes into trying to get unit file semantics - ".part" files, elaborate file renaming rituals to try to get an atomic rename (different for each OS and file system), and such. It would be easier to just provide unit file semantics. That's usually what you want.
Log files
Log files are append-only. The unit of consistency is one write. The file system is required to guarantee that, after a program abort or crash, the file will end cleanly at the end of a write. A "fsync" type operation adds the guarantee that the file is consistent to the last write. A log file can be read while being written if opened read-only. Readers can seek, but writers cannot. Append is always at the end of the file, even if multiple processes are writing the same file.
This, of course, is what you want for log files.
Temporary files
Temporary files disappear in a crash. There's no journaling or recovery. Random read/write access is allowed. You're guaranteed that after a crash, they're gone.
Managed files
Managed files are for databases and programs that care about exactly when data is committed. A "write" API is provided which returns a status when the write is accepted, and then makes an asynchronous callback when the write is committed and safely on disk. This allows the database program to know which operations the file system has completed, but doesn't impose an ordering restriction on the file system.
This is what a database implementor needs - solid info about if and when a write has committed. If writes have to be ordered, the database program can wait for the first write to be committed before starting the second one. If something goes wrong after a write request was submitted, the caller gets status info in the callback.
This would be much better than the present situation of trying to decide when a call to "fsync" is necessary. It's less restrictive in terms of synchronization - "fsync" waits for everything to commit, which is often more than is needed just to finish one operation.
This could be retrofitted to POSIX-type systems. If you start with "creat" and "O_CREAT" you get a unit file by default, unless you specify "O_APPEND", in which case you get a log file. Files in /tmp are temporary files. Managed files have to be created with some new flag. Only a small number of programs use managed files, and they usually know who they are.
This would solve most of the problems mentioned in the original post.
I worked at a storage company and the scariest thing I learned is that your data can be corrupt even though the drive itself says that the data was written correctly. The only way to really be sure is to read your files back after writing them and check that they match. Now whenever I do a backup, I always go through it one more time and do a byte-by-byte comparison before being assured that it's okay.
This is true. Which is why we really, really need checksummed filesystems. I am very worried that this hasn't made its way into mainstream computing yet, especially given the growing drive sizes and massive CPU speed increases.
I run a 10x3TB ZFS raidz2 array at home. I've seen 18 checksum errors at the device level in the last year - these are corruption from the device that ZFS detected with a checksum, and was able to correct using redundancy. If you're not checksumming at some level in your system, you should be outsourcing your storage to someone else; consumer level hardware with commodity file systems isn't good enough.
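Fortunately, zfs on Linux is excellent, and is a two-liner on modern Ubuntu LTS. (add PPA, install zfs.)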
> The only way to really be sure is to read your files back after writing them and check that they match.
This is assuming that the underlying block device would forcibly flush those queued writes to disk and then re-read them, rather than just serving them up straight from the pending write queue without flushing them first.
You generally can't make that assumption about a black box, so reading back your writes guarantees nothing.
Unless you're intimately familiar with your underlying block device you really can't guarantee anything about writes going to physical hardware. All you can do is read its documentation and hope for the best.
If you need a general hack that's pretty much guaranteed to flush your writes to a physical disk, it would be something like:
    After your write, append X random bytes to a file, where X is greater than your block device's advertised internal memory, then call fsync().
Even then you have no guarantees that those writes wouldn't be flushed to the medium while leaving the writes you care about in the block device's internal memory.
This is why end-to-end data integrity with something like T10-PI is a necessity. The kernel block-layer already generates and validates the integrity for us, if the underlying drive supports it, but all major filesystems really need to start supporting it as well.
I don't think that's a necessity for all workflows. Just think about it: that would require all of us buying enterprise 520- or 528-byte-sector drives to store the extra checksum information, and a whole new API up to the application level to confirm, point to point, that the data in the app is the data on the drive on writes, and the data on the drive is the data in the app on reads. It's not like T10-PI comes for free just by doing any one thing; it implies changes throughout the chain.
Great write-up, and it probably explains some issues in my apps a while back. I like that my long-time favorite, XFS, totally kicks ass in the comparisons. I agree on using static analysis and other techniques, as I regularly push that here. What threw me is this line:
"that when they came up with threads, locks, and conditional variables at PARC, they thought that they were creating a programming model that anyone could use, but that there’s now decades of evidence that they were wrong. We’ve accumulated a lot of evidence that humans are very bad at reasoning at these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems."
There were quite a few ways of organizing systems, including objects and functions, that the old guard came up with. UNIX's popularity, and the organization style of some others, pushed the file-oriented approach from mainstream into outright dominance. However, many of us long argued it was a bad idea, and alternatives exist that just need work on the implementation side. We actually saw some of those old ideas put into practice in data warehousing, NoSQL, "the cloud," and so on. We just gotta do more, as we don't have a technical reason for dealing with the nonsense in the write-up: just avoiding getting our hands dirty with the replacements.
It's written in 2004 so I don't know how current it is, but it makes the point that XFS makes certain performance & safety guarantees essentially assuming that you're running on hardware that has a UPS with the ability to interrupt the OS saying "oops, we're out of power".
It was designed by SGI for high-end workstations and supercomputers with long-running jobs (esp. render farms). So that doesn't surprise me. However, it's nice to have all the assumptions in the open, and preferably in user/admin guides. Another issue was it zeroing out stuff on occasion, but they fixed that.
2004 is not current for XFS; that is a decade ago! However, disks finishing writes and not lying about having done so is a critical need for every FS. For some, like ext3, you would notice it less as it was flush-happy.
XFS is becoming the sane default filesystem for servers as it allocates inodes more consistently than the other current mainstream Linux options on multi-disk systems. Basically, small servers now have more disk space and performance than the large systems of 2004. So XFS stayed put in where it starts to make sense, but systems grew to meet its sweet spot much more often.
In Plan 9, mailbox files, like many others, are append-only.
All files are periodically (by default daily) written to a block-coalescing WORM drive, and you can rewind the state of the file system to any date on a per-process basis, which is handy for diffing your codebase etc.
For a while, removing the "rm" command was considered, to underline the permanence of files, but removing temporary data during the daytime hours turned out to be more pragmatic.
How does Plan 9 deal with the equivalent of this append-only pattern, which on Unix can cause corruption if you have multiple writers and the writes are larger than PIPE_BUF (4k by default on Linux)?
Most users of this pattern (concurrent updates to log files) get away with it because their updates are smaller than 4k, but if you're trying to write something as big as an E-Mail with this pattern you can trivially get interleaved writes resulting in corruption.
Surely filesystems are going to go through a massive change when SSDs push standard spinning disks into the history books? They must carry a lot of baggage for dealing with actual spinning disks, much of which is just overhead for super-fast solid state drives. Hopefully this will allow interesting features not possible on spinning disks, like better atomic operations.
"IotaFS: Exploring File System Optimizations for SSDs"
Our hypothesis in beginning this research was simply that
the complex optimizations applied in current file system
technology doesn’t carry over to SSDs given such dramatic changes in performance characteristics. To explore
this hypothesis, we created a very simple file system
research vehicle, IotaFS, based on the incredibly simple and small Minix file system, and found that with a
few modifications we were able to achieve comparable
performance to modern file systems including Ext3 and
ReiserFS, without being nearly as complex.
Yeah, the btrfs 'ssd' mount option does less for the same reason, but it still includes checksums for metadata and data, because SSDs are at least as likely as spinning rust to non-deterministically return your data. So even if it doesn't fix the corruptions (which requires additional copies or parity), at least there's an independent way of being informed of problems.
I wonder how this approach (single file + log) compares to the other usual approach (write second file, move over first):
1. Write the changed data into a temporary file in the same directory (don't touch the original file)
2. Move new file over old file
Does this lead to a simpler strategy that is easier to reason about, where it is less likely for a programmer to get it wrong? At least I see this strategy being applied more often than the "single file + log" approach.
The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place.
That's not guaranteed to work in the face of crashes. The problem is that the directory update could get flushed to disk before the file data.
This is the fundamental problem: when you allow the OS (or the compiler, or the CPU) to re-order operations in the name of efficiency, you lose control over intermediate states, and so you cannot guarantee that these intermediate states are consistent with respect to the semantics that you care about. And this is the even more fundamental problem: our entire programming methodology has revolved around describing what we want the computer to do rather than what we want it to achieve. Optimizations then have to reverse-engineer our instructions and make their best guesses as to what we really meant (e.g. "This is dead code. It cannot possibly affect the end result. Therefore it can be safely eliminated."). Sometimes (often?) those guesses are wrong. When they are, we typically only find out about it after the mismatch between our expectations and the computer's has manifested itself in some undesirable (and often unrecoverable) way.
"That's not guaranteed to work in the face of crashes. The problem is that the directory update could get flushed to disk before the file data."
No, it can work, provided that the temporary file is fsynced before being renamed, the parent directory is fsynced after renaming the file, and that the application only considers the rename to have taken place after the parent directory is fsynced (not after the rename call itself).
Good summary of the situation. It's why I fought out-of-order execution at the hardware and OS levels as much as I could. I even went out of my way to use processors that didn't do it. The market pushed the opposite stuff into dominance. Then came the inevitable software re-writes to get predictability and integrity out of what shouldn't have had problems in the first place. It's all ridiculous.
It bugged me that Sublime Text used to do these so-called atomic saves by default since it screwed with basic unix expectations like fstat and fseek meaningfully working (like a tail -f implementation could boil down to[0]). A concurrently running process using those calls would be off in lala-land as soon as the text file was modified and saved: it would never pick up any modifications, because it and the editor weren't even dealing with the same file any more.
GNU tail lets you say "--follow=descriptor" to follow the content of the file no matter how it gets renamed, or "--follow=name" to re-open the same filename when a different file gets renamed onto it.
Functional versus imperative concurrent shared data approaches provide a good analogy:
* Single file + log: fine grained locking in a shared C data structure. Yuck!
* Write new then move: transactional reference to a shared storage location, something like Clojure's refs. Easy enough.
The latter clearly provides the properties we'd like, the former may but it's a lot more complicated to verify and there are tons of corner cases. So I think move new file over old file is the simpler strategy and way easier to reason about.
> The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place.
Clojure's approach again provides an interesting solution to saving space. Taking the idea of splitting data into multiple files to the logical conclusion, you end up with the structure sharing used in efficient immutable data structures.
Your solution is slower; also, you need to fsync()/fdatasync() the new file before moving, at least on some systems (http://lwn.net/Articles/322823/ is the best reference I can find right now), and you need to fsync() the directory if you wish the rename to be durable (as opposed to just all-or-nothing.)
In general this approach should work fine, but the devil is in the details:
1. You have to flush the change to the temporary file before the move, because otherwise you may get an empty file: the OS may reorder the move and the writes.
2. After the move you have to flush the parent directory of the destination file on POSIX. Windows has a special flag for MoveFileEx() to ensure the operation is done, or you have to call FlushFileBuffers() for the destination file.
The linked paper mentions that many popular programs forget about (1).
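A minimal sketch of the Windows variant mentioned in (2); the wrapper name is made up, but MoveFileEx and its flags are real:

    /* Sketch of a durable replace on Windows; wrapper name is made up. */
    #include <windows.h>

    BOOL replace_durably(const wchar_t *tmp, const wchar_t *dst)
    {
        /* MOVEFILE_WRITE_THROUGH asks the system not to return until the
           move has been flushed to disk; without it, you would call
           FlushFileBuffers() on the destination afterwards. */
        return MoveFileExW(tmp, dst,
                           MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH);
    }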
The article only mentions Linux/Posix systems, are the same problems also present in Windows/NTFS? I was under the impression that, for example, renames on ntfs were crash safe and atomic, which would make the "write temp file then rename to target" work even if the power is cut?
NTFS is not very safe. No data integrity checksums. I think it's at about the same level as ext4, mostly, meaning not very good. One shouldn't trust any critical data to NTFS without checksums and duplication.
NTFS consistency checks and recovery are pretty good. But they won't bring your data back.
Microsoft's ReFS (Resilient File System) might give storage reliability to Windows one day. On Linux you can use ZFS or btrfs (with some reservations) today.
If power is cut during an NTFS logfile update, all bets are off. Hard disks can and will do weird things when losing power unexpectedly. That includes writing incorrect data, corrupting nearby blocks, corrupting arbitrary blocks, etc. That includes earlier logfile data, including checkpoints.
The article makes me wonder whether there's enough abstraction being done via the VFS layer, because all this fsync business that application developers seem to have to do can be so workload and file system specific. And I think that's asking too much from application developers. You might have to fsync the parent dir? That's annoying.
I wonder if the article and the papers it's based on account for how the VFS actually behaves, and whether someone wanting to do more research in this area could investigate this while accounting for recent VFS changes. On Linux in particular I think this is a bigger issue because there are so many file systems the user or distro could have picked, totally unbeknownst to and outside the control of the app developer.
That's definitely asking too much of app developers. Every time someone complains about any of this, the filesystem developers come back with a bit of lore about a non-obvious combination of renameat (yes that's a real thing) and fsync on the parent directory, or some particular flavor of fallocate, or just use AIO and manage queues yourself, or whatever depending on exactly which bogus behavior you're trying to work around. At best it's an unnecessary PITA. At worst it doesn't even do what they claimed, so now you've wasted even more time. Most often it's just non-portable (I'm so sick of hearing about XFS-specific ioctls as the solution to everything) or performs abominably because of fsync entanglement or some other nonsense.
We have libraries to implement "best practices" for network I/O, portable across systems that use poll or epoll or kqueues with best performance on each etc. What we need is the same thing for file I/O. However imperfect it might be, it would be better than where we are now.
Very rudimentary, but a way for an application developer to specify categories of performance/safety trade-off for operations. An app developer might have a simple app that only cares about performance, or only cares about safety; there'd be a default in between. Another app developer might have mixed needs depending on the type of data the app is generating. But this way, if they write the app with category A (let's say that means highest safety at the expense of performance) and their benchmarking determines this is crap, and they have to go to category B for writes, that's a simpler change than going back through their code and refactoring a pile of fsyncs or FUA writes.
I mean, I thought this was a major reason for the VFS abstraction between the application and the kernel anyway. It's also an example of the distinction between open source and free (libre). If, as an application developer, you have to know such esoterics to sanely optimize, you in fact aren't really free to do what you want. You have to go down a particular rabbit hole and optimize for that fs, at the expense of others. That's not a fair choice to have to make.
The inherent issue is that there's a huge performance benefit to be gained by batching updates. FS safety will always come at the cost of performance.
The article doesn't say but I suspect most of the issues it mentions can be mitigated by mounting with the "sync" and "dirsync" options, but that absolutely kills performance.
The APIs involved could definitely be friendlier, but the app developer is using an API that's explicitly performance oriented by default at the cost of safety, and needs to opt-in to get safer writes. Whether the default should be the other way around is one matter, but ultimately someone has to pick which one they want and live with the consequences.
One of the naive assumptions that most of us make is that if there's a power failure, none of the data that was fsynced successfully before will be corrupted.
One thing I've always wanted to try and never had time is to build SQlite as a kernel module, talking directly to the block device layer, and then implement a POSIX file system on top of it.
It wouldn't solve problems with the block device layer itself, but it'd be interesting to see how robust and/or fast it was in real life.
SQLite itself relies on a filesystem for the rollback journal or write-ahead log. So you'd need some kind of abstraction between the block device and SQLite. Might as well just use the existing filesystem and keep SQLite in user space, since that works.
While I experienced this pain first hand, I'm not sure that the FS deserves 100% of the blame. There is enough blame to go around for userspace, filesystems, the block device layer and disk controllers, hardware & firmware.
If you start at the bottom of the stack you have all sorts of issues with drives, their firmware and controllers doing their own levels of caching & re-ordering. Over the years the upper layer developers (block layer / fs) had to deal with hardware that simply lies about what happened or is just plain buggy.
I don't program much in C, or use direct system calls for files. Mostly I use Java.
Does anyone know if any of this applies to Java's IO operations? I'm sure you can force some of this behaviour, but, for instance, will the flush method on OutputStream ensure a proper sync, or is that again dependent on the OS and file system, as described in the article for syscalls?
It's only logical if you think about the bigger picture. Does Java have access to the underlying disk device or does it work with the filesystem? Which component is responsible for the filesystem?
Ah, fond memories of async ext2 corrupting root filesystems beyond recognition... I think we 'invented' 'disposable infrastructure' back in 2002 because the filesystem forced us to... The MegaRAID cards that would eat their configuration along with all your data didn't help either.
Can't remember if we switched to ext3 in ordered or data-journaled mode but it made an immense difference...
IIRC I have seen this discussion before, and the answer was: do an fsync. But for the sake of performance we want to be able to issue write barriers to a group of file handles, so we know the commands will be run in order, and in no other order.
Correct, we want to be able to write in deterministic order. SCSI has supported this natively for decades. Unfortunately SATA doesn't, and the Linux kernel pretty much doesn't, because it can't rely on the storage devices to support it.
Which is just lame; it's mostly because the block layer doesn't support file-based fencing. A mistake made a decade and a half ago, and no one has the will/political power to fix it.
If the block layer supported it, solving the problem of fencing an ATA device would be as simple as issuing a whole-device flush instead of a SYNC_CACHE with range. Which for soft-RAID devices would make a huge impact, because only the devices with dirty data need to be flushed.
Of course the excuse today is that most SCSI devices can't do range-based sync either, and just fall back to whole-device because no one uses the command. Chicken/egg.
Isn't block level duplication/checksumming like RAID supposed to solve this hardware unreliability? I understand that by default RAID is not used on end user desktops.
The layers that ZFS violates were created years before the failure modes were well understood that filesystem-based checksums address. I'm not sure how you _can_ solve these issues without violating the old layers.
In particular: checksumming blocks alongside the blocks themselves (as some controllers and logical volume managers do) handles corruption within blocks, but it cannot catch dropped or misdirected writes. You need the checksum stored elsewhere, where you have some idea what the data _should_ be. Once we (as an industry) learned that those failure modes do happen and are important to address, the old layers no longer made sense. (The claim about ZFS is misleading: ZFS _is_ thoughtfully layered -- it's just that the layers are different, and more appropriate given the better understanding of filesystem failure modes that people had when it was developed.)
Yeah, that's a common error among Markdown writers. It's too easy to forget the last bracket, especially if you are putting a link inside a parenthetical comment in the first place.
Fortunately, it's easy to detect programmatically. I have a little shell script which flags problems in my Markdown files: http://gwern.net/markdown-lint.sh
In this case, you can use Pandoc | elinks -dump to get a rendered text version, and then simply grep in the plain text for various things like "-e '(http' -e ')http' -e '[http' -e ']http'"
For me, Vim's highlighting and concealing prevents this class of typos. It also makes it more pleasant to read the source, as it hides the links unless I'm editing the line.
The article is a troll with some good information.
I've been a heavy mail user for years... Never encountered data loss due to a file system problem, and honestly I can't think of a time in the last decade when anyone I'm acquainted with has. (And I ran a very large mail system for a long time.)
Hell, I've been using OSX predominately for years now, and that garbage file system hasn't eaten any data yet!
There are problems, even fundamental problems, but if someone is literally unable to use any mail client, you need to look at the user before the file system.
The fact that you're seeing "that garbage file system" not eat your data has a lot to do with the absence of crashes and power losses. It has evolved a good deal since the HFS and even HFS+ days; no one uses either of those anymore. It's all HFSJ, with a scant number using HFSX.
20 years ago Mac OS crashed often, and had a file system not designed to account for that. OS X even shipped with non-journaled HFS+. It was only in the 3rd major release of OS X that journaling appeared. Data corruptions, I feel, dropped massively because the OS didn't crash nearly as often, though it did still crash. In the last 4-5 years I'd say I get maybe one or two kernel panics per year on OS X, which is a lot less than I get on Linux desktops. But even on Linux desktops, I can't say when I last saw file system corruption not attributable to pre-existing hardware issues.
PG was quite proud of having just used the file system as a database with Viaweb, claiming that "The Unix file system is pretty good at not losing your data, especially if you put the files on a Netapp." His entire response is worth reading in light of the above article, if only to get a taste of how simultaneously arrogant and clueless PG can be: http://www.paulgraham.com/vwfaq.html
PG and company still think this is a great idea, because the software this forum runs on apparently also does the file system-as-database thing.
No wonder Viaweb was rewritten in C++.
EDIT: to those downvoting, my characterization of PG's response is correct. Its arrogance is undeniable, with its air of "look how smart I am; I see through the marketing hype everyone else falls for," as is its cluelessness, with PG advocating practices that will cause data loss. Viaweb was probably a buggy, unstable mess that needed a rewrite.
being hypervigilant about downvotes on your throwaway
Well, FWIW, it's working. At this instant his total karma is +10, he only has 2 posts, and the 2nd post is slightly greyed. So that means that his original comment is now in the range of +10.
Sadly, the fact that I'm replying here probably means that I need to get a life!
It's clear you're ignorant of the capabilities netapp filers had, even back then.
WAFL uses copy on write trees uniformly for all data and metadata. No currently live block is ever modified in place. By default a checkpoint/snapshot is generated every 10 seconds. NFS operations since the last checkpoint are written into NVRAM as a logical recovery log. Checkpoints are advanced and reclaimed in a double buffering pattern, ensuring there's always a complete snapshot available for recovery. The filer hardware also has features dedicated to operating as a high availability pair with nearly instant failover.
The netapp appliances weren't/aren't perfect, but they are far better than you're assuming. They were designed to run 24/7/365 on workloads close to the hardware bandwidth limits. For most of the 2000's, buying a pair of netapps was a simple way to just not have issues hosting files reliably.
Perhaps you should take your own advice and dial back the arrogance a bit.
Unless the file IO is done correctly, even the best file system won't save you from data loss, such as the kind that can result from sudden power failure, which is what the article talks about.
PG obviously thinks RDBMSs are just unnecessary middlemen between you and the same file system. He doesn't realize that even if they ultimately use the same file system you do, they likely don't use it the way you do. Maybe Viaweb used something like the last snippet of code in the article, but I doubt it.
I was curious why pg did it that way. Here's a brief comment from him:
>What did Viaweb use?
>pg 3160 days ago
>Keep everything in memory in the usual sort of data structures (e.g. hash tables). Save changes to disk, but never read from disk except at startup.
So similar to what Redis does today but a decade before Redis and likely faster than the databases of the day. Could have been important with loads of users editing their stores on slow hardware. Anyway it worked, they beat their 20 odd competitors and got rich. I'm sceptical that it was a poor design choice.
I downvoted you for lowering the tone of the conversation with personal insults. If you had just said something like 'note that PG's advice about using the Unix file system as a database is not now considered best practice,' that would have been fine.
Your comment begs the question: Do you fall for the marketing hype? And if not, then do you think you should keep quiet about stuff that works?
At the time, IMHO, PG was indeed smart to be one of the few using FreeBSD as opposed to whatever the majority were using.
But he has admitted they struggled with setting up RAID. They were probably not too experienced with FreeBSD. I am sure they had their fair share of troubles.
PG's essays and his taste in software are great and the software he writes may be elegant, but that does not necessarily mean it is robust.
Best filesystem I have experienced on UNIX is tmpfs. Backing up to permanent storage is still error-prone, even in 2015.
All of this is great, except the first two sentences:
> I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox.
If I were to get so many emails that they corrupted my mailbox, I'd first ask myself why, and how to stop that.
I wouldn't. If your daily workflow includes a high volume of email, then it does.
What I would ask is: is there a way I can solve this problem without having to totally rearrange how I use email? I'm curious whether the author looked at, say, running a personal MDA that served mail over IMAP, so that it could be interacted with via a desktop email client, without requiring that client to serve as the point of truth. Not to say that corruption couldn't still happen that way, but Thunderbird (for example) can be configured to store only a subset of messages locally, or none at all. With a reasonably fast connection to the MDA, this seems like a possibly workable solution.
> If your daily workflow includes a high volume of email, then it does.
Not necessarily. For example, if you have monitoring software sending you email notifications for everything, you could change it to just write records to a database instead.
Pretty much spot on. Local in-kernel file systems are hard, partly because of their essential nature and partly because of their history. A lot of the codebases involved still show their origins on single-core systems and pre-NCQ SATA disks, and the development/testing methods are from the same era. The developers always have time to improve the numbers on some ancient micro-benchmark, but new features often get pushed to the LVM layer (snapshots), languish for ages (unions/overlays), or are simply ignored (anything alternative to the fsync sledgehammer).
The only way a distributed file system such as I work on can provide sane behavior and decent performance to our users is to use local file systems only for course-grain space allocation and caching. Sometimes those magic incantations from ten-year-old LKML posts don't really work, because they were never really tested for more than a couple of simple cases. Other times they have unexpected impacts on performance or space consumption. Usually it's easier and/or safer just to do as much as possible ourselves. Databases - both local and distributed - are in pretty much the same boat.
Some day, I hope, all of this scattered and repeated effort will be combined into a common library that Does All The Right Things (which change over time) and adds features with a common API. It's not quite as good as if the stuff in the kernel had been done right, but I think it's the best we can hope for at this point.
What distributed filesystem do you work on?
I'm on the Gluster team.
1 reply →
AIUI, ZFS was explicitly designed to deal with this sort of data corruption - one of the descriptions of the design I've heard is "read() will return either the contents of a previous successful write() or an error". That would (in principle) prevent the file containing "a boo" or "a far" at any point.
It looks like one of the authors cited in this article has written a paper analysing ZFS - though they admittedly don't test its behaviour on crashes. Citation here, in PDF form:
http://pages.cs.wisc.edu/~kadav/zfs/zfsrel.pdf
(edited to add: This only deals with the second part of this article. The first part would still be important even on ZFS)
Right, Copy-On-Write filesystems (ZFS, Bttr) are explicitly designed to prevent that kind of corruption by never editing blocks in place, but rather copying the contents to a new block and using a journaled metadata update to point the file at it's new block.
ZFS also includes features around checksumming of the metadata. "Silent" write errors become loud the next time data is accessed and the checksums don't match. This can't prevent all errors, but has some very nice data integrity properties - Combined with it's RAID format, you can likely recover from most any failures, and with RAIDZ2, you can recover from a scattered failures on all drives even if one drive has completely died. This is actually fairly common - Modern drives are very large, and rust is more susceptible to 'cosmic rays' than one might think.
There is an easy way to write data without corruption. First copy your file-to-be-changed as a temporary file or create a temporary file. Then modify the temporary file and write whatever you want in it. Finally, use rename() to atomically replace the old file by the temporary one.
The same logic also apply to directories, although you will have to use links or symlinks to have something really atomic.
It may not work on strangely configured systems, like if your files are spread over different devices over the network (or maybe with NFS). But in those cases you will be able to detect it if you catch errors of rename() and co (and you should catch them of course). So no silver bullet here, but still a good shot.
I'm surprised rename() wasn't mentioned in the article, it's a well known technique to atomically update a file, which is very practical for small-ish files.
Note that in the general case, you should fsync() the temporary file before you rename() it over the original - but ext3 and ext4 in writeback mode added a heuristic to do that automatically, because ext3 in the default ordered mode would effectively do that and many applications came to assume it.
rename is atomic, but it is not guaranteed to be durable. In order for rename to be durable, I've learned that you have to fsync the parent directory.
I was saddened when I learned this. I used to use this old trick for a lot of my systems. I learned it from reading the code of old well-written unix systems programs.
I guess that's not viable if your files are huge monolithic databases.
It also doesn't work if you want to support a non-trivial update rate, or if there's any possibility of contention with another process trying to do the same thing. It's the sort of thing that app writers get addicted to, because it does work in the super-easy case, but it doesn't help at all when you need to solve harder storage problems than updating a single config file once in a while.
"How is it that desktop mail clients are less reliable than gmail...?"
Made me chuckle. I've been told off by a former Googler colleague enough times now to have learned that Gmail is more complex than anyone imagines on a first guess, in order to be "reliable".
It is certainly the google service that I use the most. In a decade of quite heavy usage I remember one outage of a 1-2 hours (with no data loss). To me this is the gold standard that the rest of us should aspire to. :)
Lately (last year or so) I've started to notice substantial data loss. Either old mails completely missing or large mails being truncated (destroying inline images f.ex.)
So to anyone relying on gmail for safe keeping: Don't.
9 replies →
Interesting that none of the cited software uses maildir.
Breaking a mbox is an extremely simple thing, as the format leaves no possibility of error checking, parallel writing, rewriting lost data, or anything else.
Outlook's mail folders are marginally better, allowing for error detection, but really, that's a lame first paragraph for introducing a great article.
I've recently been playing with nbdkit, which is basically FUSE but for block devices rather than file systems.
I was shocked to discover that mke2fs doesn't check the return value of its final fsync call. This is compounded by the fact that pwrite calls don't fail across NBD (the writes are cached, so the caller's stack is long gone by the time the get flushed across the network and fails...)
As a test, I created an nbdkit plugin which simply throws away every write. Guess what? mke2fs will happily create a file system on such a block device and not report failure. You only discover a problem when you try to mount the file system.
The article's table of filesystem semantics is missing at least one X: Appends on ext4-ordered are not atomic. When you append to a file, the file size (metadata) and content (data) must both be updated. Metadata is flushed every 5 seconds or so, data can sit in cache for more like 30. So the file size update may hit disk 25s before the data does, and if you crash during that time, then on recovery you'll find the data has a bunch of zero bytes appended instead of what you expected.
(I like to test my storage code by running it on top of a network block device and then simulating power failure by cutting the connection randomly. This is one problem I found while doing that.)
Wow, 5 and 30 seconds before metadata and data flush? It sounds unbelievably long. If it's true, almost every power loss results in a data loss of whatever was written in the last 15 seconds, on average? Is it so bad?
I'd expect more "smartness" of Linux, like, as soon as there is no "write pressure" to flush earlier.
> If it's true, almost every power loss results in a data loss of whatever was written in the last 15 seconds, on average? Is it so bad?
No, because correct programs use sync() and/or fsync() to force timely flushes.
A good database should not reply successfully to a write request until the write has been fully flushed to disk, so that an "acknowledged write" can never be lost. Also, it should perform write and sync operations in such a sequence that it cannot be left in a state where it is unable to recover -- that is, if a power outage happens during the transaction, then on recovery the database is either able to complete the transaction or undo it.
The basic way to accomplish this is to use a journal: each transaction is first appended to the journal and the journal synced to disk. Once the transaction is fully on-disk in the journal, then the database knows that it cannot "forget" the transaction, so it can reply successfully and work on updating the "real" data at its leisure.
Of course, if you're working with something that is not a database, then who knows whether it syncs correctly. (For that matter, even if you're working with a database, many have been known to get it wrong, sometimes intentionally in the name of performance. Be sure to read the docs.)
For traditional desktop apps that load and save whole individual files at a time, the "write to temporary then rename" approach should generally get the job done (technically you're supposed to fsync() between writing and renaming, but many filesystems now do this implicitly). For anything more complicated, use sqlite or a full database.
> I'd expect more "smartness" of Linux, like, as soon as there is no "write pressure" to flush earlier.
Well, this would only mask bugs, not fix them -- it would narrow the window during which a failure causes loss. Meanwhile it would really harm performance in a few ways.
When writing a large file to disk sequentially, the filesystem often doesn't know in advance how much you're going to write, but it cannot make a good decision on where to put the file until it knows how big it will be. So filesystems implement "delayed allocation", where they don't actually decide where to put the file until they are forced to flush it. The longer the flush time, the better. If we're talking about a large file transfer, the file is probably useless if it isn't fully downloaded yet, so flushing it proactively would be pointless.
Also flushing small writes rather than batching might mean continuously rewriting the same sector (terrible for SSDs!) or consuming bandwidth to a network drive that is shared with other clients. Etc.
1 reply →
This problem can be fixed. We need to rethink file system semantics. Here's an approach:
Files are of one of the following types:
Unit files
For a unit file, the unit of consistency is the entire file. Unit files can be created or replaced, but not modified. Opening a unit file for writing means creating a new file. When the new file is closed successfully, the new version replaces the old version atomically. If anything goes wrong, including a system crash, between create and successful close, including program abort, the old version remains and the new version is deleted. File systems are required to maintain that guarantee.
Opens for read while updating is in progress reference the old version. Thus, all readers always see a consistent version.
They're never modified in place once written. It's easy for a file system to implement unit file semantics. The file system can cache or defer writes. There's no need to journal. The main requirement is that the close operation must block until all writes have committed to disk, then return a success status only if nothing went wrong.
In practice, most files are unit files. Much effort goes into trying to get unit file semantics - ".part" files, elaborate file renaming rituals to try to get an atomic rename (different for each OS and file system), and such. It would be easier to just provide unit file semantics. That's usually what you want.
Log files
Log files are append-only. The unit of consistency is one write. The file system is required to guarantee that, after a program abort or crash, the file will end cleanly at the end of a write. A "fsync" type operation adds the guarantee that the file is consistent to the last write. A log file can be read while being written if opened read-only. Readers can seek, but writers cannot. Append is always at the end of the file, even if multiple processes are writing the same file.
This, of course, is what you want for log files.
Temporary files
Temporary files disappear in a crash. There's no journaling or recovery. Random read/write access is allowed. You're guaranteed that after a crash, they're gone.
Managed files
Managed files are for databases and programs that care about exactly when data is committed. A "write" API is provided which returns a status when the write is accepted, and then makes an asynchronous callback when the write is committed and safely on disk. This allows the database program to know which operations the file system has completed, but doesn't impose an ordering restriction on the file system.
This is what a database implementor needs - solid info about if and when a write has committed. If writes have to be ordered, the database program can wait for the first write to be committed before starting the second one. If something goes wrong after a write request was submitted, the caller gets status info in the callback.
This would be much better than the present situation of trying to decide when a call to "fsync" is necessary. It's less restrictive in terms of synchronization - "fsync" waits for everything to commit, which is often more than is needed just to finish one operation.
This could be retrofitted to POSIX-type systems. If you start with "creat" and "O_CREAT" you get a unit file by default, unless you specify "O_APPEND", in which case you get a log file. Files in /tmp are temporary files. Managed files have to be created with some new flag. Only a small number of programs use managed files, and they usually know who they are.
This would solve most of the problems mentioned in the original post.
I worked at a storage company, and the scariest thing I learned is that your data can be corrupt even though the drive itself says the data was written correctly. The only way to really be sure is to read your files back after writing them and check that they match. Now whenever I do a backup, I always go through it one more time and do a byte-by-byte comparison before being assured that it's okay.
This is true. Which is why we really, really need checksummed filesystems. I am very worried that this hasn't made its way into mainstream computing yet, especially given the growing drive sizes and massive CPU speed increases.
I run a 10x3TB ZFS raidz2 array at home. I've seen 18 checksum errors at the device level in the last year - these are corruption from the device that ZFS detected with a checksum, and was able to correct using redundancy. If you're not checksumming at some level in your system, you should be outsourcing your storage to someone else; consumer level hardware with commodity file systems isn't good enough.
Fortunately, ZFS on Linux is excellent, and is a two-liner on modern Ubuntu LTS (add the PPA, install zfs).
This is assuming that the underlying block device would forcibly flush those queued writes to disk and then re-read them, rather than just serving them up from the pending write queue without flushing them first.
You generally can't make that assumption about a black box, so reading back your writes guarantees nothing.
Unless you're intimately familiar with your underlying block device you really can't guarantee anything about writes going to physical hardware. All you can do is read its documentation and hope for the best.
If you need a general hack that's pretty much guaranteed to flush your writes to a physical disk, it would be something like the sketch below:
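Presumably something along these lines: fsync the data you care about, then push a pile of filler through the device in the hope of draining its write cache. The 256 MiB filler size is an arbitrary guess about cache sizes, and flush_hack is just a name for the sketch.

    /* Guess at the hack: after fsync()ing the real data, write and fsync a
     * filler file bigger than the drive's (assumed) write cache, hoping the
     * device is forced to destage everything, then throw the filler away. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int flush_hack(int datafd, const char *filler_path)
    {
        if (fsync(datafd) != 0)
            return -1;

        int ffd = open(filler_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (ffd < 0)
            return -1;

        static char junk[64 * 1024];
        memset(junk, 0xAA, sizeof junk);
        for (int i = 0; i < 4096; i++)           /* ~256 MiB of filler */
            if (write(ffd, junk, sizeof junk) != (ssize_t)sizeof junk) {
                close(ffd);
                return -1;
            }

        int rc = fsync(ffd);
        close(ffd);
        unlink(filler_path);
        return rc;
    }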
Even then you have no guarantees that those writes wouldn't be flushed to the medium while leaving the writes you care about in the block device's internal memory.
This is why end-to-end data integrity with something like T10-PI is a necessity. The kernel block-layer already generates and validates the integrity for us, if the underlying drive supports it, but all major filesystems really need to start supporting it as well.
I don't think that's a necessity for all workflows. Just think about it, that would require all of us buying enterprise 520 or 528 byte sector drives to store the extra checksum information, and a whole new API up to the application level to confirm, point to point, that the data in the app is the data on the drive on writes, and the data on the drive is the data in the app on reads. It's not like T10/PI comes for free just by doing any one thing, it implies changes throughout the chain.
Great write-up, and it probably explains some issues in my apps a while back. I like that my long-time favorite, XFS, totally kicks ass in the comparisons. I agree on using static analysis and other techniques, as I regularly push that here. What threw me is this line:
"that when they came up with threads, locks, and conditional variables at PARC, they thought that they were creating a programming model that anyone could use, but that there’s now decades of evidence that they were wrong. We’ve accumulated a lot of evidence that humans are very bad at reasoning at these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems."
There were quite a few ways of organizing systems, including objects and functions, that the old guard came up with. UNIX's popularity, and the organization style of some others, pushed the file-oriented approach from mainstream into outright dominance. However, many of us long argued it was a bad idea, and alternatives exist that just need work on the implementation side. We actually saw some of those old ideas put into practice in data warehousing, NoSQL, "the cloud," and so on. We just have to do more, as there's no technical reason for putting up with the nonsense in the write-up: just a reluctance to get our hands dirty with the replacements.
I think you'll find this an interesting read then: http://teh.entar.net/~nick/mail/why-reiserfs-is-teh-sukc
It was written in 2004, so I don't know how current it is, but it makes the point that XFS makes certain performance and safety guarantees essentially assuming that you're running on hardware that has a UPS with the ability to interrupt the OS saying "oops, we're out of power".
It was designed by SGI for high-end workstations and supercomputers with long-running jobs (especially render farms), so that doesn't surprise me. However, it's nice to have all the assumptions in the open, preferably in user/admin guides. Another issue was it zeroing out data on occasion, but they fixed that.
2004 is not current for XFS; that is a decade ago! However, disks finishing writes and not lying about having done so is a critical need for every file system. With some, like ext3, you would notice it less, as it was flush-happy.
XFS is becoming the sane default filesystem for servers, as it allocates nodes more consistently than the other current mainstream Linux options on multi-disk systems. Basically, small servers now have more disk space and performance than the large systems of 2004. So XFS stayed put in where it starts to make sense, but systems grew into its sweet spot much more often.
In plan9 mailbox files, like many others, are append only.
All files are periodically (by default, daily) written to a block-coalescing WORM drive, and you can rewind the state of the file system to any date on a per-process basis, which is handy for diffing your codebase, etc.
For a while, removing the "rm" command was considered, to underline the permanence of files, but keeping the ability to remove temporary data during the daytime hours turned out to be more pragmatic.
How does Plan 9 deal with the equivalent of this append-only pattern, which on Unix can cause corruption if you have multiple writers and the writes are larger than PIPE_BUF (4k by default on Linux)?
Most users of this pattern (concurrent updates to log files) get away with it because their updates are smaller than 4k, but if you're trying to write something as big as an E-Mail with this pattern you can trivially get interleaved writes resulting in corruption.
Exclusive locking
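Spelled out, one way to do that is an advisory flock(2) held around each append; a minimal sketch, assuming every writer cooperates (not necessarily how Plan 9 itself handles it):

    /* Sketch: exclusive advisory lock around an O_APPEND write.  Works only
     * if all writers take the lock; a non-cooperating writer can still
     * interleave its data. */
    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    static ssize_t locked_append(const char *path, const void *msg, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return -1;
        if (flock(fd, LOCK_EX) != 0) {     /* blocks until other writers finish */
            close(fd);
            return -1;
        }
        ssize_t n = write(fd, msg, len);   /* O_APPEND: lands at end of file */
        flock(fd, LOCK_UN);
        close(fd);
        return n;
    }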
Surely filesystems are going to go through a massive change when SSDs push standard spinning disks into the history books? They must carry a lot of baggage for dealing with actual spinning disks, much of which is just overhead for super-fast solid state drives. Hopefully this will allow interesting features not possible on spinning disks, like better atomic operations.
"IotaFS: Exploring File System Optimizations for SSDs"
Our hypothesis in beginning this research was simply that the complex optimizations applied in current file system technology doesn’t carry over to SSDs given such dramatic changes in performance characteristics. To explore this hypothesis, we created a very simple file system research vehicle, IotaFS, based on the incredibly simple and small Minix file system, and found that with a few modifications we were able to achieve comparable performance to modern file systems including Ext3 and ReiserFS, without being nearly as complex.
http://web.stanford.edu/~jdellit/default_files/iotafs.pdf
Yeah btrfs 'ssd' mount option does less for the same reasoning, but still does include checksums for metadata and data because SSDs have at least as much likelihood of non-deterministically returning your data as spinning rust. So even if it doesn't fix the corruptions (which requires additional copies or parity), at least there's an independent way of being informed of problems.
I wonder how this approach (single file + log) compares to the other usual approach (write second file, move over first):
1. Write the changed data into a temporary file in the same directory (don't touch the original file)
2. Move new file over old file
Does this lead to a simpler strategy that is easier to reason about, where it is less likely for a programmer to get it wrong? At least I see this strategy being applied more often than the "single file + log" approach.
The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place.
That's not guaranteed to work in the face of crashes. The problem is that the directory update could get flushed to disk before the file data.
This is the fundamental problem: when you allow the OS (or the compiler, or the CPU) to re-order operations in the name of efficiency, you lose control over intermediate states, and so you cannot guarantee that those intermediate states are consistent with respect to the semantics you care about. And this is the even more fundamental problem: our entire programming methodology has revolved around describing what we want the computer to do rather than what we want to achieve. Optimizations then have to reverse-engineer our instructions and make their best guesses as to what we really meant (e.g. "This is dead code. It cannot possibly affect the end result. Therefore it can be safely eliminated."). Sometimes (often?) those guesses are wrong. When they are, we typically only find out about it after the mismatch between our expectations and the computer's has manifested itself in some undesirable (and often unrecoverable) way.
"That's not guaranteed to work in the face of crashes. The problem is that the directory update could get flushed to disk before the file data."
No, it can work, provided that the temporary file is fsynced before being renamed, the parent directory is fsynced after renaming the file, and that the application only considers the rename to have taken place after the parent directory is fsynced (not after the rename call itself).
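In code, that sequence looks roughly like the sketch below. The paths are placeholders, the temporary file must live in the same directory as the target, and error handling is kept minimal:

    /* Write temp file, fsync it, rename over the target, fsync the parent
     * directory.  Only after the directory fsync succeeds should the
     * application consider the replacement to have happened. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int atomic_replace(const char *dir, const char *tmp_path,
                              const char *final_path,
                              const void *buf, size_t len)
    {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp_path);
            return -1;                     /* data never reached disk */
        }
        close(fd);

        if (rename(tmp_path, final_path) != 0) {   /* atomic swap of names */
            unlink(tmp_path);
            return -1;
        }

        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return -1;
        int rc = fsync(dfd);               /* make the rename itself durable */
        close(dfd);
        return rc;
    }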
Good summary of the situation. It's why I fought out-of-order execution at hardware and OS levels as much as I could. Even went out of way to use processors that didn't do it. Market pushed opposite stuff into dominance. Then came the inevitable software re-writes to get predictability and integrity out of what shouldn't have problems in the first place. It's all ridiculous.
It bugged me that Sublime Text used to do these so-called atomic saves by default since it screwed with basic unix expectations like fstat and fseek meaningfully working (like a tail -f implementation could boil down to[0]). A concurrently running process using those calls would be off in lala-land as soon as the text file was modified and saved: it would never pick up any modifications, because it and the editor weren't even dealing with the same file any more.
[0] Here's follow(1) in my homemade PL:
GNU tail lets you say "--follow=descriptor" to follow the content of the file no matter how it gets renamed, or "--follow=name" to re-open the same filename when a different file gets renamed onto it.
Functional versus imperative concurrent shared data approaches provide a good analogy:
* Single file + log: fine grained locking in a shared C data structure. Yuck!
* Write new then move: transactional reference to a shared storage location, something like Clojure's refs. Easy enough.
The latter clearly provides the properties we'd like, the former may but it's a lot more complicated to verify and there are tons of corner cases. So I think move new file over old file is the simpler strategy and way easier to reason about.
The obvious downside is that this temporarily uses twice the size of the dataset. However, that is usually mitigated by splitting the data into multiple files, and/or applying this only to applications that don't need to store gigabytes in the first place.
Clojure's approach again provides an interesting solution to saving space. Taking the idea of splitting data into multiple files to the logical conclusion, you end up with the structure sharing used in efficient immutable data structures.
Your solution is slower; also, you need to fsync()/fdatasync() the new file before moving, at least on some systems (http://lwn.net/Articles/322823/ is the best reference I can find right now), and you need to fsync() the directory if you wish the rename to be durable (as opposed to just all-or-nothing.)
In general this approach should work fine, but the devil is in the details: 1. You have to flush the change to the temporary file before the move, because otherwise you may get an empty file: the OS may reorder the move and the writes. 2. After the move you have to flush the parent directory of the destination file on POSIX. Windows has a special flag for MoveFileEx() to ensure the operation is done, or you have to call FlushFileBuffers() on the destination file.
The linked paper mentions that many popular programs forget about (1).
The article only mentions Linux/POSIX systems; are the same problems also present on Windows/NTFS? I was under the impression that, for example, renames on NTFS are crash-safe and atomic, which would make the "write temp file then rename to target" approach work even if the power is cut.
NTFS is not very safe. No data integrity checksums. I think it's at about the same level as ext4, meaning not very good. One shouldn't trust any critical data to NTFS without checksums and duplication.
NTFS consistency checks and recovery are pretty good. But they won't bring your data back.
Microsoft's ReFS (Resilient File System) might give storage reliability to Windows one day. On Linux you can use ZFS or btrfs (with some reservations) today.
https://en.wikipedia.org/wiki/ReFS
> ...work even if the power is cut?
If power is cut during NTFS logfile update, all bets are off. Hard disks can and will do weird things when losing power unexpectedly. That includes writing incorrect data, corrupting nearby blocks, corrupting any blocks, etc. That includes earlier logfile data, including checkpoints.
The article makes me wonder whether there's enough abstraction being done via the VFS layer, because all this fsync business that application developers seem to have to do can be so workload and file system specific. And I think that's asking too much from application developers. You might have to fsync the parent dir? That's annoying.
I wonder if the article and the papers it's based on account for how the VFS actually behaves, and whether someone wanting to do more research in this area could investigate this while accounting for the recent VFS changes. On Linux in particular I think this is a bigger issue, because there are so many file systems the user or distro could have picked, totally unbeknownst to and outside the control of the app developer.
That's definitely asking too much of app developers. Every time someone complains about any of this, the filesystem developers come back with a bit of lore about a non-obvious combination of renameat (yes that's a real thing) and fsync on the parent directory, or some particular flavor of fallocate, or just use AIO and manage queues yourself, or whatever depending on exactly which bogus behavior you're trying to work around. At best it's an unnecessary PITA. At worst it doesn't even do what they claimed, so now you've wasted even more time. Most often it's just non-portable (I'm so sick of hearing about XFS-specific ioctls as the solution to everything) or performs abominably because of fsync entanglement or some other nonsense.
We have libraries to implement "best practices" for network I/O, portable across systems that use poll or epoll or kqueues with best performance on each etc. What we need is the same thing for file I/O. However imperfect it might be, it would be better than where we are now.
Very rudimentary, but: a way for an application developer to specify categories of performance/safety trade-off for operations. An app developer might have a simple app that only cares about performance or only cares about safety; there'd be a default in between. Another app developer might have mixed needs depending on the type of data the app is generating. But in this way, if they write the app with category A (let's say that means highest safety at the expense of performance) and their benchmarking determines this is crap, and they have to go to category B for writes, that's a simpler change than going back through their code and refactoring a pile of fsyncs or FUA writes.
I mean, I thought this was a major reason for the VFS abstraction between the application and the kernel anyway. It's also an example of the distinction between open source and free (libre). If as an application developer you have to know such esoterica to sanely optimize, you in fact aren't really free to do what you want. You have to go down a particular rabbit hole and optimize for that fs, at the expense of others. That's not a fair choice to have to make.
The inherent issue is that there's a huge performance benefit to be gained by batching updates. FS safety will always come at the cost of performance.
The article doesn't say, but I suspect most of the issues it mentions can be mitigated by mounting with the "sync" and "dirsync" options; that absolutely kills performance, though.
The APIs involved could definitely be friendlier, but the app developer is using an API that's explicitly performance oriented by default at the cost of safety, and needs to opt-in to get safer writes. Whether the default should be the other way around is one matter, but ultimately someone has to pick which one they want and live with the consequences.
One of the naive assumptions that most of us make is that if there's a power failure none of the data that was fsynced successfully before will be corrupted.
Unfortunately, this is not the case for SSDs.
This issue is completely fixed by Maildir and was many years ago. Many clients, including Mutt, for example, support Maildir boxes
One thing I've always wanted to try, and never had time for, is to build SQLite as a kernel module, talking directly to the block device layer, and then implement a POSIX file system on top of it.
It wouldn't solve problems with the block device layer itself, but it'd be interesting to see how robust and/or fast it was in real life.
SQLite's still too big of a pig, and still unreliable. But we can do LMDB in-kernel, and an LMDBfs is in the works. Based on BDBfs.
http://www.fsl.cs.sunysb.edu/docs/kbdbfs-msthesis/
SQLite itself relies on a filesystem for the rollback journal or write-ahead log. So you'd need some kind of abstraction between the block device and SQLite. Might as well just use the existing filesystem and keep SQLite in user space, since that works.
cough
http://www.sqlite.org/src/doc/trunk/src/test_onefile.c
This article sheds some light on a problem I've had for years:
The software product I develop stores its data in a local database using H2, stored in a single file. Users tend to have databases gigabytes in size.
After reading this article, I start to understand why my users occasionally get corrupted databases. A hard problem to solve.
An easy problem to solve. LMDB never corrupts.
While I experienced this pain first-hand, I'm not sure the FS deserves 100% of the blame. There is enough blame to go around for userspace, filesystems, the block device layer, disk controllers, hardware, and firmware.
If you start at the bottom of the stack you have all sorts of issues with drives, their firmware and controllers doing their own levels of caching & re-ordering. Over the years the upper layer developers (block layer / fs) had to deal with hardware that simply lies about what happened or is just plain buggy.
It appears that SQLite could be a good basis for a decrapified filesystem.
Then you didn't read the article closely. SQLite has plenty of crash vulnerabilities.
I just saw zeros in the tables.
I don't program much in C, or use direct system calls for files. Mostly I use Java.
Does anyone know if any of this applies to Java's IO operations? I'm sure you can force some of this behaviour, but for instance: will the flush method on OutputStream ensure a proper sync, or is that again dependent on the OS and file system, as described in the article for syscalls?
This answers your question:
http://docs.oracle.com/javase/8/docs/api/java/io/OutputStrea...
It's only logical if you think about the bigger picture. Does Java have access to the underlying disk device or does it work with the filesystem? Which component is responsible for the filesystem?
You can force writes to disk with NIO, but I don't think that really solves any of the problems detailed in this article.
I do agree with you, this is my story about the usual stuff: http://www.sami-lehtinen.net/blog/chaos-monkey-bit-me-a-shor... As well as this classic from SQLite3 guys: https://www.sqlite.org/howtocorrupt.html - As you can see, there are multiple ways to ruin your day.
Ah, fond memories of async ext2 corrupting root filesystems beyond recognition... I think we 'invented' 'disposable infrastructure' back in 2002 because the filesystem forced us to... The MegaRAID cards that would eat their configuration along with all your data didn't help either.
Can't remember if we switched to ext3 in ordered or data-journaled mode but it made an immense difference...
IIRC I have seen this discussion before, and the answer was: do an fsync. But for the sake of performance, we want to be able to issue write barriers to a group of file handles, so we know the commands will be run in order, and in no other order.
Correct, we want to be able to write in deterministic order. SCSI has supported this natively for decades. Unfortunately SATA doesn't, and the Linux kernel pretty much doesn't, because it can't rely on the storage devices to support it.
Which is just lame; it's mostly because the block layer doesn't support file-based fencing. A mistake made a decade and a half ago, and no one has the will/political power to fix it.
If the block layer supported it, solving the problem of fencing an ATA device would be as simple as issuing a whole device flush instead of a SYNC_CACHE with range. Which for soft raid devices would make a huge impact because only the devices with dirty data need to be flushed.
Of course the excuse today, is that most scsi devices can't do range based sync either, and just fall back to whole device because no one uses the command. Chicken/egg.
Isn't block level duplication/checksumming like RAID supposed to solve this hardware unreliability? I understand that by default RAID is not used on end user desktops.
AFAIK, most RAID systems do not have checksums. I may be wrong, but I think RAID5/RAID6 even amplifies error frequency.
It gets more "fun" when you consider many (most? all?) hard disks can get corrupted without checksum failures.
Layer "violators" like ZFS and btrfs do have checksums.
Maybe conventional block / filesystem layering itself is faulty.
The layers that ZFS violates were created years before the failure modes that filesystem-based checksums address were well understood. I'm not sure how you _can_ solve these issues without violating the old layers.
In particular: checksumming blocks alongside the blocks themselves (as some controllers and logical volume managers do) handles corruption within blocks, but it cannot catch dropped or misdirected writes. You need the checksum stored elsewhere, where you have some idea what the data _should_ be. Once we (as an industry) learned that those failure modes do happen and are important to address, the old layers no longer made sense. (The claim about ZFS is misleading: ZFS _is_ thoughtfully layered -- it's just that the layers are different, and more appropriate given the better understanding of filesystem failure modes that people had when it was developed.)
Dragonflybsd's HAMMER is a non layer "violator" with checksums (not that I mind the violations, they're great).
Good article. You have a typo: looks like you're using Pandoc or something similar, and left out the closing parenthesis in the link after [undo log].
Yeah, that's a common error among Markdown writers. It's too easy to forget the last bracket, especially if you are putting a link inside a parenthetical comment in the first place.
Fortunately, it's easy to detect programmatically. I have a little shell script which flags problems in my Markdown files: http://gwern.net/markdown-lint.sh
In this case, you can pipe Pandoc's output through elinks -dump to get a rendered text version, and then simply grep the plain text for patterns like "-e '(http' -e ')http' -e '[http' -e ']http'".
For me, Vim's highlighting and concealing prevents this class of typos. It also makes it more pleasant to read the source, as it hides the links unless I'm editing the line.
Does anyone know if there's a recording of the usenix talk given? (Referenced slides)
Edit: Found it. https://youtube.com/watch?v=SVYegdh2CbE
This is an interesting article but the examples given are seriously dated:
Mail servers and clients (MTAs and MUAs) have been using Maildir format to escape this problem since 1996.
Filesystems have evolved. ZFS has been available for ten years.
If you take your data seriously, you do need to make informed choices. But this article isn't targeted at people who won't know about Maildir and ZFS.
The section about relative data loss failures between various applications is great. Again, careful choices: SQLite and Postgres.
Wasn't part of the need for Maildir indexability rather than reliability?
It provides both. mbox was a known evil 20 years ago, so it's a bad example for the article.
The qmail spec describes queuing and delivery in a filesystem-safe manner.
> Filesystems have evolved. ZFS has been available for ten years.
Perhaps you should examine the section discussing the various problems with file systems. ZFS is hardly immune from these problems.
Perhaps you should take a closer look. ZFS wasn't part of that discussion, and none of the file systems which were discussed have ZFS's feature set.
The article is a troll with some good information.
I've been a heavy mail user for years... I've never encountered data loss due to a file system problem, and honestly I can't think of a time in the last decade when anyone I'm acquainted with has. (And I ran a very large mail system for a long time.)
Hell, I've been using OSX predominately for years now, and that garbage file system hasn't eaten any data yet!
There are problems, even fundamental problems, but if someone is literally unable to use any mail client, you need to look at the user before the file system.
I think the reason you're seeing "that garbage file system" not eating your data has a lot to do with the absence of crashes or power losses. It has also evolved a good deal since the HFS and even HFS+ days; no one uses either of those anymore. It's all HFSJ, with a scant number using HFSX.
20 years ago Mac OS crashed often, and had a file system not designed to account for that. OS X even shipped with non-journaled HFS+. It was only into the 3rd major release of OS X that journaling appeared. Data corruptions, I feel, dropped massively, because the OS didn't crash nearly as often, but did still crash. In the last 4-5 years I'd say I get maybe one or two kernel panics per year on OS X, which is a lot less than I get on Linux desktops. But even still on Linux desktops, I can't say when I've seen file system corruption not attributable to pre-existing hardware issues.
Cough, cough. MH.
What, you don't think MH was doing better file handling than mbox, including most of maildir's techniques, a decade before maildir?
PG was quite proud of having just used the file system as a database with Viaweb, claiming that "The Unix file system is pretty good at not losing your data, especially if you put the files on a Netapp." His entire response is worth reading in light of the above article, if only to get a taste of how simultaneously arrogant and clueless PG can be: http://www.paulgraham.com/vwfaq.html
PG and company still think this is a great idea, because the software this forum runs on apparently also does the file system-as-database thing.
No wonder Viaweb was rewritten in C++.
EDIT: to those downvoting, my characterization of PG's response is correct. Its arrogance is undeniable, with its air of "look how smart I am; I see through the marketing hype everyone else falls for," as is its cluelessness, with PG advocating practices that will cause data loss. Viaweb was probably a buggy, unstable mess that needed a rewrite.
You're spending your Sunday ragging on someone with a throwaway and being hypervigilant about downvotes on your throwaway. Get a grip, mate.
being hypervigilant about downvotes on your throwaway
Well, FWIW, it's working. At this instant his total karma is +10, he only has 2 posts, and the 2nd post is slightly greyed. So that means that his original comment is now in the range of +10.
Sadly, the fact that I'm replying here probably means that I need to get a life!
It's clear you're ignorant of the capabilities netapp filers had, even back then.
WAFL uses copy on write trees uniformly for all data and metadata. No currently live block is ever modified in place. By default a checkpoint/snapshot is generated every 10 seconds. NFS operations since the last checkpoint are written into NVRAM as a logical recovery log. Checkpoints are advanced and reclaimed in a double buffering pattern, ensuring there's always a complete snapshot available for recovery. The filer hardware also has features dedicated to operating as a high availability pair with nearly instant failover.
The netapp appliances weren't/aren't perfect, but they are far better than you're assuming. They were designed to run 24/7/365 on workloads close to the hardware bandwidth limits. For most of the 2000's, buying a pair of netapps was a simple way to just not have issues hosting files reliably.
Perhaps you should take your own advice and dial back the arrogance a bit.
> No wonder Viaweb was rewritten in C++
Or maybe this massive non-sequitur is reason enough for downvotes. Also note this from the link you provided:
> But when we were getting bought by Yahoo, we found that they also just stored everything in files
Trusting Netapp is way better than 99% of status quo. It has had ZFS style integrity for a pretty long time.
I think PG was pretty much right in his judgement. Any file system is going to be pretty good on a reliable block store, such as Netapp.
Unless the file IO is done correctly, even the best file system won't save you from data loss, such as the kind that can result from sudden power failure, which is what the article talks about.
PG obviously thinks RDBMSs are just unnecessary middlemen between you and the same file system. He doesn't realize that even if they ultimately use the same file system you do, they likely don't use it the way you do. Maybe Viaweb used something like the last snippet of code in the article, but I doubt it.
>clueless
I was curious why pg did it that way. Here's a brief comment from him:
> What did Viaweb use?
> pg, 3160 days ago: Keep everything in memory in the usual sort of data structures (e.g. hash tables). Save changes to disk, but never read from disk except at startup.
So similar to what Redis does today but a decade before Redis and likely faster than the databases of the day. Could have been important with loads of users editing their stores on slow hardware. Anyway it worked, they beat their 20 odd competitors and got rich. I'm sceptical that it was a poor design choice.
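If I'm reading that right, it's the classic in-memory state plus append-only change log. A toy sketch of the pattern, where the record format, apply(), and the "store.log" name are stand-ins; whatever Viaweb actually did was obviously richer:

    /* Toy sketch: all reads are served from memory; every mutation is
     * appended to a log (optionally fsync'd); at startup the log is replayed
     * to rebuild the in-memory state.  Nothing else ever reads the disk. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int logfd = -1;

    static void apply(const char *record)
    {
        /* update the in-memory hash tables, lists, etc. */
        (void)record;
    }

    static void replay(const char *path)        /* called once, at startup */
    {
        FILE *f = fopen(path, "r");
        char line[4096];
        if (!f)
            return;                             /* empty store on first run */
        while (fgets(line, sizeof line, f))
            apply(line);
        fclose(f);
    }

    static int log_change(const char *record)   /* called on every mutation */
    {
        size_t len = strlen(record);
        if (write(logfd, record, len) != (ssize_t)len)
            return -1;
        return fsync(logfd);                    /* skip for speed, at some risk */
    }

    int main(void)
    {
        replay("store.log");
        logfd = open("store.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (logfd < 0)
            return 1;
        apply("set greeting=hello\n");          /* mutate memory... */
        return log_change("set greeting=hello\n") == 0 ? 0 : 1;  /* ...then log it */
    }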
I downvoted you for lowering the tone of the conversation with personal insults. If you had just said something like 'note that PG's advice about using the Unix file system as a database is not now considered best practice,' that would have been fine.
The article implies very little about the aptness of 'using the file system as a database' for a specific application.
Who's PG?
Paul Graham, the boss of Y Combinator.
Your comment begs the question: Do you fall for the marketing hype? And if not, then do you think you should keep quiet about stuff that works?
At the time, IMHO, PG was indeed smart to be one of the few using FreeBSD as opposed to whatever the majority were using.
But he has admitted they struggled with setting up RAID. They were probably not too experienced with FreeBSD. I am sure they had their fair share of troubles.
PG's essays and his taste in software are great and the software he writes may be elegant, but that does not necessarily mean it is robust.
Best filesystem I have experienced on UNIX is tmpfs. Backing up to permanent storage is still error-prone, even in 2015.
> At the time, IMHO, PG was indeed smart to be one of the few using FreeBSD as opposed whatever the majority were using.
Why was it a better OS choice at the time than, say, Solaris or IRIX or BSDI?
All of this is great, except the first two sentences:
> I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox.
If I were to get so many emails that they corrupted my mailbox, I'd first ask myself why, and how to stop that.
I wouldn't. If your daily workflow includes a high volume of email, then it does.
What I would ask is: is there a way I can solve this problem without having to totally rearrange how I use email? I'm curious whether the author looked at, say, running a personal MDA that served mail over IMAP, so that it could be interacted with via a desktop email client, without requiring that client to serve as the point of truth. Not to say that corruption couldn't still happen that way, but Thunderbird (for example) can be configured to store only a subset of messages locally, or none at all. With a reasonably fast connection to the MDA, this seems like a possibly workable solution.
> If your daily workflow includes a high volume of email, then it does.
Not necessarily. For example, if you have monitoring software sending you all email notifications, you could change that to just write records to a database instead.