That a database is faster than a nominal file system has been known for quite a while. It can't quite replace them, though. Microsoft tried really hard to have a database as their root filesystem, investing a lot of time and effort, but ultimately it fizzled. Why? Important edge cases (like swap) that databases handle really poorly.
That said, if you're using a file system abstraction for complex and compound documents, a database is a really stellar way to go. In part because it avoids the "chunking" problem: in a file system, allocation blocks are used for both data and metadata, so you pick a size that is least bad for both. With a database you can choose one "optimum" size for disk I/Os to keep the disk channel bandwidth highly utilized, and keep the naming/chunking part as records inside that.
I wrote a YAML <=> SQLite tool so that apps could use an efficient API to get at component types, but the file could be "exported" to pure text. It worked well for non-UTF string applications. (That was fine for the app, which was orchestrating back-end processes.) At some point it would be interesting to move it to UTF-8 to see how that worked out.
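Not the original tool, but a minimal sketch of the round trip it describes, assuming an invented `components` table and the third-party PyYAML package:

    # Hedged sketch: SQLite for efficient access, YAML for the "export to
    # pure text" side. The `components` table and its columns are invented.
    import sqlite3
    import yaml  # PyYAML

    def export_to_yaml(db_path, yaml_path):
        con = sqlite3.connect(db_path)
        con.row_factory = sqlite3.Row
        rows = [dict(r) for r in con.execute("SELECT * FROM components ORDER BY id")]
        con.close()
        with open(yaml_path, "w", encoding="utf-8") as f:
            yaml.safe_dump({"components": rows}, f, allow_unicode=True)

    def import_from_yaml(yaml_path, db_path):
        with open(yaml_path, encoding="utf-8") as f:
            doc = yaml.safe_load(f)
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS components "
                    "(id INTEGER PRIMARY KEY, name TEXT, body TEXT)")
        with con:
            con.executemany(
                "INSERT OR REPLACE INTO components (id, name, body) "
                "VALUES (:id, :name, :body)",
                doc["components"],
            )
        con.close()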
In my opinion, database-as-filesystem was a noble goal, and someone will pull it off successfully in the future. Just because MS failed at it does not mean the idea is impossible.
I think we've been stuck on 80s/90s foundational tech for too long, and there are many new ideas that will shape future computing.
BeFS “includes support for extended file attributes, with indexing and querying characteristics to provide functionality similar to that of a relational database.”
The BeOS file system, an OS geek retrospective | Ars Technica
[https://arstechnica.com/information-technology/2018/07/the-b...]
In a way, this is already happening. Documents are moving to the cloud, where they exist mostly in a database-like structure. End-user computers are becoming more and more just an interface for documents that live in an online database, or (if the app supports offline) in local SQLite databases.
The endgame is a machine with only an OS. The OS is where all the hard edge cases are, which prevented the db-as-fs idea from succeeding in the first place.
There's an interesting project that's going the other direction: they're starting with a distributed database, then building an OS of sorts atop it (not traditional posix API): https://dbos-project.github.io/
If you look at the on-disk structure of NTFS, you see that its design is rather similar to a database's. Metadata and file data are treated in the same way. Every 'special' data structure, like the record of which disk blocks are free, is actually stored as a [hidden] file.
If a file is short, the data doesn't need a whole disk block allocated to it. Likewise, if a file name is really long, that string doesn't need to be stored together with other metadata.
I'd love to use a db for a filesystem for many of my directories.
Granted, it might just be sub-optimal in certain scenarios, but I already have to logically partition my data into big blobs that are rarely accessed and stuff that must go onto the SSDs anyway. Maybe just giving that FS an escape hatch to deal with the stuff that must be special and odd/weird is OK. I don't really want the DB magic for my swap files anyway.
I'd like to see a different abstraction and UX for "files", something that feels like a database but with looser usage patterns from a user's perspective, more intuitive to use and reason about.
I don't want to remember paths and folders and all that crap. I don't want to depend on my discipline to organize and I'm willing to trade some flexibility in return.
There are some alternatives, like tags, but none of them feel natural and they all require setup. I feel services like Dropbox had a real chance to try something, but the best they came up with was showing me "recent documents" on the homepage; wasted opportunity if you ask me.
This is why I set out to build a completely new system from scratch that could perform both file system and database operations extremely well. It will let you store hundreds of millions of files in a single container (a pod); attach a multitude of contextual tags to each one; and then find subsets of those files using database-like queries (e.g. Find all JPEG photos where event = 'wedding' AND camera = 'Canon EOS' AND size > 10 MB AND photographer = 'John'). It can find everything (even when there are millions of matches) in less than a second and even beats out Postgres and SQLite at most DB table queries.
https://www.youtube.com/watch?v=dWIo6sia_hw (file system functions) https://www.youtube.com/watch?v=Va5ZqfwQXWI (benchmark vs SQLite) https://www.youtube.com/watch?v=OVICKCkWMZE (benchmark vs Postgres)
Unfortunately I don't agree. I say unfortunately because I personally believe the knowledge model underlying a relational database system is a beautiful and fundamentally superior method of information management.
But I also have traveled enough roads in this industry to know that heterogeneity of approaches ends up defining actual practice -- for better and for worse -- and the "universality" of fopen/fread/fwrite/fclose of blobs of data is going to be hard to move away from.
I'd say that the only way that the "database" takes over the filesystem is by basically dropping to the lowest common denominator and becoming more like a filesystem; basically a key-value store. Throwing away the benefits of what a good database offers. Mainstream databases (in the form of SQL) only do a subset of what's possible with the relational model that Date&Codd laid out decades ago. I fear a move into a lower level of the stack would only worsen this.
In fact, I think the trend is kind of going the other way. Filesystems are slowly adopting more and more of the storage techniques that came out of DB research, but using them for the fairly anemic Unix FS storage model. Because that's basically what the industry is asking for.
Further, I think there remains a poverty of knowledge about databases generally in the industry. I don't have a CS degree myself, but almost everyone I know (outside of my current employer) who does have one has said some variant of this to me: "I didn't take any/many DB classes in school." Knowledge of the relational data model is weak/thin in our industry (the low point being the "NoSQL" wave a decade ago). And within CS academia, it seems that R&D in this field is mostly centred around Europe (with the notable exception in North America being the excellent programme @ CMU.)
Disclaimer: Employed by a database company.
The practical problem is that filesystems and databases are highly optimized toward different objectives at an architectural level. You can treat one like the other but the performance will be comparatively terrible.
This is made even more difficult by the practical necessity of backward compatibility, in the same implementation, with standards like POSIX, SQL, et al. that were never intended to be interchangeable at the implementation level.
In principle you could design a database to be used as an effective filesystem. In practice, the implementation wouldn't be very compatible with any other software you need to work with and the ability to integrate matters.
It seems like a solution looking for a problem. Indexers with file system hooks like APFS + Spotlight and the thing Haiku has seem like a more practical approach for similar results.
Microsoft most assuredly did not try to replace the filesystem with a database. WinFS was a relational database service that served as a central metadata store for Longhorn. It was slow but, most importantly, it never provided any value. The idea was that applications would store their structured data inside SQL Server cough I mean WinFS and then somehow they would be able to easily link their data. There was a complicated system of loading application schemas to accomplish this. Of course, if applications want to share data, they don't need to involve a brittle system SQL database.
Nothing is stopping you from storing blobs of data inside a database, exporting a POSIX API, and calling that a new and improved filesystem. But once again, it's hard to see what value you get from all this complexity. A filesystem is complex enough, and it doesn't even have to store structured data. You generally don't want the OS to handle this complexity; you have just added another failure mode to a part of the system that you really don't want to fail.
> Important edge cases (like swap) that databases do really poorly.
Wouldn't it have been possible to not use a one-size-fits-all approach? I'm thinking about things like moving the swap to a dedicated partition, like on Linux, and ditto for the other edge cases.
In earlier Windows versions, users sometimes had to fiddle with the swap file size. Even on newer versions, letting the system determine the size of the swap file may be the default option (I don't have a Windows machine nearby right now). Resizing a partition is a much more cumbersome process than resizing a file.
I don't think replacing the root FS with a database is a good idea. There are two options here:
1. Make it a database with schemas. In that case, you'll never be able to change or extend the language used to define schemas, since every application and OS will need to keep supporting older versions. It will also be hard to have multiple competing implementations of a complete database system, particularly because applications will come to rely on the performance characteristics of the reference implementation.
2. Make it schemaless, like a key-value store. That basically just is the file system, but non-hierarchical and with a faster implementation. In that case, your DB could just be a faster implementation of the existing filesystem interface or a natural extension thereof. The hierarchical nature of the filesystem is fairly essential if you want to allow multiple applications to avoid trampling on each other.
This pretty succinctly captures it.
A file system is, at its heart, a naming system. Given this name, return me a handle to "do things" with that object. In SQL it is "SELECT * FROM FILES WHERE NAME = <name>;". The next debate is what the schema for the files table should be.
Well-known things that would appear in that schema are "access rights", "ownership", "access times", etc. Additionally, related things like the "consuming application" might be there, and the "editing application", etc.
If these things were all relations/schema, then a lot of "warts" like the magic number or extension type, security certificates, integrity digests, OS requirement, symbol tables, or character encoding could become schema elements, and the names would get "purified" (scare quotes because architecture astronauts really get all hot and bothered by overloading names with multiple semantics).
Then you can do "views": when someone types a name at the shell, you can select from files that are executable on this OS and match the name.
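As a rough sketch of what that might look like (the table, columns, and view are all invented here, and it leans on SQLite syntax):

    # Illustration only: an invented schema for a hypothetical "files" table,
    # with the usual grafted-on attributes promoted to columns, plus a view
    # that plays the role of the shell's executable lookup.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE files (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        owner       TEXT,               -- ownership
        access_bits INTEGER,            -- access rights
        mtime       INTEGER,            -- times
        mime_type   TEXT,               -- replaces extension / magic number
        target_os   TEXT,               -- OS requirement
        editing_app TEXT,               -- "editing application"
        body        BLOB
    );

    CREATE VIEW executables AS
        SELECT name, id
        FROM files
        WHERE mime_type = 'application/x-executable'
          AND target_os = 'linux';
    """)

    # The naming-system case: SELECT * FROM files WHERE name = ?
    handle = con.execute("SELECT id FROM files WHERE name = ?", ("ls",)).fetchone()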
There are basically a whole bunch of things that are "grafted on" to the file system, and they can (and do) get out of date relative to the files and cause problems. (The canonical cases are moving an executable from one directory to another, making it no longer visible for execution, or changing the extension or magic number, resulting in the wrong thing being used to try to run/edit it.)
The architecture reasoning goes, "If there was just one source of truth about these things, a database, then a whole bunch of bugs and user annoyances would vanish."
Anyway, I've witnessed people start down this grand vision, devolve into your #2 above (key/value store) and then finally throw their hands up in surrender.
One of the systems questions I ponder sometimes is at what point is memory so cheap and plentiful that replacing the buffer cache in the kernel with a giant interconnected schema like this gives equivalent "time to record" values. An RDMA accessible victim cache for "blob" buffering might help too.
A couple of people have taken runs at it as "object storage" or "storage as a service" but it isn't quite there yet.
Still, for a complex file format (say MPEG-4), having it be a database gives you some advantages and makes writing file component parsers unnecessary. So that's a win.
Q: Isn’t the filesystem of a system a sort of database? Not in the sense of a query back-and-forth, but in the sense of holding all file records and locations.
The cool part about DBs is interleaving storing/manipulating and querying. In that sense a raw hard drive is already a DB just by letting you write to it, but no one would consider it a decent DB.
A filesystem is a database. It is not relational and has a different API, but the main difference is in the design goals and the set of tradeoffs.
What is the backend for the storage buckets we see in cloud (and increasingly, HPC) computing? Seems like a database filesystem would be a good fit there.
I thought they were all some version of a Hadoop distributed file system.
MinIO has millions of installations according to its documentation. Whether this is the building block in your favorite cloud is mostly hidden and unknown to users.
Those can be classified under object storage.
I don't understand how this is possible, as a database is an abstraction on top of the filesystem. Perhaps for certain workloads it may hit RAM more frequently than disk, making it faster. But for large binary blobs, SQLite is way slower than the file system on Linux.
Doesn't the IBM iSeries have a database as its file system?
As others have pointed out, that benchmark is from 2017. However, the "SQLite: Past, Present, and Future" paper[0] has an updated version of this benchmark (see section 4.3 Blob manipulation), and also compares it with DuckDB.
Edit:
Another thing that is sometimes forgotten when comparing SQLite to the filesystem is that files are hard[1]. It's not only about performance, but also about all the guarantees that you get "for free", and all the complexity you can remove from your codebase if you need ACID interactions with your files.
[0]: https://www.vldb.org/pvldb/vol15/p3535-gaffney.pdf
[1]: https://danluu.com/file-consistency/
Rename is totally atomic in a way that is simple to reason about. Ten processes try to rename to a per-process name, and just one wins, even on NFS. That one is free to do whatever and then rename back.
Yes, rename is atomic, but that doesn't get you very far. For example, if you want to do a partial update to a file, you need to copy the file (so you already need more storage), edit the copy, and then rename it. But you can't do this between multiple processes, because maybe you copied the file while another process was already changing a different copy (but had not yet renamed it). And that's not even considering fsync, or supporting different operating systems.
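For illustration, a minimal sketch of that copy/edit/rename dance, using an exclusive-create temp name per process; it deliberately ignores the multi-writer coordination and cross-OS issues mentioned above:

    import os

    def update_file_atomically(path, transform):
        # Copy, edit, rename: write the transformed copy under a per-process
        # temp name (exclusive create), then atomically swap it in. This does
        # NOT coordinate concurrent updaters by itself.
        tmp = f"{path}.tmp.{os.getpid()}"
        with open(path, "rb") as f:
            data = f.read()
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        try:
            with os.fdopen(fd, "wb") as out:
                out.write(transform(data))
                out.flush()
                os.fsync(out.fileno())
            os.replace(tmp, path)  # the atomic rename step
        except BaseException:
            os.unlink(tmp)
            raise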
> It's not only about performance, but also about all the guarantees that you get "for free"
Note that some filesystems have mount options that let you remove some of those guarantees.
This reminds me of WinFS, which was probably one of the most ambitious architectural projects in the history of Windows, yet it failed to materialize. The vision was so attractive: encode all the semantic knowledge of file schemas and metadata into a relational filesystem, so you can programmatically query whatever information about the filesystem and its content.
IIRC the problem was its performance. I don't have any insider knowledge, so I cannot pinpoint the culprit, but I suspect the performance issue was probably not a fundamental tradeoff (as this article suggests) but more a matter of an immature implementation. Storage technologies have gotten much better nowadays, so many of its problems could be tackled differently. Of course, the question it has to answer is also different: is it still a problem worth solving?
That's a good take. I wonder if those engineers hopped around and said similar things when Apple or others tried to implement the same?
I wonder if the performance would be substantially improved by NVMe drives.
On Windows, in my experience, it's at least a factor of 10. I worked on a script recently that reads ~20,000 files of a few kB each and extracts some info to generate a web page. I sped it up enormously just by putting the file contents into SQLite tables.
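Not the parent's actual script, but a sketch of that kind of one-time ingest (the `docs` table name is invented), so later passes read one SQLite file instead of doing ~20,000 open/read/close round trips:

    import sqlite3
    from pathlib import Path

    def ingest(root, db_path="contents.db"):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS docs (path TEXT PRIMARY KEY, body BLOB)")
        with con:  # one transaction for the whole batch
            con.executemany(
                "INSERT OR REPLACE INTO docs VALUES (?, ?)",
                ((str(p), p.read_bytes()) for p in Path(root).rglob("*") if p.is_file()),
            )
        con.close()

    # Later passes query the single database file instead of reopening files:
    #   con.execute("SELECT body FROM docs WHERE path LIKE ?", ("%.log",))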
Lots of small files (particularly in a single directory) is a known failure mode of many OS filesystems. I remember putting a million files into MogileFS and finding that filesystem operations basically did not complete in any reasonable length of time.
It also seems that some “do one thing well” tools want to implement their logic using files. For example: each entry in a password database is a file (pass(1)).
The overhead of such small files might not matter, though, if the problem space is kept "human sized" (whatever one human can be bothered to manage).
I used to have tens of thousands of files in git-annex. I had to tarball (chunk) some of the things that I never really use in order to speed up `git annex fsck`.
Yes I did an experiment years ago with 6.5 million files and it was disaster (https://breckyunits.com/building-a-treebase-with-6.5-million...).
However, things have totally changed with the M1 generation of MacBooks. Things that were once near impossible now run in an instant. I need to redo this experiment.
Yeah I got absolutely crushed by this when trying to migrate a Windows Server 2016 machine to unRAID, whose filesystem is absolutely horrible when dealing with thousands of smaller files. Wiped out a month of work for NAS-related activities; we're back on Windows again.
Windows struggles horrendously with lots of small files. If you tried the same task on Linux you’d see a large jump in performance. I don’t know if Linux small file access could match SQLite but it would be a lot closer.
That's because Windows' FS offers quite a lot of features that many other FSs don't -- especially SQLite doesn't. You might not need those features, of course.
One core feature is that Windows offers hooks ('filters') that allow other components to put themselves between the client program and the files. This is how virtual filesystems work (like OneDrive, etc.), and how anti-malware works.
When you read from SQLite, those reads can't come from another server, the objects can't be scanned automatically, etc. Again, you might not need these features, but it's not that SQLite or ext4 is somehow magically faster; they just made different design choices.
Depends. A FUSE filesystem + SMB was so bad (unRAID; Linux underneath) that I had to switch back to Windows, which works mostly fine.
Yes, Linux does a lot better (haven't tried that specific script, but have done a lot of similar things), but I've gotten speed improvements with similar use of SQLite there too, especially when dealing with a lot of files in the same directory.
Try a zip file. It might work even better.
I would hate to see rename semantics in this case.
Whenever I have lots of files that need to go into a directory, I'll split into subdirs by prefix (or other partitioning scheme)--look in .git/objects for example.
Also be careful if the function that returns directory contents sorts the results when you don't need them sorted.
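A small sketch of that prefix-sharding idea, in the style of .git/objects; the choice of SHA-1 and a two-character fan-out is just for illustration:

    import hashlib
    from pathlib import Path

    def sharded_path(root, name):
        # Derive a stable two-character prefix so no single directory
        # accumulates millions of entries (.git/objects style).
        digest = hashlib.sha1(name.encode()).hexdigest()
        return Path(root) / digest[:2] / digest[2:]

    def store(root, name, data):
        p = sharded_path(root, name)
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_bytes(data)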
I think an in-process database will always have a large advantage in this scenario, as going through the filesystem means at least three system calls (open, read, close) per file.
That's not to say filesystems couldn't be improved.
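To make the syscall difference concrete, a rough sketch (the `docs(path, body)` table is an assumption, not from the article): the file loop pays an open/read/close per item, while the database loop reuses one open connection:

    import sqlite3
    from pathlib import Path

    def read_all_files(paths):
        # One open() + read() + close() per file.
        return [Path(p).read_bytes() for p in paths]

    def read_all_blobs(db_path, keys):
        # The database file is opened once; each blob is just a query
        # against the already-open connection.
        con = sqlite3.connect(db_path)
        bodies = [con.execute("SELECT body FROM docs WHERE path = ?", (k,)).fetchone()[0]
                  for k in keys]
        con.close()
        return bodies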
The summary says:
> SQLite reads and writes small blobs (for example, thumbnail images) 35% faster¹ than the same blobs can be read from or written to individual files on disk using fread() or fwrite().
> The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database.
This should come as a surprise to no one. The rest of the article is only of interest in terms of how benchmarking is done, if that's your thing.
Did they read and write all the data a filesystem handles?
Off the top of my head, a typical filesystem stores not just file contents but also metadata: permissions, ownership, timestamps, extended attributes, and so on.
Turning off `atime` is about the first thing you do when you care about filesystem performance.
That advice is mostly obsolete; lazytime has been a thing for years.
Did this for a mdadm raid system using btrfs and the difference was very noticeable on a CI build machine.
Maybe not, but if all small files in a directory have the same structure, permissions, owner and same parent directory, and you don't care about x-times, you might as well use this context.
Maybe node_modules could move to SQLite.
Maybe even with transparent compression too?
Not sure about compression but somebody could probably hack it in an afternoon using this:
https://github.com/guardianproject/libsqlfs
or something similar to check the potential for speed up.
In 2017 you didn't have io_uring yet. Though that doesn't explain it for Windows and Android.
Windows is slow due to all the filter drivers used by antivirus, custom FS drivers, and other stuff. And the lack of a dentry cache. Basically all the reasons why WSL1 is slow. On Linux you'll get slowdowns too if you have some online antivirus hooking into fanotify. I suspect Android is also slower due to additional security hooks (e.g. SELinux).
And yeah, io_uring helps since open-write-close can now be condensed into a single syscall.
Ahhhh, this is the real answer!
Edit:
Holy shit, IO rings released on Windows Preview in 2021...
This could use (2017) (original benchmark date) or (2021) (last modified time) in the title.
Anyone else think maybe AWS (S3) has made this optimization already? Or would it just be a whole team of kernel engineers optimizing it there?
The overhead on CPU cycles this would save cloud storage systems... Can someone help me quantify the potential savings?
Edit:
They specifically don't list their storage medium on their marketing:
https://aws.amazon.com/s3/storage-classes/
Object stores typically do optimizations where they store small files using a different strategy than large ones, maybe directly in the metadata database.
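A toy sketch of that split policy; the cutoff, schema, and on-disk layout are invented, and real object stores are far more involved:

    # Toy version of the "small objects inline, large objects elsewhere" policy.
    # Keys are assumed to be safe to use as file names.
    import sqlite3

    INLINE_LIMIT = 128 * 1024  # invented cutoff

    def setup(con):
        con.execute("""CREATE TABLE IF NOT EXISTS objects (
                           key TEXT PRIMARY KEY,
                           inline_body BLOB,   -- small objects live in the metadata DB
                           blob_path TEXT      -- large objects live as separate files
                       )""")

    def put(con, blob_dir, key, data):
        if len(data) <= INLINE_LIMIT:
            con.execute("INSERT OR REPLACE INTO objects VALUES (?, ?, NULL)", (key, data))
        else:
            path = f"{blob_dir}/{key}.blob"
            with open(path, "wb") as f:
                f.write(data)
            con.execute("INSERT OR REPLACE INTO objects VALUES (?, NULL, ?)", (key, path))
        con.commit()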
So that means yes but also they've gone past that optimization?
S3 has always been designed and optimised for large files.
In order to maintain high availability they deliberately trade away latency.
So this blog post only really applies to local filesystems, not object stores like S3.
Facebook published a paper on their storage system for pictures, Haystack, which IIRC is something like slab allocation.
S3 is similar, in the sense that it has a completely different usage pattern than a file system (no hierarchy, direct access, no need for efficient listing, etc.), so I'm pretty sure they use something similar.
Even if it is bypassing the file system, S3 is itself essentially a file system. It has all the usual features of paths, permissions, and so on. I assume it can't completely escape the same issues.
S3 is a key-value store where object keys might contain slashes, but the implied directories don’t really exist. This is a problem for Spark and Hadoop jobs that expect to rename a large temp dir to signal that a stage’s output has been committed, because HDFS can do that atomically but S3 requires renaming objects one by one. IAM security policies also apply to keys or prefixes (renaming an object might change someone’s access level) and changes are cached for tens of minutes.
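For illustration, the per-object dance that a "directory rename" turns into, sketched with boto3; there is no atomicity, so anything listing the bucket midway sees a mix of old and new prefixes:

    import boto3

    def rename_prefix(bucket, old_prefix, new_prefix):
        # Every object under the old prefix is copied and deleted one by one.
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=old_prefix):
            for obj in page.get("Contents", []):
                old_key = obj["Key"]
                new_key = new_prefix + old_key[len(old_prefix):]
                s3.copy_object(Bucket=bucket, Key=new_key,
                               CopySource={"Bucket": bucket, "Key": old_key})
                s3.delete_object(Bucket=bucket, Key=old_key)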
S3 didn’t used to be strongly consistent, though surprisingly they delivered https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-s3... which I hope they’re proud of.
Some people have been crazy enough to store tables of padded data in the keys of a lot of zero-length objects (which they do charge for) and use ListObjects for paginated prefix queries. It doesn’t much matter whether keys have slashes or commas or what.
But that would be 1 layer up in a network of that size, no? Edit:
Let's call it "Ring -1"
Title should have (2017)?
Long ago we had a contrasting experience. It had been assumed that the implemented "database" (not a general database) for storing nuclear spectroscopy data was needed because the filesystem was too slow. However, one of our "physicist programmers" decided to do the experiment, and found it wasn't so, and the system was re-designed around directories of files of spectra.
On the other hand, the other facility in the lab had consulted Logica on storage of similar data, who viewed x-y (or multiple dependent variables) data as tables suitable for storage in their early RDB, Rapport. That wasn't actually used in production, and the storage format for the table model was unfortunately usually mangled by data acquisition systems writing files.
That claim sounds more preposterous than it actually is, considering that databases and filesystems are both just software using the same hardware in a different manner.
More accurate title: "Operations within a file are faster than directory operations"
«on-disc filesystem is still slower than access to an in-memory stored structured data collection»
25 years ago this was done/tried with Pgfs, an NFS server with Postgres underneath it... https://www.linuxjournal.com/article/1383
Would it make sense to store images as blobs if I’m building a web app using SQLite then? Or is this specifically small images only? Saves having to do backups of data and images separately :)
No. This is a microbenchmark. They are really great for writing blog posts that get a lot of attention, and not much more.
In reality, there are other operations you want to do on your files. Concurrent access, updates, and deletes are probably not as great. Backing up a database with terabytes of binaries inside is also non-trivial. Backing up terabytes of files is a lot simpler.
Can you elaborate on why it would be simpler to back up terabytes of files instead of just one file?
If this is a small project, and it saves you time and makes the backup process easier for you, it's worth trying out. I've found backing up SQLite databases with Litestream to be one of the nicest quality of life things I've found in a while.
I’d say test it for yourself based on your specific data shape and workload.
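For example, a rough timing harness along these lines (the `images(name, body)` table and variable names are invented) can show which side wins for your particular image sizes and access pattern:

    import sqlite3
    import time
    from pathlib import Path

    def bench(label, fn, repeats=3):
        best = min(_timed(fn) for _ in range(repeats))
        print(f"{label}: {best:.3f}s")

    def _timed(fn):
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    def read_from_disk(paths):
        for p in paths:
            Path(p).read_bytes()

    def read_from_sqlite(db_path, names):
        con = sqlite3.connect(db_path)
        for n in names:
            con.execute("SELECT body FROM images WHERE name = ?", (n,)).fetchone()
        con.close()

    # bench("files", lambda: read_from_disk(my_paths))
    # bench("sqlite", lambda: read_from_sqlite("app.db", my_names))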
Is it possible to mount a database so that its rows (or columns, or whatever you have selected upon the call to mount) are accessible as regular files?
On Linux, this could be done using a userspace FUSE driver, I guess.
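As a hedged sketch of that approach, using the third-party fusepy package and an invented read-only `docs(name, body)` table:

    # Read-only FUSE view of a SQLite table; the docs(name, body) table is
    # invented, and the fusepy package is assumed to be installed.
    import errno
    import sqlite3
    import stat
    import sys

    from fuse import FUSE, FuseOSError, Operations

    class RowFS(Operations):
        def __init__(self, db_path):
            self.db = sqlite3.connect(db_path, check_same_thread=False)

        def readdir(self, path, fh):
            names = [r[0] for r in self.db.execute("SELECT name FROM docs")]
            return [".", ".."] + names

        def getattr(self, path, fh=None):
            if path == "/":
                return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
            row = self.db.execute("SELECT length(body) FROM docs WHERE name = ?",
                                  (path.lstrip("/"),)).fetchone()
            if row is None:
                raise FuseOSError(errno.ENOENT)
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=row[0])

        def read(self, path, size, offset, fh):
            row = self.db.execute("SELECT body FROM docs WHERE name = ?",
                                  (path.lstrip("/"),)).fetchone()
            if row is None:
                raise FuseOSError(errno.ENOENT)
            return row[0][offset:offset + size]

    if __name__ == "__main__":
        FUSE(RowFS(sys.argv[1]), sys.argv[2], nothreads=True, foreground=True)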
Don't know if SQLite has a feature or extension for that, but MS SQL Server has one:
https://learn.microsoft.com/en-us/sql/relational-databases/b...
Not sure how much use it gets though, I never hear anyone talk about it, including Microsoft folks.
How does it do this? Does it cache changes and only write them every so often? Does it keep a file open to write uncommitted changes?
The answer is complex, but boils down to "bad testing practices", or "measuring the wrong thing".
To expand on it a bit: it's not known how the author created, configured, and mounted the filesystem they were using; neither the make nor the version of the filesystem is given. The author of the test doesn't even seem to know that any I/O test worth its salt needs to "warm up" the system for the results to be reliable. Instead, they run the test multiple times and average the results.
To give examples of things that may influence the speed of such tests:
* Is the filesystem mounted with the atime option? If it is, it will generate more I/O, as open() now generates writes besides reads. This is just one example of mount options affecting the test.
* How is the filesystem configured to store directory info? Sometimes it's possible to embed this info in the inode; other times it's a linked-list-like structure that will potentially generate more I/O requests. This is, again, but a single example of a category of factors that have to be controlled for.
* Does the filesystem support journaling, snapshots, deduplication, compression? Is it parallel? How big is it / how fragmented is it? How much memory is available for caching? How is the system's I/O merging configured? And the list of questions not covered by the authors of the test goes on.
The explanation the authors themselves came up with for the results they see is this: the way they designed the filesystem tests, they call open() and close() a lot, and they don't do that in their database tests. But open() and close() don't have a fixed "price". Their performance will depend on many of the factors listed above and on the options given to open().
From what I can tell, the I/O is performed in a blocking way, in a single thread, which is the worst way to perform I/O if you want good performance. So, in this test, both SQLite and the filesystem suck, and if you wanted to make them go faster, you definitely could. In particular, you could improve the filesystem case.
Interesting, I suspected some of those things. Thanks.
I'll start with an anecdote. Scroll down if you want just the conclusion.
I work in test automation, in the general storage area. I worked for some years with a distributed filesystem (think something like Lustre, but modern and fast), then worked on something like DRBD, but, again, more modern. Not sure how fast (I never ran any benchmarks on DRBD). I've had to deal with filesystems like Ceph's filesystem, BeeGFS...
Anyways. When I worked on the DRBD analog, let's call the product "R", one of my tasks was to figure out how well a database would work on top of R. Well, "database" is a very broad term. I figured I'd concentrate on a couple of well-known relational databases. PostgreSQL turned out to have the most to offer in terms of insight into its performance. Next, I had to find a suitable configuration for the database. And that's where things got really complicated. To spare you the gory details: the more memory you can spare for the benefit of your database server, the better; the more you can relax the requirement of synchronizing with persistent storage (i.e. fsync and friends), even better.
At the end of the day, I had to abandon this kind of testing because, essentially, the more memory or the more replicas I could give it (which lets you not care about destaging to persistent storage), the bigger the numbers I could produce, which made the question "how well does R compare to a plain block device?" irrelevant.
---
Fast-forward to this article. It gives off a vibe of "filesystems are not efficient; if you re-arrange something in the logic of doing I/O you can gain more performance!". And that really reads like a "10 things doctors don't want you to know!" advertorial. It's similar to the misguided idea, which I often encounter among people who aren't systems programmers, that "mmap is faster".
Now, understanding why "mmap is faster" is nonsense will help you understand why benchmarking a database on top of a filesystem and comparing performance doesn't make a lot of sense. To properly compare the speed, we need to make sure we compare both the good and the bad paths. What happens when I/O errors occur when memory-mapping files versus when using other system APIs? I invite you to explore this question on your own. Another question you need to ask yourself in this comparison is "how well does this process scale": on a single CPU with a single block device, multiple CPUs with a single block device, a single CPU with multiple block devices... And on top of this, what if we consider a Harvard architecture (to an extent, IBM's mainframes are one; at least there, general I/O is separate from the rest of computing)? In other words, what if our hardware knows "some tricks"? (Other examples include the kinds of drivers and protocols used to talk to the hardware and whether the storage software, even if running on top of a filesystem, will know about or be able to take advantage of them; i.e. NVMe allows a big degree of parallelism, but will "mmap" be able to utilize that? Especially if multiple files are mapped at the same time?)
And, of course, there is a difference between what has actually been done and what guarantees can be given about the state of the data during and after the I/O completes. These details will greatly depend on the specifics of the filesystem being used. For example, if you wanted to write to the device directly (as in with "O_DIRECT | O_SYNC"), but the filesystem is something like ZFS or Btrfs (i.e. it needs to checksum your data, among other things), then you might get confused about what you are actually comparing (direct I/O would imply less I/O than is actually necessary to give durability guarantees that may not be given by an alternative storage).
---
So, a better title for the article would have been "It's possible to carve out a use case where SQLite works faster than a similar process designed to use only the system API to access a filesystem". Which is essentially saying: you, the programmer, don't know how to use a filesystem as well as we do... And, in my experience, most programmers are clueless when it comes to using a filesystem, or any other kind of storage really. So that's not surprising, but it's still not as sensationalist as the original title.
I wonder how using SQLite like this compares to something like squashFS.
SquashFS is read-only and requires elevated permissions to mount, but also presents as a true filesystem.
Sure those are differences. I really mean: how does something deliberately designed to be a file system in a file compare to SQLite being used that way?
> but also presents as a true filesystem.
As does:
https://github.com/guardianproject/libsqlfs
https://github.com/narumatt/sqlitefs
(I know nothing about these, just got them from a quick search)
Not if there are lots of files. mmap can be good for one big file.