
Comment by overcast

5 years ago

What happened to storing blobs in a database being a giant rookie mistake? I'm currently dealing with this at the office, where a SQL Server database has grown to an unmanageable size from helpdesk software attachments.

That's the thing: this isn't "a database" anymore when it's used as an application file format. It's now "an application file format" that conveniently happens to also be a SQLite db file, so everything in it is trivially accessible.

Storing huge blobs then becomes a matter of "is this my data, or is this general data that I'm merely also making use of". Examples abound of both:

- Adobe Lightroom (a non-destructive photo editor) goes with "these are not my images" and leaves them out of its SQLite "lrcat" files, instead maintaining a record of where the files can be found on disk.

- Clip Studio (a graphics program tailored for digital artists) on the other hand stores everything, because that's what you want: even if you imported 20 images as layers, that data is now strongly tied to a single project, and should absolutely all be in a single file.

So the key point here is that, yes, SQLite files are databases, but because they're also single files on disk, their _use_ extends far beyond what a normal database allows. As an application file format, the rules for application files apply, not the rules for standard databases.

Storing your blobs outside the database when they are an inherent part of some larger transactional model is the giant rookie mistake, IMO. We did this for a little while for reasons of "oh my gosh, look how big the .db is getting," and it ended up a complete shitshow. Simple tasks like cleaning up old entries can quickly turn into a nightmare if you have to walk the disk to make sure you are not leaving garbage behind. There are also concerns outside the box, like VM snapshots, which can leave you in a very inconsistent state regarding which blobs you have on disk vs. what the database expects.
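The cleanup point above is easy to show concretely. With the blob in the database, deleting an old entry and its attachment is a single atomic operation; a minimal sketch using Python's sqlite3 (the schema and data are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tickets(id INTEGER PRIMARY KEY, subject TEXT, attachment BLOB)")
con.execute("INSERT INTO tickets VALUES (1, 'printer on fire', x'DEADBEEF')")
con.commit()

# Deleting the row deletes the blob with it -- one transaction,
# no orphaned files left behind on disk if this crashes halfway.
with con:
    con.execute("DELETE FROM tickets WHERE id = ?", (1,))

print(con.execute("SELECT COUNT(*) FROM tickets").fetchone()[0])  # -> 0
```

With blobs on disk, the same cleanup needs a delete plus a file removal, and a crash between the two leaves garbage or a dangling reference.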

Is having everything in one gigantic .db file an actual downside? What makes a database's size "unmanageable"? Presumably, you'd have to store that business information somewhere anyways, right? I don't understand how 1 unified file is unmanageable, but splitting that concern up into 2 different domains magically makes it better.

  • It can trash db performance. A relatively simple alternative is to store references to immutable blobs in the db so that everything remains transactionally consistent (and easily restorable to any point-in-time). This is what many large companies do. I happen to work at a company that runs the largest + most "shall not be named" databases and has struggled for years to evict all the blobs people shoved in em.

    • If you index on file_name in a table that has file_name and a blob, why would it trash performance? SQLite can quickly locate where to start reading the blob.

Did you read the article? This isn’t about storing blobs as data in SQLite, but rather using SQLite-formatted databases as the file format for an application.

As a simple example, Word documents are just zipped XML text files (try unzipping a .docx and looking inside). Instead of that, you could use a SQLite .db file (probably with a different extension), translating the XML files into tables and folders into databases. The OpenOffice case study has more details: https://sqlite.org/affcase1.html
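To make that concrete, here's a hedged sketch (Python; the entry names are made up) that builds a tiny .docx-like zip in memory and copies its entries into the one-table SQLite schema the article describes:

```python
import io
import sqlite3
import zipfile

# Build a tiny stand-in for a .docx: a zip archive of XML "files".
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:document>hello</w:document>")
    z.writestr("[Content_Types].xml", "<Types/>")

# Copy every zip entry into a SQLite table: path -> content.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB)")
with zipfile.ZipFile(buf) as z:
    for name in z.namelist():
        con.execute("INSERT INTO files VALUES (?, ?)", (name, z.read(name)))
con.commit()

for (name,) in con.execute("SELECT filename FROM files ORDER BY filename"):
    print(name)
```

The difference from the zip is that individual entries can now be read, updated, and queried transactionally without rewriting the whole archive.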

  • Apparently I'm misreading the following paragraph.

    Any application state that can be recorded in a pile-of-files can also be recorded in an SQLite database with a simple key/value schema like this:

        CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB);

    • I agree that is unclear for an introduction, and that the article could have been written better. The “pile-of-files” they are referencing isn’t a filesystem of arbitrary and unstructured binary data stored for humans to look at later (like your Helpdesk attachments), but rather structured and specific data for an application to read in a well-defined manner.

      Your Helpdesk example used SQL Server problematically, because a SQL database shouldn't be used as an arbitrary file store. But if you know what the file structure is and have a reasonable grasp of how it might scale (each binary blob is small, each user only adds one row to the database, etc.), there are huge advantages to "a SQLite table with lots of binary and text columns" versus "a folder with lots of binary and text files." And if those text files are just small key-value pairs, then maybe they should also go in SQLite.
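      A hedged sketch of that advantage (Python's sqlite3; table and contents are made up): with the files in one table, you can update one of them and run queries across all of them atomically, which a pile-of-files can't do without extra locking.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB)")
con.executemany("INSERT INTO files VALUES (?, ?)",
                [("notes.txt", b"hello"), ("thumb.png", b"\x89PNG....")])
con.commit()

# Update one "file" and aggregate over all of them in one transaction.
with con:
    con.execute("UPDATE files SET content = ? WHERE filename = ?",
                (b"hello world", "notes.txt"))
total = con.execute("SELECT SUM(LENGTH(content)) FROM files").fetchone()[0]
print(total)  # -> 19
```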

    • It is reasonable to adopt that style if it is data you don't control. Storing blobs is probably a mistake if it is data you have control over.

      Sometimes all that is known about data is that it exists - in that case, into the database as a blob it shall go. If it can be decomposed, it probably should be.

  • The OpenOffice case study is a fascinating thought experiment.

    We don't really create many new file formats these days, and when we do, they're highly performance-specialized (Parquet, Arrow).

    Just wondering aloud, what recent file format would have benefited from being based on sqlite?

Storing blobs is good if the blobs are what you actually want to store; the rookie mistake is storing blobs instead of the well-structured, meaningful, indexable, and much more compact columns and additional tables that you should be storing but are too rookie to design.

I don't think there is a consensus that storing blobs in the database is always the wrong decision. Yes, there are drawbacks, but it really depends on the circumstances, and it might be the right choice.

Suppose, for example, that you were going to make your own implementation of Git. The normal implementation has a bunch of blobs under one directory. There's not much use in manipulating this directory separately from everything else under .git. The blobs and most of the non-blob stuff are only useful together anyway, so you already manage them as a single collection.

It could create problems for certain backup tools, so that's a disadvantage. But it also simplifies application development. So it's not a slam dunk one way or the other.
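A minimal sketch of that hypothetical SQLite-backed Git implementation (Python; content-addressed by SHA-1, the way Git keys its objects -- the schema and helper names are made up):

```python
import hashlib
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects(oid TEXT PRIMARY KEY, data BLOB)")

def put(data: bytes) -> str:
    # Content-addressed: the key is the hash of the bytes, as in Git,
    # so storing the same content twice is a no-op.
    oid = hashlib.sha1(data).hexdigest()
    con.execute("INSERT OR IGNORE INTO objects VALUES (?, ?)", (oid, data))
    return oid

def get(oid: str) -> bytes:
    return con.execute("SELECT data FROM objects WHERE oid = ?", (oid,)).fetchone()[0]

oid = put(b"blob contents")
assert get(oid) == b"blob contents"
```

Blobs and refs then live in one file with one consistency story, instead of a directory tree of loose objects.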

  • Relatedly, this is (from what I understand) exactly how Fossil works for version control; it's literally a single SQLite file storing blobs representing changes to the repository.

I've never faced this problem, so perhaps I'm missing something obvious, but what's the big difference between storing binary blobs in the DB vs., say, storing URLs in the DB and hosting the blobs elsewhere? Is the size of the database the main concern?

  • Both approaches have their advantages.

    Storing blobs in the DB makes the data more consistent (the database is always in sync with the file content) and lets you use advanced features of the DB on these files (transactions, SQL queries, data rollback...). You also only have to back up the DB.

    Storing links to the objects is usually more scalable, as your DB will not grow as fast. DBs are usually harder to scale, and also more expensive (at least 10x per GB).

    It really depends on the project, but I'm favoring more and more storing the data in BLOBs, as it makes backups easier and as I can use SQL queries directly on the data. Databases as a service also make it easy to scale DBs up to +/- 2TB. But the cost might still be an issue.

    • >and also more expensive (at least 10x per GB).

      Expensive in what way? Memory/Compute? It can't be licensing money since SQLite is public domain.


I think that refrain usually applies to when the database is used to store relational data, rather than when it's being requisitioned for use as a file format.