SQLite as an Application File Format (2014)

5 years ago (sqlite.org)

The great thing about applications that use sqlite as their application file format is that you can write secondary apps and utilities to supplement the applications themselves.

For example, Adobe's Lightroom uses sqlite as their application file format, which means that it's almost trivial to write an application to help it do things it's either really bad/exceedingly slow at (like removing images from a catalogue that no longer exist on disk) or literally can't do (like generating a playlist of files-on-disk for a specific collection or tag), without being locked into whatever janky scripting solution exists in-app. If there even is one to begin with.

All you need is a programming language with a sqlite connector, and you're in the driving seat. And sure, you'll need to figure out the table schemas you need to care about, but sqlite comes with sqldiff so it's really easy to figure out which operations hit which tables with only a few minutes of work.
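
For instance, a minimal Python sketch of such a secondary utility (the table and column names here are invented; you'd discover the real schema with the sqlite3 shell's .schema command or by watching sqldiff output):

```python
import sqlite3

# Build a stand-in "application catalog" so the sketch is self-contained;
# the images table and its columns are hypothetical, not any real app's schema.
conn = sqlite3.connect("catalog.db")
conn.execute("CREATE TABLE IF NOT EXISTS images (name TEXT, path TEXT, tag TEXT)")
conn.execute("DELETE FROM images")
conn.execute("INSERT INTO images VALUES ('a', '/photos/a.jpg', 'holiday')")
conn.commit()
conn.close()

# A secondary utility only needs read access; opening the file read-only
# means a bug in your tool can't corrupt the application's data.
ro = sqlite3.connect("file:catalog.db?mode=ro", uri=True)
paths = [p for (p,) in ro.execute("SELECT path FROM images WHERE tag = 'holiday'")]
ro.close()
print(paths)
```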

Good luck reverse engineering proprietary file formats in the same way!

  • I have an application that uses sqlite as their file format but they've cleared the header so you can't open it directly.

    I'd love to be able to make secondary applications like you've described but being enterprise software they don't want to make it too easy.

    They obviously want to keep people locked in with their $40k per seat application!

    I guess the first step is figuring out the page size and the other bits of metadata set in the header [1].

    I know I just have to sit down and understand the format better and I will eventually figure it out...

    [1] https://www.sqlite.org/fileformat.html
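
If it really is just the magic string at the start that's been cleared, a first experiment (offsets per the file-format doc linked above; the assumption that only the first 16 bytes were zeroed is mine) could look like:

```python
import struct

# Per https://www.sqlite.org/fileformat.html: bytes 0-15 hold the magic
# string, and offset 16 holds the page size as a big-endian u16 (the
# value 1 means 65536).
MAGIC = b"SQLite format 3\x00"

def page_size(header: bytes) -> int:
    (raw,) = struct.unpack_from(">H", header, 16)
    return 65536 if raw == 1 else raw

def restore_magic(data: bytes) -> bytes:
    # Assumes the vendor only zeroed the magic string; if other header
    # fields were altered too, this won't be enough.
    return MAGIC + data[16:]

# Demo on a fake 100-byte header with the magic cleared, page size 4096:
fake = bytes(16) + struct.pack(">H", 4096) + bytes(82)
fixed = restore_magic(fake)
print(page_size(fixed))
```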

    • That's mean, but you'll just have to live with that. If someone takes active steps towards preventing people from making secondary apps/utils, then they're (a) jerks, and (b) allowed to do that.

      Might be worth actually contacting them to ask why, if you can make the case that secondary applications will increase the value of their app, not decrease it.

      34 replies →

    • At $40K/seat, that must be some special software. Simple economics invites competition. Depending on the software, it may not take too many seats to incent the funding of an alternative.

      Maybe this is the kind of software that requires huge development costs. But maybe it would be worth 20 seats' worth of customers joining forces to fund a team of 5 people to build you a competing app tailored to your specific needs/wants and completely under your control.

      Granted, that could bump your costs from $800K/year to $1.6M+/year. But only short-term. Once your software is production-ready, you drop the costs of your current software. So think of it more like going from $8M/10 years to $6-10M/10 years but having complete control to add the features you want. And perhaps having the opportunity to recoup $millions/year by licensing to others. Or, open source it and give others the same kind of control while benefiting from the features they add. Spread your development costs across more seats to further lower your $/seat.

      Or, look at the 100 employees your vendor currently has and lose heart, then hope somebody with deep pockets funds a competitor.

      6 replies →

    • I might be wrong but I thought Dropbox used to do that as well. All the sync state is stored in a sqlite file with the header changed or removed.

  • The only thing stopping SQLite from becoming the primary application file format is browsers. Every single other client supports SQLite. Embedded, Mobile, Lightroom, Desktop Apps and Cross platform mobile frameworks have really good support for SQLite.

    • Primary application file format for... what kind of files? Because it's certainly not going to (nor should it ever) replace hypertext documents.

      Being able to load a read-only .sqlite database might seem cool, but I can't think of a single instance in which that's smaller and/or more efficient than calling data endpoints that use gzip/brotli for transport.

      Or are you thinking "as replacement for IndexedDB"? In which case, hard agree but then it's an actual database, not used as file data container.

  • > sqldiff

    You just blew my mind. Thank you!

    I can't believe the things I find out from random HN comments.

  • I tried to do that with Firefox's SQLite database files, but they had an exclusive lock.

    • This is the case for any application that constantly writes to their dbs, so you usually run your own application/utility on them when the "parent" application isn't running.

  • Is it any different from JSON or XML?

    • that's like asking whether .zip is any different from .txt - yes, it's quite a lot different. A database-that-is-a-file-on-disk, used as an application file format for containing arbitrary - and arbitrarily complex - data, is almost nothing like a plain-text markup document.

      3 replies →

  • It's very annoying when applications use sqlite with encryption. The encryption extension is not open source, so the regular language bindings cannot be used.

    • I'm not sure how sqlite as your application file format has anything to do with open source, honestly. It's certainly super easy for open source projects to use, but a closed source proprietary application using an encrypted sqlite file sounds perfectly sensible to me?

      We all win when folks decide to leave it accessible, but I'm not going to hold "encrypting a file format so that people can't easily reverse engineer it" against folks who are trying to sell software.

      9 replies →

“Atomic transactions” is a feature that needs formal support in random file formats way more often than people realize. Simply writing to a file at all in a guaranteed-atomic way is much harder than it looks. That guarantee becomes important when your app gets widely distributed: if you have a million users of your free mobile app, one-in-a-million events happen every day. For example: a random hardware shutdown midway through a file write operation. How is your app going to react when it reads back that garbled file?

I’ve used SQLite on mobile apps to mitigate this problem. I’ve used LMDB on a cloud app where the server was recording a lot of data, but also rebooting unexpectedly. Would recommend. I’ve also gone through the process of crafting an atomic file write routine in C (https://danluu.com/file-consistency/). It was “fun”, if your idea of fun is responding to the error code of fclose(), but I would not recommend it...

  • Posix (and I think windows) guarantee this atomically and durably overwrites a file:

    tempfile = mkstemp(filename-XXXX)
    write(tempfile)
    fsync(tempfile)
    close(tempfile)
    rename(tempfile, filename)
    sync()

    Assume the entire write failed if any of the above return an error.

    In some systems (nfs and ext3 come to mind), you can skip the fsync and/or sync, but don’t do that. It doesn’t make things significantly faster on the systems where it’s safe, but it definitely will lose data on other systems.

    The only loophole I know of is that the final sync can fail, then return anyway. If that happens, the file system is probably hosed anyway.
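
A sketch of that sequence in Python (POSIX semantics; the directory fsync at the end is what persists the rename itself):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    # Temp file goes in the destination directory so the final rename
    # stays within one filesystem and is therefore atomic.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=os.path.basename(path) + "-")
    try:
        os.write(fd, data)
        os.fsync(fd)           # flush file contents to stable storage
        os.close(fd)
        os.replace(tmp, path)  # atomic rename over the old file
        dfd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dfd)      # persist the directory entry (POSIX-only step)
        finally:
            os.close(dfd)
    except BaseException:
        try:
            os.unlink(tmp)     # tmp may already be gone if the rename succeeded
        except FileNotFoundError:
            pass
        raise

atomic_write("settings.json", b'{"volume": 11}')
with open("settings.json", "rb") as f:
    print(f.read())
```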

    • You need a recovery step on startup to retry the rename if tempfile is complete, or delete it if it isn't.

      That means you need a way to verify that tempfile is complete. I do that by removing filename after completing tempfile. And that requires a placeholder for filename if it didn't already exist (e.g. a symlink to nowhere).

      On crash, rename may leave both files in place.

      This technique doesn't work if you have hardlinks to filename which should refer to the new file.

      1 reply →

  • I second the recommendation of LMDB, with one important caveat: under heavy write load it is a perfect demonstration of the brokenness of the semaphore implementations on FreeBSD and macOS.

  • > For example: random hardware shutdown midway through a file write operation. How is your app going to react when it reads back that garbled file?

    Don't filesystem journals ensure that you can't get a garbled file from sudden shutdowns?

    • They ensure you don't get a garbled filesystem.

      They also expose an API that allows you, if you're very careful and really know what you're doing (like danluu or the SQLite author), to write performant code that won't garble files on random shutdowns. But most programmers at most times would rather just let the OS make smart decisions about performance at the risk of garbling the file, or if they really need Durability, just use a library that provides a higher level API that takes care of it, like LMDB or an RDBMS like SQLite.

      To not get your file garbled, you need to use an API that knows about legal vs. illegal states of the file. So either the API gets a complete memory image of the file content at a legal point in time and rewrites it, or it has to know more about the file format than "it's a stream of bytes you can read or write with random access".

      Popular APIs to write files are either cursor based (usually with buffering at the programming language standard library level, I think, which takes control of Durability away from the programmer) or memory mapped (which realllly takes control of Durability from the programmer).

      SQLite uses the cursor API and is very careful about buffer flushing, enabling it to promise Durability. Also, to avoid rewriting the whole file for each change, it does its own journaling inside the file* - like most RDBMSs do.

      * Well, it has a mode where it uses a more advanced technique instead to achieve the same guarantees with better performance

      3 replies →

    • Not really. If you have a file format that requires, e.g., changes to be done in two places then it's reasonable to write to one place, have the system shut down never having written to the second place, and now have a corrupt file.

      The journal ensures (helps ensure?) that individual file operations either happen or don't and can improve write performance, but it can't possibly know that you need to write, e.g., two separate 20TB streams to have a non-corrupt file.

      3 replies →

  • Windows had a feature, Transactional NTFS (TxF), that provided snapshot-based transactions across the entire file system (NTFS-only, though). Unfortunately, it has been deprecated... it's such a shame that we can't seem to move on from the notion of a filesystem from 40 years ago, apparently.

I've been reading through many of the other pages on the SQLite website. I didn't realize that SQLite is developed by a company (Hwaci) and funded by selling support, licenses, and testing services. I had always assumed it was an open source project backed by a non-profit, similar to the Apache Software Foundation.

This is another very interesting example of an open-source business. I would be interested to learn more about how the Hwaci company operates (revenue, number of employees, etc.). I find this very interesting:

> We are a 100% engineering company. There is no sales staff. Our goal is to provide outstanding service and honest advice without spin or sales-talk.

They list "$8K-50K/year" and "$85K" price tags directly on their "Pro Support" webpage [1]. These would usually be behind a "Schedule a Call" or "Get Quote" button. I've been thinking about doing something similar with my own on-premise licenses and support contracts. I'm not very good at sales and I don't really want to hire a sales team, so I'd be interested to know how this has worked out for them.

I also liked this sentence, which is very similar to Basecamp's philosophy (and both companies were started around the same time - 1999 vs 2000):

> Hwaci intends to continue operating in its current form, and at roughly its current size until at least the year 2050.

It's interesting to think that SQLite could have raised money and grown into a billion-dollar public company with thousands of employees.

I'm going to listen to this Changelog interview with Richard Hipp now [2], and also this talk on YouTube [3].

[1] https://sqlite.org/prosupport.html

[2] https://changelog.com/201/

[3] https://www.youtube.com/watch?v=Jib2AmRb_rk

  • Also important to note is that enterprise pricing is way higher than you'd think because when an enterprise needs a solution, something like $1000 is literally too cheap to justify the cost of procurement (which involves not just getting a license, but also internal documentation, which in turn needs to be entered into whatever process certification programme is in place, etc.)

  • As someone who has had to evaluate and recommend enterprise software for purchasing, a straightforward price tag is going to make me view your product in a better light.

    Plus, I have to wonder how much extra profit the "call us" route actually takes in, after you've subtracted costs for the marketing staff it requires (especially if you try to renegotiate the cost after the subscription/license expires).

  • Thank you for writing up about the company behind SQLite. I never knew!

    I knew SQLite was well-respected by many, considered one of the best examples of software engineering. I just assumed that it was created by an individual or small team of brilliant minds - and developed/maintained by a user community - as such well-designed software often is.

    The company sounds great. Their approach to business is refreshing, and reminds me of a few other exemplary companies with principles, daring to tread their own path to success.

    As an addendum: Hipp, Wyrick & Company, Inc. (Hwaci) is based in North Carolina, USA.

    http://hwaci.com/

  • Yes, it's a very smart business model. SQLite is extremely useful at all levels; I get it for free (not even merely open source, they took pains to dedicate it into the public domain, which is part of why they don't take patches), and enterprises pay to have their procurement boxes ticked, legal documents signed, and so forth.

    The hypothetical Hwaci which tried to be a trendy billion dollar company would be so much worse, and you probably wouldn't even be able to buy support. No one would benefit but institutional money.

From experience, I can also recommend using SQLite as an application file format. I landed on SQLite after looking for solutions for a file format for an educational app we made for simulating biological neural networks. The app is cross-platform, written in Qt and the simulations needed to be stored as a JSON describing the network, a thumbnail and some metadata. It was also intended to be extensible with more features and backwards compatible if new versions were released. I considered creating our own simple format, using ZIP files, HDF5, Qt resource files or SQLite.

I landed on SQLite for many of the reasons outlined in this article and in particular because of how easy it was to implement and maintain. SQLite is supported natively in QtSql, which made it extremely easy to write the save and load functions, and later extend these with more data fields. In addition, we did not have to worry about cross-platform support since this was covered by SQLite and Qt already.

  • Interesting, I used HDF5 in a similar situation because we needed to save a lot of same-sized rows of data (simulation time steps), so a matrix-oriented format seemed to make sense. It wasn't entirely without some need for cross-referencing between tables, though, so it does make me wonder now if sqlite would have been a comparable or better choice. Any reason for rejecting HDF5 in your case?

    • Is it cheap to insert/update/delete data in HDF5? If not, that should be the answer. I'm also curious if Parquet would fit your requirements.

SQLite is amazing – trivial to embed in almost every context I've tried, rock solid, incredibly flexible, performs well enough for most use cases and for the others it probably still performs better than the solution I had in mind. I wouldn't dream of using anything else for application data, unless you need something text based for human readers as well, and can't deal with having people use a SQLite client. And even then, really consider whether you need this.

With SQLite embeddable in websites thanks to wasm, and the ability to create object URLs it's also pretty trivial to make full blown (read-only) SPAs delivered as a single HTML file. This last bit might seem crazy – and it kind of is – but if your clients are all on a LAN in a corporate network so bandwidth and latency aren't really an issue it makes a bit more sense.

I love SQLite, hands down one of my favorite tools.

I think SQLite is underrated.

With WAL mode + increasing the page cache you can get some excellent concurrency, even if doing reads and writes at the same time.
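
For reference, the pragmas involved look like this in Python's sqlite3 (the values here are illustrative, not tuned recommendations):

```python
import sqlite3

conn = sqlite3.connect("app.db")
# WAL lets readers keep going while a writer commits; it only takes
# effect on file-backed databases, not :memory: ones.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute("PRAGMA cache_size=-65536")   # negative value = KiB, so ~64 MiB
conn.execute("PRAGMA synchronous=NORMAL")  # common pairing with WAL
print(mode)
conn.close()
```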

With rqlite it's easy to make it a server database and have a cluster of SQLite databases (https://github.com/rqlite/rqlite).

I wouldn't try to create the new Instagram with it, but I think it'd be capable enough for many apps that are built on top of more complex DBs.

Amongst all the praise we're seeing for this approach here, may we please just have a minute's silence for the various exploits sqlite has seen when loading arbitrary database files over the years?

My recent favourite being https://media.ccc.de/v/36c3-10701-select_code_execution_from...

  • This concern was just raised on the SQLite Forum (probably after showing up here). See my reply at https://sqlite.org/forum/forumpost/8beceed68e for additional insights into the problem and recent SQLite enhancements to address it.

    • Absolutely, there are many things that can be done to mitigate this kind of thing, I'm merely advising caution over people immediately reaching for a solution where they're executing thousands upon thousands of lines of code over untrusted files when they can get away with something much simpler.

  • Interesting. That one only affects systems which add their own application-specific SQLite functions to the database after loading it.

    That's probably a niche case which doesn't affect most systems. But it'd be good to know about and watch out for in the systems that do.

  • Is there any way to mitigate code execution issues (besides just fixing bugs)? Because allowing code execution might be a pretty big deal-breaker in certain contexts.

  • This is a security flaw of the system or application using SQLite, though. It is trusting data it should not trust and executing code provided with that data.

    • If I use libpng or libjpeg to decode an image and that causes a buffer overrun or another security vulnerability, that is very much a bug in libpng or libjpeg.

      The same is true when I open a database in SQLite: if that causes a security problem, it's a bug in that library. I don't even see how you could validate a database file before you hand it over to SQLite.

      2 replies →

    • Huh? If you're using SQLite as an application file format, isn't opening untrusted files a key use case? What else, never share files with people? That must be a pretty boring application.

      4 replies →

While we are on the topic of file formats, I stumbled across a cool project the other day that aims to make parsing well-known binary formats much easier: http://kaitai.io/

In high school I was obsessed with the game Starcraft. It came with its own map editor that would save to its own proprietary format. Some smart people came along and reverse engineered that format and allowed us to do all kinds of neat things we weren't supposed to. I see modding communities for games are more popular than ever, and finding this project brought back lots of great memories.

I was surprised, but also intrigued when I discovered that Manga Studio (Clip Studio these days) used sqlite for the native file format, including the pixel planes.

I wish my predecessor at Krita had made that choice, instead of choosing to re-use the KOffice xml-based file format that's basically a zip file with stuff in it. It would have made adding the animation feature so much easier.

  • So isn't it possible for Krita to create an updating process and switch? Btw. it's really fantastic Krita is on Android now!

  • What's wrong with xml + zip?

      XML-based formats are export formats, not state-keeping formats. To use an XML-based format for storage, you need to have a separate, canonical in-memory representation of the data, which you then snapshot and serialize into XML upon request. You may or may not be able to get away with serializing less than your full in-memory object graph upon save, using techniques similar to DOM reconciliation. Either way, you'll still likely need your entire document/project represented in memory.

      If you're working with something analogous to a text document, this snapshot-and-serialize approach to saving works fine. If you're working with other types of data, though, this approach only works for trivial projects; once your document exceeds ~100MB, the overhead of snapshotting+serializing your object graph becomes bad enough that people stop saving very often (dangerous!), and it also makes the saving process itself more fragile (since the longer a save takes, the more likely it becomes that the process might be killed by some natural event like a power cut during it†.)

      And, once your project size exceeds the average computer's memory capacity, an in-memory canonical representation quickly becomes untenable. You start to have to resort to hacks like forcing the user to "partition" their project, only allowing the user to work with one piece at a time.

      With an application state-keeping format, you have none of these concerns; the store is itself the canonical data location. You don't have a canonical in-memory representation of the data; the in-memory representation is simply a write-through or write-back caching layer for the object graph on disk, and the cache can be flushed at any time. Or you may not have a cache at all; many systems that use SQLite as a file format just do SQL queries directly whenever they want to know something, never instantiating any intermediate in-memory representation of the data itself, only retrieving "reports" built from it.

      † You can fix fragile saving with a WAL log, but now the WAL log is your true application state-keeping format, with the XML format just being a convenient rollup representation of it.
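
As a concrete illustration of the store-is-canonical approach (the schema here is made up):

```python
import sqlite3

# The "document" is the database itself; saving one edit touches only
# the rows involved, never the whole object graph. Hypothetical schema.
doc = sqlite3.connect(":memory:")
doc.execute("CREATE TABLE layers (id INTEGER PRIMARY KEY, name TEXT, pixels BLOB)")
doc.execute("INSERT INTO layers (name, pixels) VALUES ('background', zeroblob(1024))")

# "Saving" is a transaction over just the affected layer; nothing else
# in the (possibly huge) file gets rewritten.
with doc:
    doc.execute("UPDATE layers SET pixels = ? WHERE name = 'background'",
                (b"\xff" * 1024,))

# Reports are queried on demand rather than walked from an in-memory graph.
count, total = doc.execute(
    "SELECT COUNT(*), SUM(LENGTH(pixels)) FROM layers").fetchone()
print(count, total)
```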

      4 replies →

    • One of the famous SQLite refrains is:

      > SQLite does not compete with client/server databases. SQLite competes with fopen().

      This undersells SQLite somewhat. Like Berkeley DB, SQLite was created as an alternative to dbm [1] and one of the main use cases is safe multi-process access to a single data file, typically CLI apps (written in C, TCL, etc.).

      Client-Server databases tackle multi-user concurrency while embedded databases often tackle multi-process concurrency.

      This article has long been part of the SQLite documentation found under the "Advocacy" section. There is also a short version. [2]

      [1] https://en.wikipedia.org/wiki/DBM_(computing)

      [2] https://sqlite.org/aff_short.html

      I don't personally know anything wrong with it, but SQL queries are powerful powerful powerful, so although XML might not have any huge issues, in my personal opinion it is still sub-optimal compared to what is possible with SQL (at least for my uses, which could be very different to yours, so take this with a grain of salt). In particular, SQLite is a tiny executable with a HUGE amount of tests and is used very widely, so it's reliable for a lot of tasks as long as the data size stays relatively small and you cast your types properly.

      SQLite also has a lot of tutorials and books and language support. I've used it with Python, C#, Perl, TCL, and Powershell with no issues. You can access it via the command-line or you can hook into it with a fully graphical SQL IDE like DB-Visualizer (I really recommend using an IDE for interactive SQL use). If your language doesn't have built-in or library support, even a novice programmer like myself can roll a few functions together to build the tables, update, delete, and run queries to analyze the data if you can run some system commands. It's a wonderful little technology that I feel comfortable reaching to when I need it.

      One thing I've shied away from over the years are technologies which require running complex installers as it makes things more confusing and makes it harder for me to share with colleagues that aren't as interested in programming. Both SQLite and DB-Visualizer require zero installation. I just put each in a folder and then run the executable. This is really easy to use to me and easy to get others started too. Note that this is not commercial software, but internal business apps to help people do complex tasks easier. So you have a script that does some data processing pushes that data to SQLite and then the user can bring up DB-VISUALIZER, point it to the little SQLite .db file I created and then get to work. We have a lot of little apps like this and since most of our engineers are really good with SQL, they can do whatever they need efficiently.

    • It's really hard to update in place, we basically have to rewrite the whole file after a change, and, of course, we've got to do a lot of things ourselves, like inventing an xml format for describing the image.

Recently I've been using SQLite to store state for a data-syncing CLI utility in Python where I was considering using CSV or JSON at first. The Python 'Dataset' library, https://dataset.readthedocs.io/en/latest/ has taken almost all of the friction out of using SQLite for trivial state persistence.

I don't have to manually manage schema or create tables - they're all lazily created from Dict keys on insertion, but indexes and SQL can still be used as desired.

This lends itself really well to Jupyter notebook assisted development, where functions can be quickly and interactively iterated without having to muck around with existing tables whenever data changes shape.

It's been a real productivity boost, and I've been looking around for something similar to use in Clojure.
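
The part Dataset automates can be sketched in a few lines of plain sqlite3 (a rough imitation, not Dataset's actual implementation; it also skips details like adding columns for keys that appear in later inserts):

```python
import sqlite3

# Derive the table schema lazily from the dict being inserted, so the
# caller never writes CREATE TABLE by hand. Hypothetical helper, not
# the Dataset library's real code.
def insert_row(conn, table, row):
    cols = ", ".join(f"{k} TEXT" for k in row)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    marks = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO {table} ({', '.join(row)}) VALUES ({marks})",
        tuple(row.values()),
    )

conn = sqlite3.connect(":memory:")
insert_row(conn, "sync_state", {"item": "doc-1", "status": "synced"})
rows = conn.execute("SELECT item, status FROM sync_state").fetchall()
print(rows)
```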

  • Thanks for posting this, it's exactly what I was looking through this thread for.

How would SQLite go for something like Audacity projects? Audacity projects are a pile of files: an XML file that declares the project metadata, tracks, channels, envelopes (an example of non-destructive editing), and references to the zillions of audio files that it drops in a directory beside the XML file (typically less than 1MB each).

I presume it splits all the audio up into small files so that most types of edits can only need to touch a small area.

If you directly ported that to SQLite, would it work fairly well, or would you want to restructure it somehow? Things like additions or deletions, would it need to write lots of extra data to the disk (would it be doing something like defragmenting, or would it grow larger than it should, or are there other tricks that I don’t know about to delete a chunk from the middle of a file without needing to rewrite all of the file beyond that point)?

  • Adobe Lightroom seems comparable (non-destructive image editing.) It uses a sqlite database that points to image files on the filesystem. I think precisely the same approach could be taken with an audio editing application.

    You could also consider putting the audio files into the sqlite db, which might work alright. I've heard of image thumbnails being stored in sqlite dbs (maybe by Lightroom iirc?) though those are probably smaller than your audio clips I'm guessing.
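
One way the pile-of-small-clips layout could map onto SQLite rows (the schema and chunk size are illustrative, not what Audacity actually does):

```python
import sqlite3

# One row per fixed-size chunk of samples. Deleting a region of audio is
# then a row delete; SQLite puts the freed pages on a free-list and
# reuses them for later inserts, so the rest of the file isn't rewritten.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (track INTEGER, seq INTEGER, pcm BLOB)")
db.executemany(
    "INSERT INTO chunks VALUES (0, ?, zeroblob(4096))", [(i,) for i in range(4)]
)

with db:  # delete a chunk from the middle, then close the numbering gap
    db.execute("DELETE FROM chunks WHERE track = 0 AND seq = 1")
    db.execute("UPDATE chunks SET seq = seq - 1 WHERE track = 0 AND seq > 1")

seqs = [s for (s,) in db.execute("SELECT seq FROM chunks ORDER BY seq")]
print(seqs)
```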

Shouldn't this have (2014) in the title? The story has also appeared on HN several times before.

Very interesting concept, but now I think an application file format using TileDB might be even better, since it can support sparse data as well [1].

[1] https://github.com/TileDB-Inc/TileDB

We transitioned the file format for the images in our climbing-topo app from a pile of files to SQLite (the images are tiled).

Going from having high hundreds of tiny files on disk per topo photo to just one was an incredible boon to productivity for things like data backup and transferring files onto the device for testing.

  • Tell me more about your app

      It acts as a digital window into our print guidebook catalogue. The books are mostly for the UK, with the rest spread across Europe.

      I take the finished desktop-publishing documents and extract and package the data to put into the app, including tiling the images so we can have very high resolution photos on low-end devices.

      There's a semi-technical article about the topo-view implementation here:

      https://www.ukclimbing.com/articles/features/rockfax_app_dee...

Years pass, now decades, yet SQLite is still the most pleasant DB to work with.

  • Agree, but it also side-steps some complex topics like sharding, replication, high-availability, etc. I’m not suggesting those topics should be addressed by SQLite; that simplicity is an asset.

    Has anyone used SQLite remotely over a network?

    • It lives in your process. It has no network features. It's not a DBMS.

      People have built layers over it though

    • I think the main reason SQLite doesn't have them is because it's not a goal for the project. SQLite would be monumentally more complex if it supported networking, sharding, etc. If you need those use cases then perhaps it's a good idea to switch to a more scalable DBMS. My favorite is PostgreSQL. If I'm building something which will have user-generated content then I always pick Postgres. Otherwise, SQLite.

    • I am working on a flash card app and specifically did not want to have to create a SaaS to support cross-device data sync, so I opted instead to create a journal of the data mutations on each device and have those (i.e. it's a CRDT) uploaded to a simple file store (like Dropbox) for each device to share. Each device stores its own local data using SQLite, but does a merge on all the journaled commands to stay up to date.

      So, basically I rolled my own sync solution that just happens to use SQLite. It's worth noting that I started this project by storing the data in my own format (JSON, at first) and quickly realized it was growing too large and was taking too long to serialize/deserialize. I'm very glad to have ended up using SQLite instead because it is super easy to use and has been reliable.
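
A toy sketch of that journal-merge shape (the schema and ops are invented, and real CRDT merge logic is more involved than a clock-ordered replay):

```python
import sqlite3

# Each device appends (logical_clock, device, op) rows to its journal;
# syncing means unioning journals and replaying ops in clock order.
def make_journal(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ops (clock INTEGER, device TEXT, op TEXT, "
        "PRIMARY KEY (clock, device))"
    )

a = sqlite3.connect(":memory:")
b = sqlite3.connect(":memory:")
make_journal(a)
make_journal(b)
a.execute("INSERT INTO ops VALUES (1, 'phone', 'add card 42')")
b.execute("INSERT INTO ops VALUES (1, 'laptop', 'add card 7')")

# "Download the other device's journal" and merge, skipping duplicates.
for row in b.execute("SELECT clock, device, op FROM ops"):
    a.execute("INSERT OR IGNORE INTO ops VALUES (?, ?, ?)", row)

replay = [op for (op,) in a.execute("SELECT op FROM ops ORDER BY clock, device")]
print(replay)
```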

    • When Subversion changed their working copy format to rely on an SQLite database, I had a number of problems using working copies on network file systems.

      No doubt this was an indication that the network file system was incorrectly configured, but I think the fact that it worked in practice with the file-based format and didn't with the SQLite format is a strike against the idea of using SQLite for what the user sees as saving a file.

      6 replies →

    • If you need sharding and replication, just use a different database. SQLite is not intended to solve those problems.

      As for high-availability, isn't a single file on your disk the most available thing there is? :)

What happened to storing blobs in a database being a giant rookie mistake? I'm currently dealing with this at the office, where a SQL Server database has grown to an unmanageable size from helpdesk-software attachments.

  • That's the thing: this isn't "a database" anymore when it's used as an application file format, it's now "an application file format" that is conveniently also a sqlite db file so everything in it is trivially accessible.

    Storing huge blobs then becomes a matter of "is this my data, or is this general data that I'm merely also making use of". Examples abound of both:

    - Adobe Lightroom (a non-destructive photo editor) goes with "these are not my images" and leaves them out of its sqlite "lrcat" files, instead maintaining a record of where files can be found on-disk.

    - Clip Studio (a graphics program tailored for digital artists) on the other hand stores everything, because that's what you want: even if you imported 20 images as layers, that data is now strongly tied to a single project, and should absolutely all be in a single file.

    So the key point here is that, yes: sqlite files are databases, but because they're also single files on disk, their _use_ extends far beyond what a normal database allows. As an application file format, the rules for application files apply, not the rules for standard databases.

  • Storing your blobs outside of the database when they are an inherent part of some larger transactional model is the giant rookie mistake IMO. We did this for a little while for reasons of "oh my gosh, look how big the .db is getting" and it ended up a complete shitshow. Simple tasks like cleaning up old entries can quickly turn into a nightmare if you have to go around the disk and make sure you are not leaving garbage behind. There are also concerns outside the box, like VM snapshots, which could leave you in a very inconsistent state regarding what blobs you have on disk vs. what the database is expecting.

    Is having everything in one gigantic .db file an actual downside? What makes a database's size "unmanageable"? Presumably, you'd have to store that business information somewhere anyway, right? I don't understand how one unified file is unmanageable, but splitting that concern into two different domains magically makes it better.

    • It can trash db performance. A relatively simple alternative is to store references to immutable blobs in the db so that everything remains transactionally consistent (and easily restorable to any point-in-time). This is what many large companies do. I happen to work at a company that runs the largest + most "shall not be named" databases and has struggled for years to evict all the blobs people shoved in em.

      1 reply →

  • Did you read the article? This isn’t about storing blobs as data in SQLite, but rather using SQLite-formatted databases as the file format for an application.

    As a simple example, Word documents are just zipped XML text files (try unzipping a .docx and looking inside). Instead of this, you could use a SQLite .db file (probably with a different extension), translating the XML files into tables, and folders into databases. The OpenOffice case study has more details: https://sqlite.org/affcase1.html

    • Apparently I'm misreading the following paragraph.

      Any application state that can be recorded in a pile-of-files can also be recorded in an SQLite database with a simple key/value schema like this:

          CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB);
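
      The schema quoted above is enough to round-trip a pile-of-files; a minimal sqlite3 sketch (the file names and contents are made up):

```python
import sqlite3

# Mirror a pile-of-files into the key/value schema quoted above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB)")

# "Save": write several files atomically in one transaction.
with con:
    con.execute("INSERT INTO files VALUES (?, ?)",
                ("settings.xml", b"<settings/>"))
    con.execute("INSERT INTO files VALUES (?, ?)",
                ("images/logo.png", b"\x89PNG"))

# "Open": read one file back out by name.
data = con.execute("SELECT content FROM files WHERE filename = ?",
                   ("settings.xml",)).fetchone()[0]
```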

      2 replies →

    • The OpenOffice case study is a fascinating thought experiment.

      We don't really create too many new file formats these days, and if we did they're highly performance specialized (parquet, arrow).

      Just wondering aloud, what recent file format would have benefited from being based on sqlite?

  • Storing blobs is good if the blobs are what you actually want to store; the rookie mistake is storing blobs instead of the well-structured, meaningful, indexable, and much more compact columns and additional tables that you should be storing but are too rookie to design.

  • I don't think there is a consensus that storing blobs in the database is always the wrong decision. Yes, there are drawbacks, but it really depends on the circumstances, and it might be the right choice.

    Suppose, for example, that you were going to make your own implementation of Git. The normal implementation has a bunch of blobs under one directory. There's not much use in manipulating this directory separately from everything else under .git. The blobs and most of the non-blob stuff are only useful together anyway, so you already manage them as a single collection.

    It could create problems for certain backup tools, so that's a disadvantage. But it also simplifies application development. So it's not a slam dunk one way or the other.

    • Relatedly, this is (from what I understand) exactly how Fossil works for version control; it's literally a single SQLite file storing blobs representing changes to the repository.

  • I've never faced this problem so perhaps I'm missing something obvious, but what's the big difference storing binary blobs on the DB vs, say, storing URLs on the DB and hosting the blobs elsewhere? Is the size of the database the main concern?

    • Both approaches have their advantages.

      Storing blobs on the DB makes the data more consistent (the database is always in sync with the file content), and allows you to use advanced features of the DB on these files (transactions, SQL queries, data rollback ...). You also only have to backup the DB.

      Storing links to the objects is usually more scalable, as your DB will not grow as fast. DBs are usually harder to scale, and also more expensive (at least 10x per GB).

      It really depends on the project, but I'm favoring more and more storing the data in BLOBs, as it makes backups easier and as I can use SQL queries directly on the data. Databases as a service also make it easy to scale DBs up to +/- 2TB. But the cost might still be an issue.

      3 replies →

  • I think that refrain usually applies to when the database is used to store relational data, rather than when it's being requisitioned for use as a file format.

I used to advocate for this, then tried opening a sqlite file I'd been using on another computer and it said it was corrupted. I took it back to the original computer and it opened just fine. So there was something going on that made it more complex and less portable than something simple like json, which is super frustrating because otherwise it's a really useful format for datasets, and now I can't really trust it.

  • If you use certain new features of SQLite, the database file gets "upgraded". At that time, you must use a version of SQLite that understands these new features.

    So, the other computer you used was probably running an older version of SQLite. Just update it to make it work.

    • I think the computer it worked on was one with an older version of SQLite rather than a newer one, but I’m not certain. It was a while ago. I thought I’d checked for the upgrade issue, but my memory is shit.

Quickbooks for Mac uses sqlite for its file format and this allowed users to query the books with any SQL browser!

Alas, they encrypted the file starting two years ago.

The article has great points.

The main reason for the existence of a lot of file formats is that enterprises don't want just anyone accessing and modifying their files. That would greatly reduce the usage of their proprietary software, and hence their revenues.

I remember the days when OpenOffice tried to render .doc files but the formatting used to suck big time.

Open source software should leverage this kind of file format for interoperating.

  • > The main reason for the existence of a lot of file formats is that enterprises don't want just anyone accessing and modifying their files. That would greatly reduce the usage of their proprietary software, and hence their revenues.

    I'm sure this does happen, but it seems more like MBA-paranoia than a legitimate concern. For instance I sincerely doubt Adobe has lost any revenue by using SQLite with lightroom, despite various open source tools being able to interact with their lrcat (sqlite) files.

  • it's the old business paradox: if everything were totally lean and solved, how would people earn a living? You need a strong force to overcome this hurdle and show how to reorganize the system in the large.

SQLite is a fine single-file application file format, but if you want average users to be able to inspect and perhaps fix application data when things go wrong, a tree of JSON-format files is more friendly.

True, this requires care to ensure that such files are updated reliably, but that's not quite rocket science.

Is anyone doing this in Javascript (via electron) and encrypting the file in such a way that the data is protected and not an obvious sqlite db?

I have some ideas on how to do this, but I'm curious if there's a "preferred" way to do it.

  • SQLite has encryption plugin support, or you could encrypt the file yourself at rest. However, no matter what you do, since the application needs to be able to decrypt the file, you have to ship the key, and thus the protection can be circumvented.

    It’s really not worth the effort IMHO

  • Why would you want to "protect" the data in this way? It's user-hostile, and anyone determined enough will be able to get at it no matter what you do.

    If it's popular and valuable enough, instructions and/or code to break it automatically will then be published, regardless of how much money or time you invested into the protection. (For proof, look no further than the game industry's DRM over the past 30 or so years)

    • It's not meant to be user hostile, it's for a game. I allow data exports, but I don't want to expose the inner workings of the game via direct access to the DB.

      5 replies →

I used it with a different extension to save application data for a product I wrote.

Any classes that needed to be saved had serialize() and deserialize() functions: serialize before saving to SQLite, and deserialize after reading from the DB.

Maybe. I've used XML as a file format for all my recent projects. It's well supported in the .NET Framework, and I don't need any 3rd-party assemblies. Using SQL to write out an object graph would involve more work.

  • A tree I could understand since after all XML uses a tree-like structure, but for a graph you'll need to store references anyway so flattening and denormalizing it completely doesn't sound like a lot more work to me.

    Also XML makes me weep and I'll never willingly opt to use it if there's an alternative, but that's my own prejudice.

  • It depends on how large and complicated your XML file is.

    Once your XML file hits a certain size, minor updates incur a gigantic performance penalty because you have to write out the entire XML file every time.

    Random writes in the middle of an XML file are impossible. For example, if you were to change an attribute so that it's one character longer, you still have to rewrite the remainder of the file in order to shift everything over by one character.

    That's the main reason why SQLite is so popular for applications.
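
    To make the contrast concrete, here is a minimal sketch (the table is hypothetical, standing in for a document's attributes): growing one value only dirties the page(s) holding that row, rather than shifting the rest of the file over.

```python
import sqlite3

# Hypothetical attribute table standing in for an XML document's attributes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE attrs(elem TEXT, name TEXT, value TEXT)")
con.execute("INSERT INTO attrs VALUES ('doc', 'title', 'Draft')")

# Making a value one character longer is a small page-level rewrite,
# not a rewrite of everything after it in the file.
with con:
    con.execute("UPDATE attrs SET value = 'Draft2' "
                "WHERE elem = 'doc' AND name = 'title'")
```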

    (I know that from personal experience. I had to support an old version of an application that constantly wrote out XML, while the new version that we hadn't shipped used SQLite. A customer that made heavy use of the old version basically hit the limits of XML, but the version that used SQLite wasn't ready for them yet.)

    But, I'm going to be honest here: We had some tables that only had a few rows, so I moved those to XML files. QE really liked it because it was easy to diagnose issues.

    • I think it depends on whether you need a file format or a database. All the benefits of a database are wiped out if you are implementing File->Save for a document based application.

      1 reply →

  • Depends on the size or structure of the data.

    If you're having to do things like updates over a large dataset, SQLite can be nice because of the performance boost from indexing.

A lot of well-known apps are doing exactly this, using a sqlite db as their internal file format: WhatsApp, WeChat, Google Keep, Apple iMessage, Notes, Calendar...

Probably tons more if you care to dig around.

I worked with an SQLite Database on a personal C#/WinForms app. It was a bit complicated to get started, but afterwards, what a handy trick in comparison to using something like the discontinued SQL Server Compact.

I did, however, "enhance" it a bit (or "proprietarize" it) by encrypting it with a short password and an AES algorithm with some uncommon settings. I never shipped it that way, but the output file looked like any other proprietary app's file - a mess of random symbols.

I just converted a 90's-vintage xBase app to SQLite. I just love it. Should have done this years ago.

The only thing I miss is the ability to jump X% into an index. Not a deal-breaker, though.

Everything old is new again. I can recall a similar movement at the end of the '90s regarding databases as application file formats. At that time it was the Microsoft Jet database engine.

Concurrent access to SQLite is the main reason I haven't really looked into Deno. For that matter, I really think that WebAssembly file system implementations should have enough flexibility to be able to work with a SQLite database file in the host OS.

The needs of concurrent SQLite access pretty much cover the need for a robust file system.

I keep switching between CSV, parquet, and SQLite. I don't need another article to encourage me to switch again! :)

Does SQLite have a more text-friendly storage format? That would enable text editors to do edits/modifications, which is critical for debugging, testing, and learning. Save files don't need the same ongoing access optimizations, or the size optimizations of wire protocols, so a less efficient but text-editor-friendly format would help this cause.

Also, does SQLite have native libraries for the major languages to read and write it? XML (ugh), JSON, and YAML have managed to get decent implementations in almost all major languages.

  • Using a text editor to mess with a SQLite database doesn't make any sense. Do you use text editors to edit, say, MPEG media?

    SQLite databases can be edited with generic GUI and command-line tools, both SQL-based and tabular editors, which are safer and more convenient than a text editor could ever be.

When I was reversing the XMDF E-Book format, I was surprised to find an embedded SQLite database... in an E-Book?? After reading this article, it makes more sense why someone would make that decision

One minor point that needs to be raised: Using SQLite as your application file format locks you into the SQLite source code.

(And in most cases that's not a bad thing, it's free and open source.)

But, if you truly want an open file format, someone needs to be able to independently write a program that can read your file without relying on third party dependencies. This is why the browser vendors decided not to put SQLite into the HTML and JavaScript standards.

It is an awesome product! I've worked with it for over a decade and I'm a fan!

  • > it's free and open source

    It's more than "open source" it's in the public domain.

    The file format is fully documented and if someone like ISO or ANSI wanted to, they could make a standard out of it. It's also forward compatible since inception and versioned.

    The browser vendors decided not to put SQLite into their browsers because "key/value good, SQL bad" and "not invented here".

    IndexedDB is a clumsy reinvention with minimal ACID properties and isn't far advanced from ISAM. They could have used the SQLite file format and implemented IndexedDB as an API over the top. They could have allowed both standards to be implemented and then let reality take its course to choose which one was successful.

To my very limited knowledge it is very common for iOS and android apps to store data in SQLite.

Is it possible to extract or change these databases?

I have a few (offline) apps on my phone that I’d love to append data to

  • As a sibling comment says, if you are on Android and have root access, it is very easy.

    Just have a look at /data/data/appname/.

    For example, this is what I copy to make backups of my contacts on my Android phone:

        # pwd
        /data/data/com.android.providers.contacts/databases
        # ls -1
        calllog.db
        contacts2.db
        profile.db

    This is a simple way for moving data around, restoring applications, performing backups, editing your data (if you know your way inside the app).

    Beware of SELinux labels if you are moving files across different devices, as recent Android versions run with SELinux in enforcing mode.

    Just be sure to adjust them (useful commands if you need a reference: ls -lZ, chcon, semanage).

  • For unjailbroken iOS the only way to extract these databases is to make an unencrypted device backup in iTunes (Finder in Catalina) and then look at the backup file. The backed up files won't be conveniently named but you can still find it.

  • It's possible on Android if you have root access, I think. Android is very serious about apps not accessing each other's data.

  • It's also super common on macOS. According to lsof(1), my laptop has 95 sqlite .db files currently open; deduping by path shows 56 unique database files open at the moment.

Would such an app hold the sqlite database open while the user has the document "open", and live-write user changes back to the database immediately? Or would it follow the traditional model of the user choosing to "Save"?

I worked on an app that did the former many years ago (to an Access database, not sqlite), and it did not go well because this broke user expectations on the usual "open/save/save as" model.

  • The save model at the UX level is completely orthogonal to the application file format. You could implement either model with an SQLite DB file.

  • One quick-and-dirty way to maintain the traditional model would be to copy the on-disk DB into an in-memory DB, then make any changes in memory; when the user clicks Save, the application would then move the old on-disk DB, open a new one with the old's original filename, copy everything from the in-memory DB into the new on-disk DB, and delete the old on-disk DB.

    Another option would be to keep both the in-memory and on-disk copies open, and then update the on-disk version in-place with the in-memory version's data when the user clicks Save. SQLite has built-in support for connecting multiple DBs (such that you can query both in the same statement) to make this straightforward.
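
    A minimal sketch of the first approach using Python's sqlite3 backup API (the file name is hypothetical):

```python
import sqlite3

# Open: load the on-disk document into an in-memory working copy.
disk = sqlite3.connect("document.db")
mem = sqlite3.connect(":memory:")
disk.backup(mem)

# Edit freely in memory; nothing touches the file yet.
mem.execute("CREATE TABLE IF NOT EXISTS notes(body TEXT)")
mem.execute("INSERT INTO notes VALUES ('unsaved edit')")

def on_save():
    mem.commit()      # flush any open transaction in the working copy
    mem.backup(disk)  # Save: overwrite the on-disk DB with memory state
```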

  • I think the former is preferable and most modern users will prefer it. However, if your users are upset about it, you could do the latter, or you could do the former and give them a dummy/placebo "save" option that does nothing, or maybe VACUUMs the DB.

    • The problem is not Save, the problem is when you close the app without using Save - the user will expect that to work like Cancel and discard all session changes.

  • I guess you could send changes to the DB on-the-fly inside a transaction, so when the user clicks "save" it's just a matter of running COMMIT? Not sure what effect that would have on performance as the transaction grows, though.
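
    A sketch of that COMMIT-on-save pattern (the schema is invented); the nice side effect is that close-without-save becomes a plain ROLLBACK, matching the usual Cancel expectation:

```python
import sqlite3

# Manual transaction control: autocommit on, we BEGIN/COMMIT ourselves.
con = sqlite3.connect("doc.db", isolation_level=None)
con.execute("CREATE TABLE IF NOT EXISTS notes(body TEXT)")

con.execute("BEGIN")                       # document "opened"
con.execute("INSERT INTO notes VALUES ('edit 1')")
con.execute("INSERT INTO notes VALUES ('edit 2')")

def on_save():
    con.execute("COMMIT")                  # Save = commit the session
    con.execute("BEGIN")                   # start the next session

def on_close_without_save():
    con.execute("ROLLBACK")                # discard unsaved changes
```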

  • I think it would depend on the app (e.g. how document-like it is), but nowadays people will be much more used to the instant-save model due to things like Google Docs, OneNote, and phone apps.

Where can you use SQLite?

Embedded Systems: Yes

Raspberry Pi: Yes

Mobile Apps: Yes

Desktop Apps: Yes

Browsers: No

Servers: Yes

Supercomputers: Yes

  • You used to be able to use it in a browser with Google Gears.

    The reason why it was never adopted was because the browser makers wanted to be able to independently implement their own database instead of everyone having to use the same source code.

They say that Microsoft Word, PowerPoint, etc are "fully custom formats", but actually those formats are "wrapped pile-of-files formats" (and 7-Zip can open them), although the contents of the files inside are custom formats (and they seem to have failed to consider that category).

The application ID number in the SQLite header can be used to identify application file formats. The application ID number is a 32-bit number, and there have been a few different ways to handle it; I have seen the use of hexadecimal and of ASCII; I used base thirty-six, and I have then later seen the suggestion to use RADIX-50. Additionally, there is a document about "defense against dark arts" in case you need to load untrusted files.
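
For illustration, setting and reading the application ID from Python (the ID value here is made up, not any real format's):

```python
import sqlite3

# The application_id is stored at offset 68 of the database header,
# so tools like file(1) can identify the format without opening the DB.
APP_ID = 0x54435244  # hypothetical four-byte ID

con = sqlite3.connect(":memory:")
con.execute("PRAGMA application_id = %d" % APP_ID)  # pragmas can't take ? params
(stored,) = con.execute("PRAGMA application_id").fetchone()
```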

TeXnicard uses a SQLite database file (with application ID 1778603844) for the card database file. The version control file (optional, and not fully implemented yet) uses a custom format (which is fully documented), and it does support atomic transactions. It consists of a header followed by a sequence of frames, which are key frames and delta frames. The header of the version control file contains two pointers: one to the beginning of the most recently committed key frame, and the other one to the end of the most recently committed frame (whether a key frame or a delta frame; if all frames are fully committed, this will be equal to the length of the file). These pointers are written only after the rest of the file is written; if it gets interrupted, reads will ignore the partially written data, and further writes will overwrite the partially written data.

ZZ Zero uses a Hamster archive of custom (but documented) formats as its world file format. (A Hamster archive is zero or more "lumps" concatenated together. A lump consists of a null-terminated ASCII filename, 32-bit PDP-endian data size (measured in bytes), and then the data of that lump. The preceding text in these parentheses is the full definition of the Hamster archive format; you can use this to implement your own.)

Free Hero Mesh uses a "pile-of-wrapped-pile-of-files" format. A puzzle set consists of four files: .class (which stores class definitions), .xclass (which stores pictures and sounds to be used by the class definitions), .level (which stores levels), and .solution (which stores solutions). The .class file is a plain text file; the other three are Hamster archives. These are four logically distinct parts of a puzzle set; this allows you to split them apart, to create symlinks to share class definitions with puzzle sets, to substitute your own graphics, to work with multiple solution sets (e.g. per user), etc. If you need to do more than that, then you can of course extract the lumps if needed. For class definitions, you can just copy and paste the text.

MegaZeux used a fully custom format before, but now it uses a ZIP archive with the stuff inside being custom formats (one of which is the "MegaZeux Property List" format, which I have documented in the Just Solve The File Format wiki; the authors of MegaZeux did not seem to document this format themselves anywhere, so I figured it out and documented it by myself).

For some cases, an SQLite database is a good application file format; other times, I think other formats (such as text formats) may be better. It depends on the application. XML is too often used for stuff that isn't text markup, which is exactly what XML is especially bad at, I think.

If you use SQLite though, you will get more than just the database access. It also gives you the string builder functions, the sqlite3_mprintf function, a page cache implementation, memory usage statistics, and a SQL interpreter; the SQL interpreter can be used as one way to allow user customization and user queries (including batch operations), without having to make an entirely new scripting language to embed.

They mention that interfaces to SQLite are available for many other programming languages, although at least one that doesn't seem to have an interface to SQLite is PostScript (you can use %pipe%, but that doesn't work so well, especially since it is only a one-way pipe), and I am not sure if awk has one either.

  • .doc, .ppt, &c. were fully custom. .docx, .pptx, &c. are ZIP files. Remember that this was written in 2014, and many people were still using the old formats then, because many people didn’t have Office 2007 or newer or software that could cope with this new format. (Now I think very few people still use the old Office formats.)

    • No, .doc, .ppt, etc are wrapped pile-of-files formats, although not using a ZIP container. I also believe Visual Pinball uses the same container format as Microsoft Word. 7-Zip just calls this format "Compound", although there are other names for it, such as "DocFile" and "POIFS". I just tried such a file now, and got the following listing from 7-Zip:

           Date      Time    Attr         Size   Compressed  Name
        ------------------- ----- ------------ ------------  ------------------------
                            .....        35842        36352  WordDocument
                            .....          106          128  [1]CompObj
                            .....         4096         4096  1Table
                            .....         4096         4096  Data
        2007-12-25 22:33:00 D....                            ObjectPool
                            .....         4096         4096  [5]DocumentSummaryInformation
                            .....         4096         4096  [5]SummaryInformation
        ------------------- ----- ------------ ------------  ------------------------
                                         52332        52864  6 files, 1 folders

Is there any good way to use SQLite as an application file format while also using it in WAL mode? They seem sort of mutually exclusive; near as I can figure the way to do this would be to execute `PRAGMA journal_mode=DELETE` when the user requests a save file, then copy the DB to the location the user specifies. I think this negates a few of the advantages of using sqlite as an application file format though.

  • I think the backup api or VACUUM INTO are the preferred ways, though you're gonna probably want to turn off wal on the exported copy still.

    • Shrinking the database is the equivalent of saving the file when the application shuts down or saving an extra copy to make a backup.
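
      The VACUUM INTO route mentioned above (SQLite 3.27+) is essentially a one-liner; the paths here are hypothetical:

```python
import sqlite3

# Live working database in WAL mode.
con = sqlite3.connect("autosave.db")
con.execute("PRAGMA journal_mode=WAL")
con.execute("CREATE TABLE IF NOT EXISTS doc(body TEXT)")
con.execute("INSERT INTO doc VALUES ('hello')")
con.commit()  # VACUUM cannot run inside an open transaction

# Write a compact, self-contained snapshot for the user's "Save As";
# as noted above, you may still want journal_mode=DELETE on the copy
# so no -wal/-shm sidecar files tag along with the saved file.
con.execute("VACUUM INTO 'save.db'")
```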

are there any other "file format oriented" data storage systems? I see software tends to use zip files, sqlite, or good old OLE structured storage.

In my job we used a FAT filesystem as a storage system and recently switched to sqlite, and while I love it, I didn't really find any alternative.

Are we talking about good old fashioned CRUD applications here or is this something different?

Is SQLite able to handle multiple processes that want to read or write to the file database at the same time? Or nearly at the same time.

They say that SQLite is more in competition with fopen(), for file open, than with a true RDBMS.

Please no. This is an incredibly complicated file format that is not particularly well suited for file saving. It's also very slow ((edit: correction) ~10 transactions per second) if you aim for integrity, or unsafe if you tune for performance (~50k transactions per second, but if your program or computer dies half way, your file is hosed). So keeping files up to date in place makes your UI janky.

You can work around it by working in memory and writing out a whole new database from in memory structures on save and then do the atomic rename. But if you do that, you are probably better off with json, protobuf, or similar. The libraries around these formats are similarly battle tested, but they fit the needs better, supporting working in ram fully and then saving cleanly and easily.

  • > Please no. This is an incredibly complicated file format that is not particularly well suited for file saving. It's also very slow (~100 transactions per second) if you aim for integrity, or unsafe if you don't.

    The kind of application files they are talking about (things like word processor documents, spreadsheets, drawings, source code control system data) would only be writing sporadically. During one of those sporadic writes they might need to update thousands of rows but those could all be done in one transaction.

    • I made a mistake -- it's an order of magnitude slower. That means a single transaction is within the threshold for human perception. If you block the UI thread on a transaction, you're now dropping 6 frames at 60fps. If you block other operations on the transaction completing, you end up with lag.

      Reference: https://www.sqlite.org/faq.html, question 19. (I've seen similar when testing on SSDs locally).

      15 replies →

  • > if your program or computer dies half way, your file is hosed

    Doesn't this apply to writing files in general? It's not unique to SQLite...

    Edit: Nope, SQLite is designed to be protected against crashes, even OS crashes.

  • > It's also very slow (~100 transactions per second)

    > But if you do that, you are probably better off with json, protobuf, or similar.

    Are you talking about a server application?

    > or unsafe if you tune for performance (you can get it up to, IIRC, ~50k transactions per second, but if your program or computer dies half way, your file is hosed)

    Never heard that SQLite has unsafe operations. Any source?

    • > Are you talking about a server application?

      I'm talking about doing a handful of transactions -- even a single one, now that I looked at the actual numbers that sqlite discusses -- being enough to introduce user-visible jank.

      > Never heard that SQLite has unsafe operations. Any source?

      The sqlite docs. You can improve the performance of sqlite by multiple orders of magnitude by messing with things like https://www.sqlite.org/pragma.html#pragma_synchronous, at the cost of safe atomic updates.
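
      Concretely, the knobs in question look like this (these are real pragmas; the speed/safety numbers are the poster's):

```python
import sqlite3

con = sqlite3.connect("fast.db")

# Fast but unsafe: skip fsync, keep the rollback journal in RAM.
# A crash or power loss mid-write can corrupt the database.
con.execute("PRAGMA synchronous = OFF")
con.execute("PRAGMA journal_mode = MEMORY")

# The durable settings for an application file format would instead be
# synchronous=FULL with the default DELETE journal (or WAL).
```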

      2 replies →

  • > It's also very slow ((edit: correction) ~10 transactions per second ...

    Huh?

    I was pushing updates to a SQLite file on my laptop a few weeks ago with code that's not at all optimised, and wasn't too fussed with ~400 transactions a second, sustained for a minute or so.

    What were you doing that only gave 10 transactions a second?

I have been thinking about this as well lately. For my app (http://ngrid.io) I'm using a custom built Entity-Component-System (https://en.wikipedia.org/wiki/Entity_component_system) based UI framework in Rust.

In the ECS paradigm, everything is stored as struct-of-arrays as opposed to array-of-structs (https://en.wikipedia.org/wiki/AoS_and_SoA) and as a result, your serialization becomes trivial. You are not chasing pointers to other objects when serializing. Apple's Core Data takes the object graph approach and it's a real clusterfuck.

ECS is very similar to databases in that your data is normalized and things are referenced by offsets rather than pointers (https://floooh.github.io/2018/06/17/handles-vs-pointers.html).

Why roll your own thing as opposed to using, say, SQLite? If you use SQLite, you will probably have some OOP on top of it, which introduces a serious impedance mismatch. With ECS, you cut that whole layer out.

I'm guessing that I'm losing on some atomicity but for my use case that doesn't matter that much.

  • ECS is useful for real-time, but it means that you are rolling your own query logic. This is fine for the runtime of game engines because they generally aren't dealing with that many data types (for even the most complex AAA games, perhaps a few hundred components) and the queries used at runtime are simple and tend to demand optimization at the data structure level.

    If your purpose is an editing tool you may want to reconsider. Editing changes the goal in a very substantial way and the complexity of your queries goes way up, which is where SQL syntax absolutely shines.

    • FWIW, a game I'm writing uses in-memory SQLite as its ECS, mostly because I wanted to see if I could. SQLite is surprisingly performant. My game is not realtime, but I tested it, and I could actually redraw the screen at 60+ FPS while querying for render data every frame, and I still had lots of frame time to spare.