Show HN: Roast my SQLite encryption at-rest

2 years ago (github.com)

SQLite encryption at-rest is a hotly requested feature of both the “default” CGo driver [1] and the transpiled alternative driver [2]. So, this is a feature I wanted to bring to my own Wasm-based Go driver/bindings [3].

Open-source SQLite encryption extensions have had a troubled last few years. For whatever reason, in 2020 the (undocumented) feature that made it easy to offer page-level encryption was removed [4]. Some solutions are stuck with SQLite 3.31.1, but Ulrich Telle stepped up with a VFS approach [5].

Still, their solution seemed harder than something I'd want to maintain, as it requires understanding the structure of what's being written to disk at the VFS layer. So, I looked at full disk encryption for something with less of an impedance mismatch.

Specifically, I'm using Adiantum, a tweakable, length-preserving encryption scheme (with 4K blocks, matching the default SQLite page size), and encrypting whole files (rather than page content).

I'm not a cryptographer, so I'd really appreciate some roasting before release.

There is nothing very Go specific about this (apart from the implementation) so if there are no obvious flaws, it may make sense to port it to C/Rust/etc and make it a loadable extension.

[1] https://github.com/mattn/go-sqlite3/pull/1109

[2] https://gitlab.com/cznic/sqlite/-/issues/105

[3] https://github.com/ncruces/go-sqlite3/issues/55

[4] https://github.com/sqlite/sqlite/commit/b48c0d59

[5] https://github.com/utelle/SQLite3MultipleCiphers

How do you feel yours compares with (say) SQLCipher, which is a very popular C based one, and keeps fairly close to the upstream SQLite releases?

* https://www.zetetic.net/sqlcipher/

* https://github.com/sqlcipher/sqlcipher

Their most recent release (v4.5.7) is based upon SQLite v3.45.3, which is the latest version of SQLite:

* https://github.com/sqlcipher/sqlcipher/releases/tag/v4.5.7

  • So, I haven't studied their solution much, but ISTM what they did was fork SQLite at (around) the time of the commit that removed SQLITE_HAS_CODEC, and forward-port the 4 years of changes since then.

    That's a bit untenable for me, since I'd rather keep as close as possible to SQLite compiled with Clang, and use the extension points already provided by SQLite (the VFS API).

    Most SQLite encryption solutions (SQLite Encryption Extension, SQLCipher, SQLite3 Multiple Ciphers, sqleet) encrypt page content, and some need to reserve a few bytes of every page to do so (for nonces, MACs). This was "easy" to do with SQLITE_HAS_CODEC, but requires some "reverse engineering" to do from the VFS layer. Some of this "reverse engineering" is likely stable, because the "checksum VFS" [1] depends on it. OTOH, extension points that are not part of the "public API" have been summarily dropped in the past [2].

    My scheme does not care about the SQLite file format(s) at all, because instead of encrypting just page content, it encrypts entire files. It uses 4K blocks, so setting page size to (at least) 4K is advised, but not required. The only assumption it makes is that SQLite is not sensitive to file sizes rounding up to the next block (4K) size. An assumption that holds for databases, journals and WALs.

    The scheme does not try to authenticate blocks, so it doesn't try to protect against forgery. Other solutions may include MACs, but to offer random access they don't protect against reverting a page to an older version of itself, so IMO, this is of limited value.

    Other schemes add a nonce to each page, which allows the on-disk content to change while the decrypted content stays the same. I don't include a nonce, so if an adversary gets hold of multiple encrypted backups of the same database, he knows not only which blocks couldn't possibly have changed, but also which ones definitely did.

    [1] https://www.sqlite.org/cksumvfs.html

    [2] https://sqlite.org/forum/forumpost/db235b3070

Hey, it's my Adiantum package! How's the performance? Revisiting the API now, I see that Encrypt and Decrypt return newly-allocated slices, which is unfortunate. Should I add Append variants that don't force an allocation?

  • Hey! That's amazing that you found this.

    I haven't benchmarked much, but I think I measured a 15% hit on speedtest1 [1] (with about 10% inside your library). Less if I kept temporary files in memory. Other solutions claim less of a performance hit. I'd have to measure to make sure.

    There are some things I could do to improve performance. A partial block update needs a read-update-write cycle; journals and WALs do some of these. I could buffer consecutive writes in memory unencrypted and only flush them on fsync. Didn't do that as it requires some additional bookkeeping.

    I don't think Encrypt and Decrypt allocate at all! They encrypt/decrypt in-place; the returned slice just aliases the input block. But thanks, it'd be pretty bad if they did.

    [1] https://sqlite.org/src/file/test/speedtest1.c

    • Ha, you're right, I even have benchmarks in the README showing 0 allocs/op. Nice. (Sorry for doubting you, past-self!)

      In the past I've implemented a form of the write-buffering you describe -- it was not fun. But it was a lot more important in that context, because flushing a write meant uploading many MB to remote servers. With 4 KB pages flushed locally, I'd wager it's not worth it.

SQLite has a paid version. One of the paid features is encryption.

They’re not going to make it easy.

  • Of course, nor would I expect them to make it so.

    In fact, I mention the SEE in my package's documentation. If you have a license to the paid extension, it should be easy to compile it and use it with my package.

    It will be slow, however, because it will be running the reference AES implementation in Wasm.

    That said, if anyone is interested in sponsoring a SEE license, I can look into doing the encryption in Go (which uses assembly on most platforms for those bits).

For your KDF, how did you select the parameters for Argon2id? How often is the KDF invoked / what are your requirements for speed in KDF'ing?

Nit, but your "pepper" is confusingly named IMO, because it is hardcoded and not secret. "label" may make more sense.

However, one thing to keep in mind with the use of a static (and public) Argon2 salt is that it allows an attacker to pre-compute keys. If this package were to gain adoption, I think that may become a bigger issue. I would recommend randomly generating a 128-bit salt, similar to how you're randomly generating the key if one isn't provided.

  • I used the parameters suggested in the documentation [1], which follow the RFC.

    The KDF is invoked every time a connection is opened iff you specify your key with the textkey= parameter. It is ill-advised to overuse this, especially if you don't use connection pooling, as it makes opening connections slow. You can bypass the KDF by providing your own 256-bit key using either the hexkey= or key= parameters (key= cannot embed NUL bytes).

    I agree pepper is confusing (because a pepper is supposed to be secret), but this is not a salt either, as a salt is supposed to be public but unique. Do you have a better name to suggest?

    Anyway, I forgot to do this, but the intention was for the pepper to be changeable as a build parameter. Thanks for reminding me!

    [1] https://pkg.go.dev/golang.org/x/crypto/argon2#IDKey

    • > I used the parameters suggested in the documentation [1], which follow the RFC

      Where in the RFC is your parameter set mentioned? I don't see it anywhere [0]. The only parameter set I see mentioned with a memory requirement as low as 64 MiB has t=3, not t=1. I believe the Go documentation might be outdated.

      [0]: https://datatracker.ietf.org/doc/rfc9106/

      > I agree pepper confusing (because the pepper is supposedly secret), but this is not a salt either, as a salt is supposed to be public, but unique. Do you have better naming that you can suggest?

      If it's tweakable at build, perhaps pepper makes more sense.


Why do you need this built into your database?

Surely if you turn on disk encryption on your drive, then when an attacker yanks the drive and tries to read your database without your password, they fail?

  • This is very useful, for example, when you ship your application to the client(s) with SQLite as the main conf/data storage. You don't have to think about whether their drives are encrypted. Also, it assures the clients that all the data in your application is encrypted by default.

    • Where are the encryption keys stored? Does the client have to type in a password every time?

      Userspace encryption of user data has been almost universally rejected because there's no reasonable attack it defends against: any attacker that has access to the data also has access to the encryption key.


  • I don't know if that's his use case, but we had many users tell us they share their account with other users so they want the data to be encrypted even when logged in.

    Of course if they share their computer, someone could install a keylogger and wait for them to type their passwords, but I guess that's an extra layer of security that may help a bit.

  • Data is stolen much more often by copying it than by yanking the drive out.

Have you considered targeting libSQL for this instead? https://github.com/tursodatabase/libsql

SQLite is a great product and I use it a lot, but it does not have a great history of outside contributions and collaboration. If I were developing any kind of extension, I feel like I would have a better shot at getting help and feedback from a community-based project.

  • My extension targets my Go SQLite bindings (the VFS is implemented in Go) [1].

    This extension, and the wrapper, target public SQLite APIs, so if anyone wants to replace SQLite with libSQL, that should be easy.

    You simply need to include a little bit of glue C code that I append to the amalgamation, and use some SQLite compile options.

    I explicitly support building and providing your own, custom, Wasm SQLite "blob" (it just needs to include my glue code).

    As for Adiantum encryption, as I said, reimplementing this in C/C++/Rust to make it a loadable extension is perfectly viable, and would be compatible with libSQL, again because this uses only public SQLite APIs (that's the point, basically!)

    But this is predicated on it being a secure scheme (that's the feedback I was looking for this time around).

    PS: I've got nothing against Turso, or libSQL. In fact, I spent the last year perusing their virtual WAL API. The problem is that I found no documentation, nor any useful open-source implementations of it. If there are any, I'd be very interested. So, thus far, I also don't have anything that drives me towards libSQL.

    [1] https://github.com/ncruces/go-sqlite3

  • Turso / libsql supports encryption, integrated with MultipleCiphers

    Example and usage code is here - https://turso.tech/blog/fully-open-source-encryption-for-sql...

    • FTA: "One project in particular was very suitable for us, SQLite Multiple Ciphers. Since it is licensed under MIT, we have just moved the code into libSQL."

      See, this is what I don't get about libSQL. Turso claims to want to foster a community around it, but then I go to the SQLite3MultipleCiphers GitHub and there's zero trace of contacting the author about "just moving the code into libSQL."

      I hope they at least considered sponsoring development [1], given that they're making this a "premium feature" of their hosted offering.

      [1]: https://www.paypal.com/donate/?hosted_button_id=3JC9PMMH8X7U...


A couple of thoughts:

First, great job on the readme! One way you could improve it is by expanding on the "Caution" section. What's written is the beginnings of a threat model, but it could be improved by being more explicit about which attacks this system does/doesn't defend against.

> The only security property that disk encryption (and this package) provides is that all information such an adversary can obtain is whether the data in a sector has (or has not) changed over time.

I think the adversary learns a bit more than this. Randomized encryption would provide the above property, but the _deterministic_ scheme that's used here will let the adversary learn not only whether a sector changed, but whether its value matches what it was at a previous point in time.

How does this translate into the security of the database itself? Seeing which blocks have changed might reveal information about what data has changed. Let's consider a security game where I (the adversary) get to submit SQL queries, and then learn which blocks on disk have changed. After this initial phase (where I can learn where data is stored), I submit two different SQL queries, you pick one of them at random and run it, and then tell me which blocks have changed. I win if I can guess which SQL query you picked.

Suppose I submit queries which each insert into a different table. Because the tables are stored separately on-disk, it'll probably be pretty easy for me to distinguish them. But okay, that's still count-ish/size-ish, and maybe out of scope.

What if I submit two queries which each insert different values, but into the same table. Further, let's say that this table has an index. Based on which pages were written to, I can now learn something about the _values_ that were inserted, because different values will write into the index in different places.

Now, it's completely valid if the threat model says, "if you can see more than two copies of the database file, then all is lost." However, I think it'd be worth translating the current write-up of the threat model into the implications for leaking the database. For more examples of attacks based on seeing which indices/sizes changed, see [1] and [2].

Is it valid to pad the sqlite file to a multiple of the block size? Does sqlite ever call truncate on a non-block-aligned size and expect any truncated bytes to be fully removed?

What are the atomicity requirements for a SQLite VFS? SQLite, in general, is supposed to not get corrupted if power were to be yanked mid-write. However, because this VFS writes one block at a time, the computer dying mid-write could corrupt more bytes around the write position than would normally be corrupted if the standard VFS was used. It's possible this is a non-issue, but it's worth considering what contract SQLite has for VFSes.

[1]: https://en.wikipedia.org/wiki/CRIME

[2]: https://www.usenix.org/legacy/events/sec07/tech/full_papers/...

  • The threat model has to exclude:

    - attacks on a running app that has the keys loaded, naturally

    The threat model has to include at least:

    - passive attacks against the DB itself, lacking access to the keys

    The threat model really should also include:

    - active attacks against the DB lacking access to the keys (e.g., replace blocks)

    IMO ZFS does a pretty good job against these threats, for example, so ZFS is a good yardstick for measuring things like TFA.

    However, the fact that a running system must have access to the keys means that at-rest data encryption does not buy one much protection against server compromise, especially when the system must be running much/most/all of the time. So you really also want to do the utmost to secure the server/application.

    • ZFS, AFAIK, can offer something in addition which is harder for a VFS to offer, and which AFAICT no other SQLite encryption offers: a kind of HMAC Merkle tree that authenticates an entire database (at a point in time).

      Alternatives, even those that use MACs, only authenticate pages/blocks. They still allow mix-and-match of pages/blocks from previous backups.

      I could, potentially, add optional/configurable nonces and MACs at the VFS layer.

      I've refrained from doing so because (1) it complicates the implementation; (2) it can be added later, compatibly; (3) it doesn't fix mix-and-match; (4) it will impact performance further; and (5) it would be MAC-then-encrypt (against best practice).


    • > ... the fact that a running system must have access to the keys means that at-rest data encryption does not buy one much protection against server compromise, especially when the system must be running much/most/all of the time.

      A common approach to help mitigate this is by having the keys be fetchable (eg via ssh) from a remote server.

      Preferably hosted in another jurisdiction (country) in a data centre owned by a different organisation (ie. not both in AWS).

      When the encrypted server gets grabbed, the staff should (!) notice the problem and remove its ssh keys from the ssh server holding the ZFS encryption keys.

      ---

      That being said, I'm not an encryption guy whereas some of the people in this thread clearly are. So that's just my best understanding. ;)


  • First of all, thanks for the review. I'll try to respond to all points.

    Disk encryption, on which this is based, is usually deterministic in nature.

    So yes, an adversary 1 that can inspect multiple versions of a database (e.g. backups) can learn exactly which blocks changed, which didn't change, and which have been reverted; but that is all they should learn.

    An adversary 2 that can modify files can also mix-and-match blocks between versions to produce a valid file with high probability.

    And an adversary 3 that can submit changes and see their effect on the encrypted data can probably infer a lot about the database.

    I'll try to make these more explicit in the README. In practical terms: adversary 1 is the one I thought I'd covered reasonably well; adversary 2 means that backups should be independently signed, and signatures verified before restoring them; adversaries 2 and 3 mean that this is ineffective against live attacks.

    Security, though, is also about comparing options. Reading the documentation for alternatives (even the expensive ones), I don't see this kind of analysis. I see 2 advantages to the alternatives that encrypt page data with a nonce and a MAC. The nonce allows reverts to go unnoticed: no change means a block definitely didn't change, but ciphertext changing doesn't necessarily mean plaintext changed. The MAC ensures blocks are valid, but they can still be reverted to previous versions of themselves; mix-and-match is still possible. Do these two properties make a huge difference? Is there anything else I'm missing?

    On your other points.

    Yes, it's always safe to round up file sizes to the block size for databases, journals and WALs (I could detail why, but the formats are documented). It may not be safe for all temporary files (I'm assuming it is), but that can be fixed for those files by remembering the file size in memory.

    About atomicity, corruption, etc, the VFS is supposed to declare its characteristics [1] to SQLite. Your concerns are covered by SAFE_APPEND and POWERSAFE_OVERWRITE. See also [2]. As a wrapper VFS, I filter most of those characteristics from the underlying VFS, forcing SQLite to assume the worst.

    [1] https://www.sqlite.org/c3ref/c_iocap_atomic.html

    [2] https://www.sqlite.org/psow.html