Comment by boudewijnrempt

5 years ago

I was surprised, but also intrigued, when I discovered that Manga Studio (Clip Studio these days) uses SQLite for its native file format, including the pixel planes.

I wish my predecessor at Krita had made that choice, instead of choosing to re-use the KOffice xml-based file format that's basically a zip file with stuff in it. It would have made adding the animation feature so much easier.

So isn't it possible for Krita to create a migration process and switch? Btw, it's really fantastic that Krita is on Android now!

What's wrong with xml + zip?

  • XML-based formats are export formats, not state-keeping formats. To use an XML-based format for storage, you need to have a separate, canonical in-memory representation of the data, which you then snapshot and serialize into XML upon request. You may or may not be able to get away with serializing less than your full in-memory object graph upon save, using techniques similar to DOM reconciliation. Either way, you'll still likely need your entire document/project represented in memory.

    If you're working with something analogous to a text document, this snapshot-and-serialize approach to saving works fine. If you're working with other types of data, though, this approach only works for trivial projects; once your document exceeds ~100MB, the overhead of snapshotting and serializing your object graph becomes bad enough that people stop saving as often as they should (dangerous!), and it also makes the saving process itself more fragile (since the longer a save takes, the more likely it becomes that the process might be killed by some natural event like a power cut during it†).

    And, once your project size exceeds the average computer's memory capacity, an in-memory canonical representation quickly becomes untenable. You start having to resort to hacks like forcing the user to "partition" their project, only allowing them to work with one piece at a time.

    With an application state-keeping format, you have none of these concerns; the store is itself the canonical data location. You don't have a canonical in-memory representation of the data; the in-memory representation is simply a write-through or write-back caching layer for the object graph on disk, and the cache can be flushed at any time. Or you may not have a cache at all; many systems that use SQLite as a file format just do SQL queries directly whenever they want to know something, never instantiating any intermediate in-memory representation of the data itself, only retrieving "reports" built from it, as the sketch below illustrates.

    † You can fix fragile saving with a write-ahead log (WAL), but now the WAL is your true application state-keeping format, with the XML format just being a convenient rollup representation of it.
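
    To make the "the store is the canonical data location" idea concrete, here is a minimal sketch of the direct-query style using Python's built-in sqlite3. The table layout and names are invented for illustration; this is not how any particular application organizes its data.

    ```python
    import sqlite3

    # The project file on disk *is* the document; opening it is "loading" it.
    db = sqlite3.connect("project.db")

    # Write-ahead logging: an interrupted write rolls back to the last
    # committed state instead of leaving a half-written file behind.
    db.execute("PRAGMA journal_mode=WAL")

    db.execute("""
        CREATE TABLE IF NOT EXISTS tiles (
            layer  INTEGER NOT NULL,
            x      INTEGER NOT NULL,
            y      INTEGER NOT NULL,
            pixels BLOB    NOT NULL,
            PRIMARY KEY (layer, x, y)
        )
    """)

    def write_tile(layer, x, y, pixels):
        # Only the touched tile is written; there is no whole-document
        # snapshot-and-serialize step on save.
        with db:  # one atomic transaction per change
            db.execute(
                "INSERT OR REPLACE INTO tiles (layer, x, y, pixels) VALUES (?, ?, ?, ?)",
                (layer, x, y, pixels),
            )

    def read_tile(layer, x, y):
        # Query the store directly instead of walking an in-memory object graph.
        row = db.execute(
            "SELECT pixels FROM tiles WHERE layer = ? AND x = ? AND y = ?",
            (layer, x, y),
        ).fetchone()
        return row[0] if row else None
    ```

    Each edit touches only the rows it changes, and the WAL pragma is what turns a power cut mid-write into a rollback to the last committed state rather than a corrupted file.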

    • > it also makes the saving process itself more fragile (since the longer a save takes, the more likely it becomes that the process might be killed by some natural event like a power cut during it†)

      This is one I take very seriously, after I got bit by it. I was saving state by writing s-expressions to a text file; it seemed a reasonable enough thing to do even with tens of megabytes of it, until my laptop turned off in the middle of a write. After recovering from a backup and losing several hours of work in the process, I switched to SQLite that evening.

    • I've never had a problem of mine scale to the size that required a database/SQL, but I don't quite get the advantage of your solution. Having every interaction with your data go to disk through a cache muddles things b/c it makes it much harder to reason about performance (when do you have a cache miss? how do you configure the cache properly?). You introduce a lot more black-magic variables to reason about.

      If you're editing images, I'd think it'd just make more sense to keep all of your stuff in RAM and do the saving-to-disk on a separate thread. I don't quite get why users would stop saving in this example.

      I'm not saying you're wrong, just asking for some more details, b/c I've never imagined using a DB on data that can fit in RAM.

  • One of the famous SQLite refrains is:

    > SQLite does not compete with client/server databases. SQLite competes with fopen().

    This undersells SQLite somewhat. Like Berkeley DB, SQLite was created as an alternative to dbm [1], and one of its main use cases is safe multi-process access to a single data file, typically from CLI apps (written in C, TCL, etc.).

    Client-server databases tackle multi-user concurrency, while embedded databases often tackle multi-process concurrency; a minimal sketch of the latter follows the links below.

    This article has long been part of the SQLite documentation found under the "Advocacy" section. There is also a short version. [2]

    [1] https://en.wikipedia.org/wiki/DBM_(computing)

    [2] https://sqlite.org/aff_short.html
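
    To make the multi-process point concrete, here's a minimal sketch with Python's built-in sqlite3 (the table and file names are made up): any number of independent processes can open the same file and append to it, with SQLite's file locking serializing the writers.

    ```python
    import os
    import sqlite3

    # Several independent processes (CLI runs, cron jobs, ...) can open the
    # same file; SQLite's locking serializes concurrent writers.
    db = sqlite3.connect("shared.db", timeout=5.0)  # wait up to 5s if another process holds the write lock
    db.execute("CREATE TABLE IF NOT EXISTS log (pid INTEGER, msg TEXT)")

    with db:  # commit this insert atomically
        db.execute("INSERT INTO log VALUES (?, ?)", (os.getpid(), "hello from one of many processes"))
    ```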

  • I don't personally know anything wrong with it, but SQL queries are powerful powerful powerful, so although XML might not have any huge issues, in my personal opinion it is still sub-optimal compared to what is possible with SQL (at least for my uses, which could be very different from yours, so take this with a grain of salt). In particular, SQLite is a tiny executable with a HUGE amount of tests and is used very widely, so it's reliable for a lot of tasks as long as the data size stays relatively small and you cast your types properly.

    SQLite also has a lot of tutorials, books, and language support. I've used it with Python, C#, Perl, TCL, and Powershell with no issues. You can access it via the command line, or you can hook into it with a fully graphical SQL IDE like DB-Visualizer (I really recommend using an IDE for interactive SQL use). If your language doesn't have built-in or library support, even a novice programmer like myself can roll a few functions together to build the tables, update, delete, and run queries to analyze the data, as long as you can run some system commands. It's a wonderful little technology that I feel comfortable reaching for when I need it.

    One thing I've shied away from over the years is technologies that require running complex installers, as they make things more confusing and harder for me to share with colleagues who aren't as interested in programming. Both SQLite and DB-Visualizer require zero installation: I just put each in a folder and run the executable. This is really easy for me to use and easy to get others started with too. Note that this is not commercial software, but internal business apps to help people do complex tasks more easily. So you have a script that does some data processing, pushes that data to SQLite, and then the user can bring up DB-Visualizer, point it at the little SQLite .db file the script created, and get to work. We have a lot of little apps like this, and since most of our engineers are really good with SQL, they can do whatever they need efficiently.
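
    A minimal sketch of that kind of workflow in Python (the table and file names are made up): a processing script dumps its results into a .db file, and whoever needs them points DB-Visualizer or the sqlite3 command-line shell at it.

    ```python
    import sqlite3

    # Push processed results into a single .db file that colleagues can open
    # in DB-Visualizer, the sqlite3 CLI, or any other SQL front-end.
    def export_results(rows, db_path="results.db"):
        db = sqlite3.connect(db_path)
        db.execute("""
            CREATE TABLE IF NOT EXISTS measurements (
                part   TEXT,
                value  REAL,
                passed INTEGER
            )
        """)
        with db:  # commit all rows in one transaction
            db.executemany(
                "INSERT INTO measurements (part, value, passed) VALUES (?, ?, ?)",
                rows,
            )
        db.close()

    # e.g. rows produced by an earlier processing step
    export_results([("A-113", 0.97, 1), ("A-114", 1.42, 0)])
    ```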

  • Seems very hard to seek in XML? And how would you update it in place? Also seems like it would be massive?

    • Well, we're not storing pixels in XML, but now we've got two XML files, one with author data, one describing the image, and a folder tree containing various kinds of binary data, all in a zip file. And yes, updating in place is a bother.

  • It's really hard to update in place; we basically have to rewrite the whole file after a change. And, of course, we've got to do a lot of things ourselves, like inventing an XML format for describing the image.
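
    For anyone wondering why update-in-place is such a bother with a zip container: the standard zip libraries can't replace a single member in an existing archive, so saving one change means copying every other entry into a brand-new file. A rough sketch of what that looks like with Python's zipfile (the file and member names are made up, not the actual layout of any format):

    ```python
    import os
    import zipfile

    def replace_member(archive, member, new_bytes):
        # zipfile has no in-place update, so every unchanged entry is copied
        # into a fresh archive just to swap out a single member.
        tmp = archive + ".tmp"
        with zipfile.ZipFile(archive) as src, zipfile.ZipFile(tmp, "w") as dst:
            for item in src.infolist():
                if item.filename == member:
                    dst.writestr(item, new_bytes)
                else:
                    dst.writestr(item, src.read(item.filename))
        os.replace(tmp, archive)  # atomically swap the old archive for the new one

    # e.g. changing one layer's data rewrites the whole project file:
    # replace_member("project.zip", "data/layer1.bin", b"...new binary data...")
    ```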

  • Do people really pick XML as a go-to format these days? Probably easier to list what's not wrong with it.

    • Yes, unfortunately, if you're in an environment where people have been using XML forever and are more comfortable with it than any other alternative. When you find yourself needing to write an XPath to update a simple app configuration parameter, you know you're in hell.

    • Well, these days 22 years have passed since the original file format was designed -- and not by me, though I probably would have done the same thing, 22 years ago.

      Also, it's zip + xml files + binary files, not all xml.
