Comment by derefr

5 years ago

XML-based formats are export formats, not state-keeping formats. To use an XML-based format for storage, you need to have a separate, canonical in-memory representation of the data, which you then snapshot and serialize into XML upon request. You may or may not be able to get away with serializing less than your full in-memory object graph upon save, using techniques similar to DOM reconciliation. Either way, you'll still likely need your entire document/project represented in memory.

If you're working with something analogous to a text document, this snapshot-and-serialize approach to saving works fine. If you're working with other types of data, though, this approach only works for trivial projects; once your document exceeds ~100MB, the overhead of snapshotting+serializing your object graph becomes bad enough that people stop saving very often (dangerous!), and it also makes the saving process itself more fragile (since the longer a save takes, the more likely it becomes that the process might be killed by some natural event like a power cut during it†.)

And, once your project size exceeds the average computer's memory capacity, an in-memory canonical representation quickly becomes untenable. You start to have to resort to hacks like forcing the user to "partition" their project, only allowing the user to work with one piece at a time.

With an application state-keeping format, you have none of these concerns; the store is itself the canonical data location. You don't have a canonical in-memory representation of the data; the in-memory representation is simply a write-through or write-back caching layer for the object graph on disk, and the cache can be flushed at any time. Or you may not have a cache at all; many systems that use SQLite as a file format just do SQL queries directly whenever they want to know something, never instantiating any intermediate in-memory representation of the data itself, only retrieving "reports" built from it.

† You can fix fragile saving with a WAL log, but now the WAL log is your true application state-keeping format, with the XML format just being a convenient rollup representation of it.
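
To make the "store is the canonical data location" idea concrete, here is a minimal Python sketch using the standard-library `sqlite3` module. The `shapes` table and the helper names (`open_project`, `add_shape`, `area_by_layer`) are made up for illustration and are not taken from any real application; the point is only that each edit is its own small transaction and that a "report" is just a query, with no object graph held in memory.

```python
import sqlite3


def open_project(path):
    """Open (or create) a project file. The schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shapes ("
        " id INTEGER PRIMARY KEY,"
        " layer TEXT NOT NULL,"
        " area REAL NOT NULL)"
    )
    return conn


def add_shape(conn, layer, area):
    """Persist one edit immediately; there is no separate 'save' step."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO shapes (layer, area) VALUES (?, ?)", (layer, area)
        )
    return cur.lastrowid


def area_by_layer(conn):
    """A 'report' computed by the database, not from an in-memory model."""
    rows = conn.execute(
        "SELECT layer, SUM(area) FROM shapes GROUP BY layer ORDER BY layer"
    )
    return dict(rows.fetchall())
```

With this shape of code, the cost of persisting an edit depends on the size of the edit rather than on the size of the whole project.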

4 comments

catalogia 5 years ago

> it also makes the saving process itself more fragile (since the longer a save takes, the more likely it becomes that the process might be killed by some natural event like a power cut during it†.)

This is one I take very seriously, after I got bit by it. I was saving state by writing s-expressions to a text file; it seemed a reasonable enough thing to do even with tens of megabytes of it, until my laptop turned off in the middle of a write. After recovering from a backup and losing several hours of work in the process, I switched to SQLite that evening.
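
A rough Python sketch of why a transactional store helps with this failure mode, using the standard-library `sqlite3` module. The `notes` table, the helper names, and the `fail_after` hook are all hypothetical; the raised exception only stands in for an interruption, and real power-cut safety comes from SQLite's own journaling, which this sketch does not exercise directly.

```python
import sqlite3


def open_store(path):
    """Open (or create) the store. The 'notes' schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)"
    )
    return conn


def save_notes(conn, notes, fail_after=None):
    """Replace all stored notes in a single transaction.

    fail_after is an illustrative hook: it raises after that many rows
    have been written, to imitate a save that gets cut off part-way.
    """
    with conn:  # rolls the whole transaction back if an exception escapes
        conn.execute("DELETE FROM notes")
        for i, text in enumerate(notes):
            if fail_after is not None and i >= fail_after:
                raise RuntimeError("simulated interruption")
            conn.execute("INSERT INTO notes (body) VALUES (?)", (text,))


def load_notes(conn):
    """Return all note bodies in insertion order."""
    return [row[0] for row in conn.execute("SELECT body FROM notes ORDER BY id")]
```

If `save_notes` is interrupted after deleting the old rows and writing only some of the new ones, the rollback leaves the previously saved notes in place, instead of the half-written file you get when overwriting a text file directly.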

geokon 5 years ago

I've never had my problem scale to the size that required a database/SQL, but I don't quite get the advantage of your solution. Having all your interactions with data have to go to disk through a cache muddles things b/c it makes it much harder to reason about performance (b/c when do you have a cache miss? and how do you configure a cache properly?) You introduce a lot more black-magic variables to reason about.

If you're editing images I'd think it'd just make more sense to have all of your stuff in RAM, with saving to disk done on a separate thread. I don't quite get why the users would stop saving in this example.

I'm not saying you're wrong - but more asking for some more details b/c I've never imagined using a DB on data that can fit in RAM

HelloNurse 5 years ago

It's primarily a problem of inflexibility handicapping performance, not of "cache misses" and clever algorithms.

For example, imagine a word processing program opening a document and showing you the first page: you could load 50MB of kitchen sink XML and 250 embedded images from a zip file and then start doing something with the resulting canonical representation, or you could load the bare minimum of metadata (e.g. page size) from the appropriate tables and the content that goes in the first page from carefully indexed tables of objects. Which variant is likely to load faster? Which one is guaranteed to load useless data? Which one can save the document more quickly and efficiently (one paragraph instead of a whole document or a messy update log) when you edit text?
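
The word-processor example could look something like the following Python sketch, again with the standard-library `sqlite3` module. The `meta` and `paragraphs` tables, the index, and the function names are assumptions invented for illustration, not the layout of any real document format.

```python
import sqlite3


def create_document(conn, paragraphs):
    """Build a toy document. paragraphs: iterable of (page, position, text)."""
    with conn:
        conn.execute("CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT)")
        conn.execute(
            "CREATE TABLE paragraphs ("
            " id INTEGER PRIMARY KEY,"
            " page INTEGER NOT NULL,"
            " position INTEGER NOT NULL,"
            " body TEXT NOT NULL)"
        )
        # The index lets a single page be fetched without scanning the rest.
        conn.execute(
            "CREATE INDEX paragraphs_by_page ON paragraphs (page, position)"
        )
        conn.execute("INSERT INTO meta VALUES ('page_size', 'A4')")
        conn.executemany(
            "INSERT INTO paragraphs (page, position, body) VALUES (?, ?, ?)",
            paragraphs,
        )


def load_page(conn, page):
    """Fetch only the metadata and paragraphs needed to show one page."""
    size = conn.execute(
        "SELECT value FROM meta WHERE key = 'page_size'"
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT id, body FROM paragraphs WHERE page = ? ORDER BY position",
        (page,),
    ).fetchall()
    return size, rows


def edit_paragraph(conn, paragraph_id, new_body):
    """Save an edit by rewriting one row, not the whole document."""
    with conn:
        conn.execute(
            "UPDATE paragraphs SET body = ? WHERE id = ?",
            (new_body, paragraph_id),
        )
```

Opening the document then means calling `load_page(conn, 1)`, and saving a text edit means one `UPDATE`; embedded images or later pages are never touched unless something asks for them.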

geokon 5 years ago

ah okay, incremental loading seems essential and I hadn't considered it. Thanks for explaining :)