
Comment by adityaathalye

4 months ago

For my little system (blog author here) I've decided that all tables will be append-only logs of facts. In the post, I tried keeping a traditional schema alongside a "main" facts table. Of course, the problem of audit logs comes up for any change made to traditional "current-database-view" tables. And then who audit-logs the audit log?

I feel like "two systems" is the problem.

Writes should either go through a traditional schema that feeds a standard log of all changes, OR the system should be "everything is a log", with the "current" view kept updated as just a special case of the "as of" query, where "as of" is always "now".
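A minimal sketch of that special case, using an illustrative SQLite facts table (made-up names, not the blog's actual schema): the "current" view is simply the "as of" query with the upper bound pinned to now.

```python
# Illustrative append-only facts table; the "current view" is as_of(now).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE facts (
        entity TEXT NOT NULL,
        attr   TEXT NOT NULL,
        value  TEXT,
        txn_at INTEGER NOT NULL  -- transaction timestamp
    )
""")
con.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", [
    ("user:1", "email", "a@old.example", 100),
    ("user:1", "email", "a@new.example", 200),
])

def as_of(t):
    """Latest value per (entity, attr) recorded at or before time t."""
    return con.execute("""
        SELECT entity, attr, value FROM facts f
        WHERE txn_at = (SELECT MAX(txn_at) FROM facts
                        WHERE entity = f.entity AND attr = f.attr
                          AND txn_at <= ?)
    """, (t,)).fetchall()

print(as_of(150))    # historical view: the old email
print(as_of(10**9))  # "as of now": this IS the current view
```

The current view needs no separate machinery: it is `as_of(now)`, the same query every other point-in-time read uses.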

How badly my schema will behave (in my SQLite-based architecture) remains to be discovered. I will hopefully be in a position to run a reasonable benchmark next week.

A follow-up blog post is likely :sweat-smile:

> OR, should be "everything is a log",

100%. This is a regret I have about our lifecycle tracking. We are effectively updating a giant spreadsheet (table) and tracking side effects of those updates. I would much rather work in a log/event-based system that flips that on its head, where we track events like domain_renewed or domain_transferred.

As it stands, we track that the renewed_at attribute changed, and have to infer (along with other fields) things like "was this a renewal event?", whose meaning has changed over time.

Alas, change tracking wasn't even part of the spec of the original feature... so I am glad I had the foresight to institute it from 2022, imperfect as it currently is.
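A toy sketch of the flip described above — domain_renewed / domain_transferred as first-class events, with renewed_at derived from the log instead of the other way around (all event names and fields are illustrative, not from any real system):

```python
# Hypothetical event log: record events directly, derive attributes from them,
# instead of diffing attributes and inferring events after the fact.
events = [
    {"type": "domain_registered",  "domain": "example.com", "at": 1},
    {"type": "domain_renewed",     "domain": "example.com", "at": 5},
    {"type": "domain_transferred", "domain": "example.com", "at": 7},
    {"type": "domain_renewed",     "domain": "example.com", "at": 9},
]

def renewed_at(domain):
    """Derived attribute: timestamp of the most recent renewal event."""
    times = [e["at"] for e in events
             if e["type"] == "domain_renewed" and e["domain"] == domain]
    return max(times) if times else None

def was_renewal(at):
    """'Was this a renewal event?' becomes a direct lookup, not an inference."""
    return any(e["type"] == "domain_renewed" and e["at"] == at for e in events)

print(renewed_at("example.com"))  # 9
print(was_renewal(7))             # False: that was a transfer
```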

  • So much this!

    Straight-up copy-paste from my development notes...

    * what pivots in current db vs temporal db?

    What is the difference between these pictures?

    #+BEGIN_SRC text

      ("Current" DB)
    
       CRUD records
             ^
             |
             v
      [ current view ] -- update --> [ facts log ]
    
      ---------------------------------------------------------
    
      [ current view ] <-- update -- [ facts log ]
             |                             ^
             v                             |
        READ records             WRITE records (as facts)
    
      ("Temporal" DB)
    

    #+END_SRC

    - Hint: It is /not/ the schema. It is /what/ "pivots".

    - In both cases the current view can be understood to be a pivot table of the facts log.

    - BUT in the current DB, we must "pivot" the /process/, i.e. take a CRUD op and transform it into an audit-log entry. This /must/ be done synchronously, in real time. Whereas in the Temporal DB's case, we must "pivot" the stored data, which we can do at any point in query time, as of any point in time in the log.

    - The complexity of Current DBs derives from /live process management/ challenges. Whereas the complexity of Temporal DBs derives from /retroactive process management/ challenges.

    /now/ is /never/ cheap. It is the most expensive non-fungible thing. Once it's gone, it's gone. Fail to transact an audit trail for the CRUD operation? Too bad. Better luck next time. Whereas disk space is cheap and practically infinite, which affords Temporal DBs greater opportunity to find a better trade-off between essential complexity and DB capabilities. At least as long as disk space remains plentiful and cheap.

    This is why if we are modeling a Temporal DB over a Current DB, it is preferable to write all tables as /fact/ tables and query their auto-generated 'current view' versions, for normal query needs. For audit / analysis needs, we can snapshot the facts tables and operate on those out-of-band (detached from the live app). Impedance mismatch occurs when trying to run /both/ current CRUD tables (writing to audit logs) for some parts of the schema, in parallel with a "main" facts table for all fact-records. In a given data system, it is better to do either one or the other, not both at the same time.
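One way to sketch the "write all tables as fact tables, query their auto-generated current view" approach in SQLite (illustrative names; a correlated subquery picks the latest fact per key):

```python
# Sketch: every write is an append-only fact; the "current view" is a VIEW.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user_facts (
    id      INTEGER PRIMARY KEY,  -- append order
    user_id INTEGER NOT NULL,
    name    TEXT,
    email   TEXT
);

-- Auto-generated "current view": the latest fact per user_id.
CREATE VIEW users_current AS
SELECT user_id, name, email
FROM user_facts f
WHERE id = (SELECT MAX(id) FROM user_facts WHERE user_id = f.user_id);
""")

con.execute("INSERT INTO user_facts (user_id, name, email) "
            "VALUES (1, 'Ada', 'ada@old.example')")
con.execute("INSERT INTO user_facts (user_id, name, email) "
            "VALUES (1, 'Ada', 'ada@new.example')")

print(con.execute("SELECT * FROM users_current").fetchall())
# -> [(1, 'Ada', 'ada@new.example')]
```

Normal queries hit `users_current`; audit/analysis work can snapshot `user_facts` and run out-of-band, detached from the live app.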

    • The problem with `everything is a log` is that it is very undisciplined, whereas a trigger on the actual main table has these serious advantages:

      * Your main table has the correct shape, and so does the derived log!
      * Most of the queries are against that table, at least for now.
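For concreteness, a minimal sketch of that trigger approach in SQLite — the main table keeps its normal shape, and a trigger appends to the derived log on every update (table and column names are illustrative):

```python
# Sketch: main table + derived audit log maintained by a trigger.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE domains (
    name       TEXT PRIMARY KEY,
    renewed_at INTEGER
);

CREATE TABLE domains_log (
    name       TEXT,
    renewed_at INTEGER,
    changed_at INTEGER DEFAULT (strftime('%s', 'now'))
);

-- Every UPDATE on the main table appends a row to the derived log.
CREATE TRIGGER domains_audit AFTER UPDATE ON domains
BEGIN
    INSERT INTO domains_log (name, renewed_at)
    VALUES (NEW.name, NEW.renewed_at);
END;
""")

con.execute("INSERT INTO domains VALUES ('example.com', 100)")
con.execute("UPDATE domains SET renewed_at = 200 WHERE name = 'example.com'")

print(con.execute("SELECT name, renewed_at FROM domains_log").fetchall())
# -> [('example.com', 200)]
```

Note this is exactly the "pivot the process" shape discussed above: the audit entry must be produced synchronously, inside the write path.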


Looking forward to the second blog!

As a side project, I'm writing a CLI to tie environment variables to project directories ( https://github.com/bbkane/enventory ) and share env vars between projects with a "reference" system. Similar to direnv but in a central SQLite DB.

See https://github.com/bbkane/enventory/tree/master/dbdoc for how the schema looks, but it's a few tables with foreign keys to each other to support the "reference" system.

Unfortunately those foreign keys mean I don't have a way to "undo" changes easily. So instead I added a confirmation prompt, but I'd still like that undo button.

Hopefully in subsequent blog posts you can follow up on your "Model an example domain of sufficient complexity" TODO and make this a bit less abstract (and hopefully more understandable) to me.

  • Maybe I could convert every table into a "log" table (i.e., append-only, similar to your blog) and add a timestamp (or UUID) column to each table. Then I also store all of these UUIDs in another table to keep track of the global order. To "undo", I look up the previous UUID, find entries with that UUID, and append to the log table the values that make the state the same as at the previous UUID....

    This isn't as general as your setup, but I think it would work ok with my foreign key issue.
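A rough sketch of that undo scheme, under the stated assumptions: every table is an append-only log tagged with a change id, a separate table records the global order, and "undo" re-appends the state as of the previous change id. All names here are hypothetical.

```python
# Sketch: append-only log + global change-order table + undo by re-appending.
import sqlite3
import uuid

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE env_var_log (
    change_id TEXT NOT NULL,
    name      TEXT NOT NULL,
    value     TEXT
);
CREATE TABLE change_order (
    seq       INTEGER PRIMARY KEY,  -- global order of changes
    change_id TEXT NOT NULL
);
""")

def write(name, value):
    cid = str(uuid.uuid4())
    con.execute("INSERT INTO change_order (change_id) VALUES (?)", (cid,))
    con.execute("INSERT INTO env_var_log VALUES (?, ?, ?)", (cid, name, value))

def undo():
    """Re-append the values recorded under the previous change id."""
    rows = con.execute(
        "SELECT change_id FROM change_order ORDER BY seq DESC LIMIT 2").fetchall()
    if len(rows) < 2:
        return  # nothing to undo
    prev_cid = rows[1][0]
    cid = str(uuid.uuid4())
    con.execute("INSERT INTO change_order (change_id) VALUES (?)", (cid,))
    con.execute("""
        INSERT INTO env_var_log (change_id, name, value)
        SELECT ?, name, value FROM env_var_log WHERE change_id = ?
    """, (cid, prev_cid))

write("PATH", "/usr/bin")
write("PATH", "/opt/bin")
undo()  # latest state is now the /usr/bin version again

print(con.execute(
    "SELECT value FROM env_var_log ORDER BY rowid DESC LIMIT 1").fetchone())
# -> ('/usr/bin',)
```

Because undo is itself an append, the foreign-key rows never have to be deleted — history only grows.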

    • I believe UUIDv7 does not (cannot) guarantee total order. I am using it as a method to manage multiple timelines---each "fact" lives on its own virtual timeline (identified by a "valid_id", which is a UUIDv7). My DB system does not automatically disambiguate conflicting facts (because it cannot --- it simply does not have the meta-information). The users of the system can manually audit and fix inconsistencies by making correcting (append-only) entries.

      For your case, I feel like your best bet would be to create the global order in the first place... i.e. force all writes through a single (and single-threaded) writer process, and fetch reads off one or more independent reader processes (I'm doing this with WAL-mode SQLite).

      But I could be totally wrong --- this is my first foray into the freaky world of temporal data systems :D
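A minimal sketch of the single-writer pattern suggested above: one dedicated writer thread serializes all appends (creating the global order for free), while readers use independent connections against the same WAL-mode SQLite file. Schema and names are illustrative; error handling omitted.

```python
# Sketch: single-writer, many-readers over WAL-mode SQLite.
import os
import queue
import sqlite3
import tempfile
import threading

db_path = os.path.join(tempfile.mkdtemp(), "facts.db")
write_q = queue.Queue()

def writer():
    """The only connection that ever writes; queue order = global order."""
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA journal_mode=WAL")
    con.execute("CREATE TABLE IF NOT EXISTS facts "
                "(seq INTEGER PRIMARY KEY, body TEXT)")
    while True:
        item = write_q.get()
        if item is None:  # shutdown sentinel
            break
        con.execute("INSERT INTO facts (body) VALUES (?)", (item,))
        con.commit()
    con.close()

t = threading.Thread(target=writer)
t.start()
for body in ("fact-1", "fact-2", "fact-3"):
    write_q.put(body)  # all writes funnel through the one writer
write_q.put(None)
t.join()

# Independent reader connections can read concurrently in WAL mode.
reader = sqlite3.connect(db_path)
print(reader.execute("SELECT seq, body FROM facts ORDER BY seq").fetchall())
# -> [(1, 'fact-1'), (2, 'fact-2'), (3, 'fact-3')]
```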
