
Comment by sandGorgon

5 years ago

Does everyone use a giant pickle dump? I mean, how big is that? Petabytes?

I'm kind of surprised nobody monkey patched python serialisation to use a database (much like GitHub did with ssh key lookup in MySQL).

What does the devops there look like? A snapshot every minute?

It's not a single giant pickle dump; each individual object gets pickled and stored in Minerva (which works more or less like Cassandra or something). It's a pretty similar high-level design to what the likes of Google or Facebook do, where you store everything as protobufs in BigTable. The bank uses pickle rather than protobuf because they put a higher priority on being able to store arbitrary objects and deal with robustness/compatibility later, rather than having to write a proto definition and a bunch of mapping code up front. You wouldn't want to use a relational database because they're not properly distributed (and, frankly, kind of bad and overrated).

The Minerva I worked on was temporal and append-only, like an HBase that never did compactions (so "delete" actually just writes a tombstone row at a particular timestamp; there was an "obliterate" command, but you needed special authorization to use it). It was also distributed (with availability zones, even), so you didn't really worry about losing data. Loading data as-of a particular timestamp was part of every query (and implemented efficiently). There were probably regular dumps somewhere too, but I never needed to encounter those.
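To make the "temporal and append-only" idea concrete, here is a minimal in-memory sketch. It is not Minerva's actual design or API: the `TemporalStore` class, its method names, the integer timestamps and the `trade/1` key are all made up for illustration. It only shows how a delete can be a tombstone row and how every read can take an as-of timestamp.

```python
import pickle
from bisect import bisect_right

_TOMBSTONE = object()  # hypothetical sentinel marking a "delete" row


class TemporalStore:
    """Toy append-only store: each key maps to a history of pickled values."""

    def __init__(self):
        self._timestamps = {}  # key -> sorted list of write timestamps
        self._payloads = {}    # key -> list of payloads, parallel to _timestamps

    def put(self, key, obj, ts):
        self._append(key, ts, pickle.dumps(obj))

    def delete(self, key, ts):
        # Nothing is removed; a tombstone row is appended at this timestamp.
        self._append(key, ts, _TOMBSTONE)

    def _append(self, key, ts, payload):
        stamps = self._timestamps.setdefault(key, [])
        if stamps and ts < stamps[-1]:
            raise ValueError("timestamps must be non-decreasing per key")
        stamps.append(ts)
        self._payloads.setdefault(key, []).append(payload)

    def get(self, key, as_of):
        """Return the value visible at time `as_of`, or None if absent or deleted."""
        stamps = self._timestamps.get(key, [])
        idx = bisect_right(stamps, as_of)  # number of rows written at or before as_of
        if idx == 0:
            return None
        payload = self._payloads[key][idx - 1]
        return None if payload is _TOMBSTONE else pickle.loads(payload)


store = TemporalStore()
store.put("trade/1", {"qty": 100}, ts=1)
store.put("trade/1", {"qty": 150}, ts=5)
store.delete("trade/1", ts=9)

print(store.get("trade/1", as_of=3))   # value written at ts=1
print(store.get("trade/1", as_of=7))   # value written at ts=5
print(store.get("trade/1", as_of=10))  # None: the tombstone at ts=9 hides the key
```

Because old rows are never rewritten, a read as-of an earlier timestamp still sees the value that was current then, even after the "delete".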

  • So Minerva is like a distributed datastore, specifically for Python object storage?

    Interesting. Do you think you would do this today with Cassandra/HBase? Can it be done with, let's say, Python 3.10 and the latest Cassandra (or even better, something like Firebase or Cloud Spanner)?

    Just curious whether, in a post-AWS/Firebase world, something like Minerva can be built without investing in writing the datastore from the ground up.

    • The incarnation of Minerva I worked on actually used Cassandra as its storage backend. But it's something that's not particularly useful piecemeal; the great value of Minerva is that all the bank's data is there and it's all temporal, all access-controlled and all the rest. The most fragile and cumbersome parts of Minerva are the parts where it integrates with an external/legacy datastore - but if you tried to introduce a Minerva-style datastore as a small piece in a system that was otherwise using a "normal" technology stack, those integrations would be most of what you made.

ZODB is the object-oriented database as a giant pickle dump. Surprisingly, it works and scales quite well. The downside is that non-Python tools cannot access it at all.

https://zodb.org/en/latest/

  • I learnt Python via Zope in 2000, and attended the Zope Conference in Python that year.

    Joined JPMorgan in 2010 to work on Athena, and immediately had a real sense of deja vu... Athena's Hydra object db (essentially an append-only KV store of pickles) felt like a great grandchild of Zope's ZODB.

  • I remember explaining our tech stack (Python and Zope) to clients.

    “Where is the code for that page?”

    “It’s in the database”

    “Oh… Like MySQL?”

    “No. It’s an object database”

    “???”

    I called it “Martian Technology Syndrome”. But it worked. At later stages we paid the price and had to serialize the datastore for migrations, but that’s what you get for relying on pickles.

I use pickle quite a lot for caching; a file read is almost always faster than a DB query.

For long-term persistent data? Seems very dangerous to me; even reading a pickle from, say, PyPy vs. a CPython interpreter corrupts the damn thing.
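For the caching use, a minimal sketch might look like the following. The `cached` helper, the `slow_query` stand-in and the file name are hypothetical, and the cache has no invalidation or expiry; it only shows the read-file-else-compute pattern.

```python
import pickle
import tempfile
from pathlib import Path


def cached(path, compute):
    """Return the pickled value at `path`, computing and storing it on a miss."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)  # only unpickle files you wrote yourself
    value = compute()
    with path.open("wb") as f:
        pickle.dump(value, f)
    return value


calls = []


def slow_query():
    calls.append(1)  # stands in for a database round trip
    return {"rows": [1, 2, 3]}


cache_file = Path(tempfile.mkdtemp()) / "query.pickle"
first = cached(cache_file, slow_query)   # miss: runs slow_query and writes the file
second = cached(cache_file, slow_query)  # hit: read straight from the pickle file
print(len(calls))
```

The second call never touches `slow_query`, which is the whole point of the cache.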

This works until...

Specifically, until you realize that pickle changes based on Python version, so updating from py3.x to py3.x+1 will prevent your application from reading previously stored data.

  • This is wrong. pickle can read old files just fine, and it lets you generate files in older pickle protocol versions if you need backwards compatibility with Python versions that predate the current protocol (the protocol version does not get increased with each Python release).
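The protocol point can be checked with a short sketch. The `record` value is arbitrary sample data; `pickle.dumps(..., protocol=...)` and `pickle.HIGHEST_PROTOCOL` are the standard-library names.

```python
import pickle

record = {"id": 42, "tags": ["a", "b"], "price": 9.5}

# Serialize the same object with every protocol this interpreter supports,
# from the oldest (0) up to the newest one available.
blobs = {
    proto: pickle.dumps(record, protocol=proto)
    for proto in range(pickle.HIGHEST_PROTOCOL + 1)
}

# loads() detects the protocol from the data itself; no version argument is needed.
restored = {proto: pickle.loads(blob) for proto, blob in blobs.items()}

for proto, value in restored.items():
    print(proto, value == record)
```

Writing with an explicit older protocol is what keeps the data readable by older interpreters; reading older protocols needs no special handling.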