Comment by Kalanos
2 years ago
TLDR; don't do it.
I've used SQLite blob fields for storing files extensively.
Note that there is a 2GB blob maximum: https://www.sqlite.org/limits.html
To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so serialization kept breaking as my dependencies upgraded.
As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
Also, if you want to interact with other systems/services, then you need files. sqlite can't be read over NFS (e.g. AWS EFS) and by design it has no server for requests. so i found myself caching files to disk for export/import.
SQLite has some settings for handling parallel requests from multiple services, but when I experimented with them I always wound up with a locked db due to competing requests.
For one reason or another, you will end up with hybrid (blob/file) ways of persisting data.
> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.
This is a silly argument, there's no reason to recreate the full hierarchy. If you have something like this:
Then you can do this:
This gets you everything in that path and everything in the subpaths (if you just want from the single folder, you can always just add a `directory` column). I benchmarked it using hyperfine on the Linux kernel source tree and a random deep folder: `/bin/ls` took ~1.5 milliseconds, the SQLite query took ~3.0 milliseconds (this is on a M1 MacBook Pro).
The reason it's fast is because the table has a UNIQUE index, and LIKE uses it if you turn off case-sensitivity. No need to faff about with hierarchies.
EDIT: btw, I am using SQLite for this purpose in a production application, couldn't be happier with it.
The postgres ltree extension does this beautifully. I'm not sure if similar things exist for sqlite or other rdbmses.
https://www.postgresql.org/docs/current/ltree.html
MSSQL has something similar but no, most other solutions involve recursive queries.
cool. i don't want to recreate a filesystem in my app logic
If that SELECT query is too much for you, I agree, SQLite is maybe not meant for you. Not a very solid argument against SQLite, though.
3 replies →
> To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so things kept breaking as my dependencies upgraded.
I'm confused what you mean by this. Files also only contain bytes, so that serialization/deserialization has to happen anyway?
The tools associated with every file type you support have to support reading/writing a buffer/bytestream or whatever it is called.
For example, `pd.read_parquet` accepts "file-like objects" as its first argument: https://pandas.pydata.org/docs/reference/api/pandas.read_par...
However, this is not the case for fringe tools
Maybe you need to encode bytes as text for sql?
Writing and reading bytes in sqlite is very easy: https://www.sqlite.org/c3ref/blob_open.html
> I've used SQLite blob fields for storing files extensively. Note that there is a 2GB blob maximum: https://www.sqlite.org/limits.html
Also note that SQLite does have an incremental blob I/O API (sqlite3_blob_xxx), so unlike most other RDBMS there is no need to read/write blobs as a contiguous piece of memory - handling large blobs is more reasonable than in those. Though the blob API is still separate from normal querying.
Do you have or know of a clear example of how to do this? I have to ask because I spent half of yesterday trying to make it work. The blob_open command wouldn't work until I set a default value on the blob column and then the blob_write command wouldn't work because you can't resize a blob. It was very weird but I'm pretty confident it's because I'm missing something stupid.
I’m pretty sure you can’t resize blobs using this API. The intended usage is to insert/update a row using bind_zeroblob and then update that in place (sans journaling) using the incremental API. It’s a major limitation especially for writing compressed data.
1 reply →
Most RDBMS's have streaming blob apis:
MS SQL Server: READTEXT, WRITETEXT, substring, UPDATE.WRITE
Oracle: DBMS_LOB.READ, DBMS_LOB.WRITE
PG: Large Objects
Most of my experience is with SQL server and it can stream large objects incrementally through a web app to browser without loading the whole thing into memory at 100's Mbytes/sec on normal hardware.
I wasn’t aware of PG Large Objects, though I do know about MSSQL Filestream, which iirc are actual files on disk and remotely transparently accessed via SMB - which in this context would be the moral equivalent of storing just the path in a SQLite db. The functionality you quote seems to be deprecated?
1 reply →
I wish the API was compatible with iovec though. As that's what all the c standard lib APIs use for non-contiguous memory
That sounds nice for chunking, but what if you need contiguous memory? E.g. viewing an image or running an AI model
You can always use a chunked API to read data into a contiguous buffer. Just allocate a large enough contiguous block of memory, then copy chunk by chunk into that contiguous memory until you've read everything.
> so i found myself caching files to disk for export/import
Could use a named pipe.
I’m reminded of what I often do at the shell with psub in fish. psub -f creates and returns the path to a fifo/named pipe in $TMPDIR and writes stdin to that; you’ve got a path but aren’t writing to the filesystem.
e.g. you want to feed some output to something that takes file paths as arguments. We want to compare cmd1 | grep foo and cmd2 | grep foo. We pipe each to psub in command substitutions:
which expands to something like
As long as the tool doesn’t seek around the file. (caveats are numerous enough that without -f, psub uses regular files.)
> I’m reminded of what I often do at the shell with psub in fish.
ksh and bash too have this as <(…) and >(…) under Process Substitution.
An example from ksh(1) man page:
bash (at least) has a built-in mechanism to do that
diff <(cmd1 | grep foo) <(cmd2 | grep foo)
> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path in sqlite and work with it in Python. E.g. `os.listdir(my_folder)`
This makes total sense and it is also "frowned upon" by people who take a too purist view of databases
(Until it comes a time to backup, or extract files, or grow a hard drive etc and then you figure out how you shot yourself in the foot)
To make it more queryable, you can have different classes for dataset types with metadata like: file_format, num_files, sizes
> As my app matured, I found that I often wanted hierarchical folder-like functionality.
In the process of prototyping some "remote" collaborating file systems, I always wonder whether it is a good idea maintaining a flat map from path concatenated with "/" like an S3 to the file content, in term of efficiency or elegancy.
> As my app matured, I found that I often wanted hierarchical folder-like functionality
(1) Slim table "items"
- id / parent_id / kind (0/1 file folder) integer
- name text
- Maybe metadata.
(2) Separate table "content"
- id integer
- data blob
There you have file-system-like structure and fast access times (don't mix content in the first table)
Or, if you wish for deduplication or compression, add item_content (3)
Can you describe how you stored the paths in sqlite? I'm not entirely getting it.
just a string field that points to the file path
The idea to emulate hierarchical folder-like functionality ala filepaths is quite brilliant - I might try it out.
Storing Hierarchical Data in Relational Databases
https://medium.com/@rishabhdevmanu/from-trees-to-tables-stor...