
Comment by Kalanos

2 years ago

TL;DR: don't do it.

I've used SQLite blob fields for storing files extensively.

Note that there is a 2GB blob maximum: https://www.sqlite.org/limits.html

To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so serialization kept breaking as my dependencies upgraded.
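
For concreteness, the round trip looks roughly like this (using pickle as a stand-in serializer; the actual library will vary by object type):

    import pickle
    import sqlite3

    conn = sqlite3.connect("store.db")
    conn.execute("CREATE TABLE IF NOT EXISTS objects (name TEXT PRIMARY KEY, data BLOB)")

    obj = {"weights": [0.1, 0.2], "labels": ["a", "b"]}  # any picklable Python object

    # serialize to bytes on the way in...
    conn.execute("INSERT OR REPLACE INTO objects VALUES (?, ?)", ("model", pickle.dumps(obj)))
    conn.commit()

    # ...and deserialize on the way back out
    row = conn.execute("SELECT data FROM objects WHERE name = ?", ("model",)).fetchone()
    restored = pickle.loads(row[0])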

As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.

Also, if you want to interact with other systems/services, then you need files. SQLite can't be used safely over NFS (e.g. AWS EFS), and by design it has no server to handle requests, so I found myself caching files to disk for export/import.

SQLite has some settings for handling parallel requests from multiple services, but when I experimented with them I always wound up with a locked db due to competing requests.
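
For reference, the settings in question are typically WAL journal mode plus a busy timeout, set roughly like this from Python:

    import sqlite3

    # timeout= makes this connection wait up to 5 s for a lock
    # instead of failing immediately with "database is locked"
    conn = sqlite3.connect("app.db", timeout=5.0)

    # WAL lets readers proceed while a single writer is active
    conn.execute("PRAGMA journal_mode=WAL")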

For one reason or another, you will end up with hybrid (blob/file) ways of persisting data.

> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path and other folder-level metadata in sqlite so that I could work with it in Python. E.g. `os.listdir(my_folder)`.

This is a silly argument; there's no reason to recreate the full hierarchy. If you have something like this:

    CREATE TABLE files (path TEXT UNIQUE COLLATE NOCASE);

Then you can do this:

    SELECT path FROM files WHERE path LIKE './some/path/%';

This gets you everything in that path and everything in the subpaths (if you just want from the single folder, you can always just add a `directory` column). I benchmarked it using hyperfine on the Linux kernel source tree and a random deep folder: `/bin/ls` took ~1.5 milliseconds, the SQLite query took ~3.0 milliseconds (this is on an M1 MacBook Pro).

It's fast because the table has a UNIQUE index, and the prefix LIKE can use that index when the column is declared COLLATE NOCASE (matching LIKE's default case-insensitivity). No need to faff about with hierarchies.
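
If you want to verify that the index is actually used, EXPLAIN QUERY PLAN will show it; a quick sketch with invented contents:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT UNIQUE COLLATE NOCASE)")
    conn.executemany("INSERT INTO files VALUES (?)",
                     [("./some/path/a.c",), ("./some/path/sub/b.c",), ("./other/c.c",)])

    # the prefix LIKE is rewritten into a range scan over the UNIQUE index
    plan = conn.execute("EXPLAIN QUERY PLAN "
                        "SELECT path FROM files WHERE path LIKE './some/path/%'").fetchall()
    print(plan)  # expect a SEARCH ... USING COVERING INDEX entry, not a full SCAN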

EDIT: btw, I am using SQLite for this purpose in a production application, couldn't be happier with it.

> To read/write blobs, you have to serialize/deserialize your objects to bytes. This process is not only tedious, but also varies for different objects and it's not a first-class citizen in other tools, so things kept breaking as my dependencies upgraded.

I'm confused what you mean by this. Files also only contain bytes, so that serialization/deserialization has to happen anyway?

> I've used SQLite blob fields for storing files extensively. Note that there is a 2GB blob maximum: https://www.sqlite.org/limits.html

Also note that SQLite does have an incremental blob I/O API (sqlite3_blob_xxx), so unlike in most other RDBMSes there is no need to read/write a blob as one contiguous piece of memory; handling large blobs is more reasonable than in those. The blob API is still separate from normal querying, though.
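
In Python 3.11+ this is exposed as Connection.blobopen(); a minimal sketch of the usual pattern (reserve space with zeroblob(), then write in place; table name and sizes are invented):

    import sqlite3

    conn = sqlite3.connect("files.db")
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, data BLOB)")

    payload = b"x" * (16 * 1024 * 1024)  # stand-in for a large file

    # reserve the full size up front; incremental blob I/O cannot grow a blob later
    conn.execute("INSERT INTO docs (id, data) VALUES (1, zeroblob(?))", (len(payload),))

    # then write it chunk by chunk instead of binding one huge parameter
    with conn.blobopen("docs", "data", 1) as blob:
        for off in range(0, len(payload), 1024 * 1024):
            blob.write(payload[off:off + 1024 * 1024])
    conn.commit()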

  • Do you have or know of a clear example of how to do this? I have to ask because I spent half of yesterday trying to make it work. The blob_open command wouldn't work until I set a default value on the blob column and then the blob_write command wouldn't work because you can't resize a blob. It was very weird but I'm pretty confident it's because I'm missing something stupid.

    • I’m pretty sure you can’t resize blobs using this API. The intended usage is to insert/update a row using bind_zeroblob and then update that in place (sans journaling) using the incremental API. It’s a major limitation especially for writing compressed data.


  • Most RDBMSes have streaming blob APIs:

    MS SQL Server: READTEXT, WRITETEXT, substring, UPDATE.WRITE

    Oracle: DBMS_LOB.READ, DBMS_LOB.WRITE

    PG: Large Objects

    Most of my experience is with SQL Server; it can stream large objects incrementally through a web app to a browser, without loading the whole thing into memory, at hundreds of MB/sec on normal hardware.

    • I wasn’t aware of PG Large Objects, though I do know about MSSQL FILESTREAM, which IIRC stores actual files on disk that are transparently accessed remotely via SMB; in this context that would be the moral equivalent of storing just the path in a SQLite db. The functionality you quote seems to be deprecated?


  • I wish the API were compatible with iovec, though, as that's what the C standard library APIs use for non-contiguous memory.

  • That sounds nice for chunking, but what if you need contiguous memory? E.g. viewing an image or running an AI model.

    • You can always use a chunked API to read data into a contiguous buffer. Just allocate a large enough contiguous block of memory, then copy chunk by chunk into that contiguous memory until you've read everything.
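
      A rough sketch of that with Python 3.11+'s sqlite3, assuming a docs(id, data) table whose row 1 holds the blob:

          import sqlite3

          conn = sqlite3.connect("files.db")
          with conn.blobopen("docs", "data", 1, readonly=True) as blob:
              buf = bytearray(len(blob))   # one contiguous allocation up front
              chunk, off = 1024 * 1024, 0
              while off < len(buf):
                  piece = blob.read(chunk)  # next chunk at the current offset
                  buf[off:off + len(piece)] = piece
                  off += len(piece)
          data = bytes(buf)  # contiguous bytes, ready for e.g. an image decoder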

> so i found myself caching files to disk for export/import

Could use a named pipe.

I’m reminded of what I often do at the shell with psub in fish. psub -f creates a fifo/named pipe in $TMPDIR, returns its path, and writes stdin to it; you’ve got a path, but no file data is written to disk.

E.g. you want to feed some output to something that takes file paths as arguments. Say we want to compare cmd1 | grep foo and cmd2 | grep foo. We pipe each to psub in a command substitution:

    diff -u $(cmd1 | grep foo | psub -f) $(cmd2 | grep foo | psub -f)

which expands to something like

    diff -u /tmp/fish0K5fd.psub /tmp/fish0hE1c.psub

This works as long as the tool doesn’t seek around in the file. (The caveats are numerous enough that, without -f, psub uses regular files.)

  • > I’m reminded of what I often do at the shell with psub in fish.

    ksh and bash have this too, as <(…) and >(…), under Process Substitution.

    An example from ksh(1) man page:

        paste <(cut -f1 file1) <(cut -f3 file2) | tee >(process1) >(process2)

  • bash (at least) has a built-in mechanism to do that

        diff <(cmd1 | grep foo) <(cmd2 | grep foo)

> As my app matured, I found that I often wanted hierarchical folder-like functionality. Rather than recreating this mess in db relationships, it was easier to store the path in sqlite and work with it in Python. E.g. `os.listdir(my_folder)`

This makes total sense, and it is also "frowned upon" by people who take too purist a view of databases.

(Until the time comes to back up, extract files, grow a hard drive, etc., and then you figure out how you shot yourself in the foot.)

  • To make it more queryable, you can have different classes for dataset types, with metadata like file_format, num_files, and sizes.

> As my app matured, I found that I often wanted hierarchical folder-like functionality.

While prototyping some "remote" collaborative file systems, I always wonder whether it is a good idea to maintain a flat map from "/"-joined paths (as S3 does) to file content, in terms of efficiency or elegance.

> As my app matured, I found that I often wanted hierarchical folder-like functionality

(1) Slim table "items"

- id / parent_id / kind (0 = file, 1 = folder) integer

- name text

- Maybe metadata.

(2) Separate table "content"

- id integer

- data blob

There you have a file-system-like structure and fast access times (don't mix the content into the first table).

Or, if you want deduplication or compression, add a third table, item_content (3); a sketch follows below.
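
A rough sketch of that schema (the UNIQUE constraint and foreign keys are extra assumptions):

    import sqlite3

    conn = sqlite3.connect("tree.db")
    conn.executescript("""
    CREATE TABLE items (                         -- (1) slim structure table
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES items(id),  -- NULL for the root
        kind      INTEGER NOT NULL,              -- 0 = file, 1 = folder
        name      TEXT NOT NULL,
        UNIQUE (parent_id, name)
    );
    CREATE TABLE content (                       -- (2) blobs kept in their own table
        id   INTEGER PRIMARY KEY,
        data BLOB
    );
    -- (3) optional mapping table for deduplication (many items -> one content row)
    CREATE TABLE item_content (
        item_id    INTEGER PRIMARY KEY REFERENCES items(id),
        content_id INTEGER REFERENCES content(id)
    );
    """)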