Comment by gethly
11 days ago
I've done a few formats myself. Nothing complicated. But once you do one, all the others are essentially the same. You need the length of the data, the data itself, and then likely a version and magic bytes for identification purposes. With those few details you can do essentially anything.
For example, one format I use just concatenates multiple files into a single one; I use it to group video timeline seeker images into one file - it is faster than using an archive or tar/gzip. Another is a format that concatenates AES-GCM chunks into a single file, which tolerates interrupted writes and also supports seeking and streaming reads.
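A rough sketch of the concatenation idea in Go (the magic bytes, version byte, and 8-byte big-endian length prefix here are just for illustration, not the exact layout I use):

```go
package bundle

import (
	"encoding/binary"
	"io"
	"os"
)

// Illustrative header: 4 magic bytes plus a version byte, written once,
// then each file as an 8-byte big-endian length followed by its data.
var magic = []byte{'B', 'N', 'D', 'L'}

func writeBundle(w io.Writer, files []string) error {
	if _, err := w.Write(append(magic, 1)); err != nil { // magic + version 1
		return err
	}
	for _, name := range files {
		data, err := os.ReadFile(name)
		if err != nil {
			return err
		}
		var length [8]byte
		binary.BigEndian.PutUint64(length[:], uint64(len(data)))
		if _, err := w.Write(length[:]); err != nil {
			return err
		}
		if _, err := w.Write(data); err != nil {
			return err
		}
	}
	return nil
}
```

Reading it back is the mirror image: read 8 bytes, decode the length, read that many bytes, repeat until EOF.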
These things are quite useful, but they have no general use (like gzip/tar do). Usually some specific functionality is needed, so they always have to be written from scratch.
> For example, one format I use just concatenates multiple files into a single one; I use it to group video timeline seeker images into one file - it is faster than using an archive or tar/gzip
I did something like this when I was moving my files onto a new computer about 25 years ago, when all I had was a floppy drive: just continuously dump data onto a floppy until space runs out, then ask for another one, until there are no more files.
This almost IS the tar format. It's just a 512-byte header with metadata per file, then the file data. Repeat for each file. The cpio format is similar, but the header is shorter. Details of the header contents vary, hence the different flavours. And I believe POSIX added extensible extra metadata fields that are saved as a kind of pseudo-file.
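For illustration, here is a Go sketch of pulling the two fields you need out of a classic tar header block (offsets are from the old v7/ustar layout; this is not a full tar reader):

```go
package tarlike

import (
	"bytes"
	"strconv"
)

// parseHeader extracts the file name (offset 0, 100 bytes) and the size
// (offset 124, 12 bytes of NUL/space-padded octal ASCII). The file data
// follows the header, padded up to the next 512-byte boundary, and then
// the next header repeats.
func parseHeader(block [512]byte) (name string, size int64, err error) {
	name = string(bytes.TrimRight(block[0:100], "\x00"))
	sizeField := bytes.Trim(block[124:136], " \x00")
	size, err = strconv.ParseInt(string(sizeField), 8, 64)
	return
}
```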
Floppy disks... ah, good times :)
I wouldn't expect video timeline seeking to be all that performance-critical; I'd think you could use SQLite with indexes, since you only need a small number of images at a time and they're probably pretty low resolution, right?
I'd buy the AES-GCM chunks one for a dollar!
I spent quite a lot of time on that one, for obvious reasons. But in general it is not too hard. GCM is a block-based cipher mode with a built-in integrity check, unlike CTR, which is a streaming one. So all you need is a fixed block size where you store the header and the data. The nonce is 12 bytes and the GCM tag is 16 bytes, so that is a fixed 28 bytes of overhead per chunk. After some experimenting, a 64 KB block size seemed to work best, despite being quite a large chunk of data. And since you then know you have exactly 64 KB of data in each chunk, you just stack them one after another.

The hard part is handling reads: you need to know which chunk to seek to, decrypt it, and then seek to the correct position within it to stream/read the right data. And once you reach the end of that chunk, you move on to the next one. It is a bit tricky but perfectly doable, and it has been working for me for probably 3 years now. One caveat is to properly handle the last chunk, as it will not be a full 64 KB but whatever was left in the buffer when the data ended. This is important for appending to existing files.
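To make the seeking arithmetic concrete, here is a simplified Go sketch of the read path (assuming each chunk is stored as nonce || ciphertext || tag; this is an illustration, not my production code):

```go
package cryptfile

import (
	"crypto/aes"
	"crypto/cipher"
	"os"
)

const (
	plainSize = 64 * 1024                       // 64 KB of plaintext per chunk
	nonceSize = 12                              // GCM nonce
	tagSize   = 16                              // GCM auth tag
	diskSize  = nonceSize + plainSize + tagSize // full chunk on disk
)

// readAt decrypts the chunks covering a logical plaintext offset and
// returns up to n bytes from that offset. Chunks are stacked back to
// back, each stored as nonce || ciphertext || tag.
func readAt(f *os.File, key []byte, off int64, n int) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}

	chunk := off / plainSize  // which chunk the offset falls into
	within := off % plainSize // position inside that chunk's plaintext

	out := make([]byte, 0, n)
	buf := make([]byte, diskSize)
	for len(out) < n {
		read, _ := f.ReadAt(buf, chunk*diskSize)
		if read < nonceSize+tagSize {
			break // no more chunks
		}
		// The last chunk may be shorter than diskSize; decrypt what is there.
		plain, err := gcm.Open(nil, buf[:nonceSize], buf[nonceSize:read], nil)
		if err != nil {
			return nil, err // corrupted or truncated chunk
		}
		if within >= int64(len(plain)) {
			break // offset is past the end of the file
		}
		out = append(out, plain[within:]...)
		within = 0
		chunk++
	}
	if len(out) > n {
		out = out[:n]
	}
	return out, nil
}
```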
I've just been re-encrypting to CTR and streaming from that. You can stream OK from a big, single GCM file, but random access has to be faked by always restarting at 0...