Comment by marginalia_nu

Zip with no compression is a nice contender for a container format that shouldn't be slept on. It effectively reduces the I/O while, unlike TAR, allowing direct random access to the files without "extracting" them or seeking through the entire file; this works even via mmap, over HTTP range queries, etc.

You can still get the compression benefits by serving files with Content-Encoding: gzip or whatever. Though ZIP has built-in compression, you can simply not use it and apply external compression instead, especially over the wire.
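To make the "no compression" variant concrete, here's a minimal Python sketch (file names are hypothetical): members are written with ZIP_STORED, and a single one is read back without touching the rest of the archive.

```python
import zipfile

# Pack files uncompressed (ZIP_STORED) so members can later be read, mapped,
# or range-requested in place without decompressing anything.
with zipfile.ZipFile("bundle.zip", "w", compression=zipfile.ZIP_STORED) as zf:
    zf.write("model.bin")
    zf.write("config.json")

# Random access to one member: only the central directory at the end of the
# archive and this member's local header are consulted, not the whole file.
with zipfile.ZipFile("bundle.zip") as zf:
    with zf.open("config.json") as f:
        first_bytes = f.read(64)
```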

It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.

I think the article's complaints about lacking unix access rights and metadata are a bit strange. That seems like a feature more than a bug, as I wouldn't expect this to be something that transfers between machines. I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.

Isn't this what is already common in the Python community?

> I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.

I'm the opposite: when I pack and unpack something, I want the files to be identical, including attributes. Why should I throw away all the timestamps, just because the files were temporarily in an archive?

  • There is some confusion here.

    ZIP retains timestamps. This makes sense because timestamps are a global concept. Consider them an attribute that depends only on the file itself, similar to the file's name.

    Owners and permissions also depend on the computer the files are stored on. User "john" might have a different user ID on another computer, or not exist there at all, or be a different John. So there isn't one obvious way to handle this, while there is with timestamps. Archiving tools have to pick a particular way of handling it, so you need to pick the tool that implements the specific way you want.

    • > ZIP retains timestamps.

      It does, but unless the 'zip' archive creator being used makes use of the extensions for high-resolution timestamps, the basic ZIP format retains only old MS-DOS-style timestamps (rounded to the nearest two seconds). So one may lose some precision in one's timestamps when passing files through a zip archive.
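For the curious, that DOS-style timestamp is easy to inspect from Python; a small sketch (archive name hypothetical):

```python
import zipfile

# ZipInfo.date_time is the classic DOS timestamp: a (year, month, day, hour,
# minute, second) tuple with 2-second resolution, so the seconds are even
# unless a high-resolution extra field was written and honored.
with zipfile.ZipFile("bundle.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.date_time)
```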

  • > Why should I throw away all the timestamps, just because the files were temporarily in an archive?

    In case anyone is unaware, you don't have to throw away all the timestamps when using "zip with no compression". The metadata for each zipped file includes one timestamp (originally rounded to an even number of seconds in local time).

    I am a big last modified timestamp fan and am often discouraged that scp, git, and even many zip utilities are not (at least by default).

    • git updates timestamps in part by necessity of compatibility with build systems. If it applied the timestamp of when the file was last modified on checkout then most build systems would break if you checked out an older commit.

  • > Isn't this what is already common in the Python community?

    I'm not aware of standards language mandating it, but build tools generally do compress wheels and sdists.

    If you're thinking of zipapps, those are not actually common.

  • Yes, it's a lossy process.

    If your archive drops it you can't get it back.

    If you don't want it, you can just `chmod -R u=rw,go=r,a-x`.

This is how Haiku packages are managed: from the outside it's a single zstd file; internally, all dependencies and files are included in a read-only file. Reduces I/O, reduces file clutter, instant install/uninstall, zero chance for the user to corrupt files or dependencies, and easy to switch between versions. The Haiku file system also supports virtual dir mapping, so the stubborn Linux port thinks it's talking to /usr/local/lib, but in reality it's part of the zstd file in /system/packages.

Strangely enough, there is a tool out there that gives Zip-like functionality while preserving Tar's metadata handling, yet nobody uses it. It even has extra archiving features like binary deltas: dar (Disk ARchive) http://dar.linux.free.fr/

  • You mean ZIP?

    Zip has 2 tricks: First, compression is per-file, allowing extraction of single files without decompressing anything else.

    Second, the "directory" is at the end, not the beginning, and the trailing record stores the offset of the start of the directory. Meaning 2 disk seeks (which matters even on SSDs) and you can show the user all files.

    Then you know exactly what bytes belong to what file and everything's fast. Also, you can easily strip the directory off the zip file, allowing new files to be appended without modifying the rest of the file, which can be extended to allow arbitrary modification of the contents, although you may need to "defragment" the file.

    And I believe encryption is also per-file, meaning that to decrypt a file you need both the password and the directory entry. So if you delete a file and rewrite just the directory, the data is unrecoverable without requiring a total rewrite of the bytes.
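The trailing record described above is small enough to parse by hand; a rough Python sketch (assumes a non-ZIP64 archive with at most a 64 KiB comment):

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # End of Central Directory signature

def find_central_directory(path):
    """Return (entry_count, offset_of_central_directory) for a ZIP file."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        f.seek(max(0, size - 65536 - 22))  # EOCD record is at least 22 bytes
        tail = f.read()
    i = tail.rfind(EOCD_SIG)
    if i < 0:
        raise ValueError("no EOCD record found; not a ZIP file?")
    (_disk, _cd_disk, _n_this_disk, n_total,
     _cd_size, cd_offset, _comment_len) = struct.unpack_from("<HHHHIIH", tail, i + 4)
    return n_total, cd_offset
```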

> Zip with no compression is a nice contender for a container format that shouldn't be slept on

SquashFS with zstd compression is used by various container runtimes, and is popular in HPC where filesystems often have high latency. It can be mounted natively or with FUSE, and the decompression overhead is not really felt.

  • Just make sure you mount the squashfs with --direct-io or else you will be double caching (caching the sqfs pages, and caching the uncompressed files within the sqfs). I have no idea why this isn't the default. Found this out the hard way.

Gzip will make most line protocols efficient enough that you can do away with needing to write a cryptic one that will just end up being friction every time someone has to triage a production issue. Zstd will do even better.

The real one-two punch is to make your parser faster and then spend the CPU cycles on better compression.
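A quick, hypothetical illustration of that trade-off in Python: a chatty JSON-lines protocol shrinks dramatically under gzip, which is often enough to stop a cryptic binary format from paying for itself.

```python
import gzip
import json

# 10,000 verbose, human-readable protocol lines.
lines = "".join(
    json.dumps({"ts": 1700000000 + i, "sensor": "temp", "value": 20.0 + i % 7}) + "\n"
    for i in range(10_000)
).encode()

print("raw bytes: ", len(lines))
print("gzip bytes:", len(gzip.compress(lines)))  # typically a small fraction of raw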

> It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.

JAR files generally do/did use compression, though. I imagine you could forgo it, but I didn't see it being done. (But maybe that was specific to the J2ME world where it was more necessary?)

  • Specifically the benefit is for the native libraries within the file as you can map the library directly to memory instead of having to make a decompressed copy and then mapping that copy to memory.

    • Yes, that's clear. I'm just not aware of people actually doing that, or having done it back in the era when Java was more dominant.

      1 reply →

Doesn’t ZIP have all the metadata at the end of the file, requiring some seeking still?

  • It has an index at the end of the file, yeah, but once you've read that bit, you learn where the contents are located and if compression is disabled, you can e.g. memory map them.

    With tar you need to scan the entire file start-to-finish before you know where the data is located, as it's literally a tape archiving format, designed for a storage medium with no random access reads.
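A rough Python sketch of that pattern (member name hypothetical; assumes a stored, unencrypted, non-ZIP64 entry): read the directory, then map the member's bytes in place.

```python
import mmap
import struct
import zipfile

def map_stored_member(path, member):
    """Memory-map an uncompressed ("stored") ZIP member without extracting it."""
    with zipfile.ZipFile(path) as zf:
        info = zf.getinfo(member)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed; it cannot be mapped in place")
    with open(path, "rb") as f:
        # The local file header is 30 fixed bytes plus a file name and an
        # extra field; their lengths live at offsets 26 and 28 of the header.
        f.seek(info.header_offset + 26)
        name_len, extra_len = struct.unpack("<HH", f.read(4))
        data_start = info.header_offset + 30 + name_len + extra_len
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # survives close()
    return memoryview(mm)[data_start:data_start + info.file_size]
```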

> I wouldn't expect this to be something that transfers between machines

Maybe non-UNIX machines I suppose.

But I 100% need executable files to be executable.

  • This seems like something that shouldn't be the container format's responsibility. You can record arbitrary metadata and put it in a file in the container, so it's trivial to layer on top.

    On the other hand, tie the container structure to your OS metadata structure, and your (hopefully good) container format is now stuck with portability issues between other OSes that don't have the same metadata layout, as well as your own OS in the past & future.

  • Honestly, sometimes I just want to mark all files on a Linux system as executable and see what would even break and why. Seriously, why is there a whole bit for something that's essentially a 'read permission, but you can also directly execute it from the shell'?

I thought Tar had an extension to add an index, but I can't find it in the Wikipedia article. Maybe I dreamt it.

  • You might be thinking of ar, the classic Unix ARchive that is used for static libraries?

    The format used by `ar` is quite simple, somewhat like tar: files glued together, with a short header in between and no index.

    Early Unix eventually introduced a program called `ranlib` that generates and appends an index for libraries (also containing extracted symbols) to speed up linking. The index is simply embedded as a file with a special name.

    The GNU version of `ar` as well as some later Unix descendants support doing that directly instead.
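For a sense of how simple the classic layout is, a rough Python walker (skips GNU long-name handling):

```python
def list_ar_members(path):
    """Walk a classic `ar` archive: a magic string, then for each member a
    60-byte text header followed by its data, padded to an even offset."""
    with open(path, "rb") as f:
        if f.read(8) != b"!<arch>\n":
            raise ValueError("not an ar archive")
        while True:
            header = f.read(60)
            if len(header) < 60:
                break
            name = header[0:16].decode("ascii").rstrip(" /")
            size = int(header[48:58])
            print(name, size)
            f.seek(size + (size % 2), 1)  # member data is 2-byte aligned
```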

  • Besides `ar`, as a sibling observed, you might also be thinking of pixz - https://github.com/vasi/pixz , but really any archive format (cpio, etc.) can, in principle, just put a stake in the ground to have its last file be any kind of binary/whatever index or directory file, like Zip. Or it could hog a special name like .__META_INF__ instead.

> It effectively reduces the I/O while, unlike TAR, allowing direct random access to the files without "extracting" them or seeking through the entire file

How do you access a particular file without seeking through the entire file? You can't know where anything is without first seeking through the whole file.

  • You look at the end of the file which tells you where the central directory is. The directory tells you where individual files are.

  • At the end of the ZIP file, there's a central directory of all files contained in that archive. Read the last block, seek to the block containing the file you want to access, done
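That same trick works over the network, which is what the "HTTP range queries" remark upthread is getting at. A rough Python sketch (URL hypothetical; assumes the server honors Range and the central directory fits in the tail):

```python
import io
import urllib.request
import zipfile

def list_remote_zip(url, tail_bytes=64 * 1024):
    """List a remote ZIP's members by fetching only the end of the file."""
    head = urllib.request.Request(url, method="HEAD")
    size = int(urllib.request.urlopen(head).headers["Content-Length"])
    start = max(0, size - tail_bytes)
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-"})
    tail = urllib.request.urlopen(req).read()
    # Pad the front so absolute offsets in the central directory still line up.
    # (Fine for a demo; a real client would wrap this in a lazy, seekable file.)
    buf = io.BytesIO(b"\x00" * start + tail)
    return zipfile.ZipFile(buf).namelist()
```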