Comment by marginalia_nu

Zip with no compression is a nice contender for a container format that shouldn't be slept on. It effectively reduces the I/O while, unlike TAR, allowing direct random access to the files without "extracting" them or seeking through the entire file; this works even via mmap, over HTTP range queries, etc.

You can still get the compression benefits by serving files with Content-Encoding: gzip or whatever. Though ZIP has built-in compression, you can simply not use it and apply external compression instead, especially over the wire.
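To make the "no compression" variant concrete, here's a minimal Python sketch (file names are hypothetical): members are written with ZIP_STORED, and a single one is read back without touching the rest of the archive.

```python
import zipfile

# Pack files uncompressed (ZIP_STORED) so members can later be read, mapped,
# or range-requested in place without decompressing anything.
with zipfile.ZipFile("bundle.zip", "w", compression=zipfile.ZIP_STORED) as zf:
    zf.write("model.bin")
    zf.write("config.json")

# Random access to one member: only the central directory at the end of the
# archive and this member's local header are consulted, not the whole file.
with zipfile.ZipFile("bundle.zip") as zf:
    with zf.open("config.json") as f:
        first_bytes = f.read(64)
```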

It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.

I think the article's complaints about lacking unix access rights and metadata are a bit strange. That seems like a feature more than a bug, as I wouldn't expect this to be something that transfers between machines. I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.

Isn't this what is already common in the Python community?

> I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.

I'm the opposite: when I pack and unpack something, I want the files to be identical, including attributes. Why should I throw away all the timestamps, just because the files were temporarily in an archive?

  • There is some confusion here.

    ZIP retains timestamps. This makes sense because timestamps are a global concept. Consider them an attribute that depends only on the file itself, similar to the file's name.

    Owners and permissions also depend on the computer the files are stored on. User "john" might have a different user ID on another computer, or not exist there at all, or be a different John. So there isn't one obvious way to handle this, while there is with timestamps. Archiving tools have to pick a particular way of handling it, so you need to pick the tool that implements the specific way you want.

    • > ZIP retains timestamps.

      It does, but unless the 'zip' archive creator being used makes use of the extensions for high-resolution timestamps, the basic ZIP format retains only old MS-DOS-style timestamps (rounded to the nearest two seconds). So one may lose some precision in one's timestamps when passing files through a zip archive.
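For the curious, that DOS-style timestamp is easy to inspect from Python; a small sketch (archive name hypothetical):

```python
import zipfile

# ZipInfo.date_time is the classic DOS timestamp: a (year, month, day, hour,
# minute, second) tuple with 2-second resolution, so the seconds are even
# unless a high-resolution extra field was written and honored.
with zipfile.ZipFile("bundle.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.date_time)
```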

  • > Why should I throw away all the timestamps, just because the files were temporarily in an archive?

    In case anyone is unaware, you don't have to throw away all the timestamps when using "zip with no compression". The metadata for each zipped file includes one timestamp (originally rounded to an even number of seconds in local time).

    I am a big last modified timestamp fan and am often discouraged that scp, git, and even many zip utilities are not (at least by default).

    • git updates timestamps in part by necessity of compatibility with build systems. If it applied the timestamp of when the file was last modified on checkout then most build systems would break if you checked out an older commit.

  • > Isn't this what is already common in the Python community?

    I'm not aware of standards language mandating it, but build tools generally do compress wheels and sdists.

    If you're thinking of zipapps, those are not actually common.

  • Yes, it's a lossy process.

    If your archive drops it you can't get it back.

    If you don't want it, you can just `chmod -R u=rw,go=r,a-x`.

This is how Haiku packages are managed: from the outside it's a single zstd file; internally, all dependencies and files are included in a read-only file. Reduces I/O, reduces file clutter, instant install/uninstall, zero chance for the user to corrupt files or dependencies, and easy to switch between versions. The Haiku file system also supports virtual dir mapping, so the stubborn Linux port thinks it's talking to /usr/local/lib, but in reality it's part of the zstd file in /system/packages.

Strangely enough, there is a tool out there that gives Zip-like functionality while preserving Tar's metadata handling, yet nobody uses it. It even has extra archiving features like binary deltas: dar (Disk ARchive) http://dar.linux.free.fr/

  • You mean ZIP?

    Zip has 2 tricks: First, compression is per-file, allowing extraction of single files without decompressing anything else.

    Second, the "directory" is at the end, not the beginning, and the trailing record stores the offset of the start of the directory. Meaning 2 disk seeks (which matters even on SSDs) and you can show the user all files.

    Then you know exactly what bytes belong to what file and everything's fast. Also, you can easily strip the directory off the zip file, allowing new files to be appended without modifying the rest of the file, which can be extended to allow arbitrary modification of the contents, although you may need to "defragment" the file.

    And I believe encryption is also per-file, meaning that to decrypt a file you need both the password and the directory entry. So if you delete a file and rewrite just the directory, the data is unrecoverable without requiring a total rewrite of the bytes.
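The trailing record described above is small enough to parse by hand; a rough Python sketch (assumes a non-ZIP64 archive with at most a 64 KiB comment):

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # End of Central Directory signature

def find_central_directory(path):
    """Return (entry_count, offset_of_central_directory) for a ZIP file."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        f.seek(max(0, size - 65536 - 22))  # EOCD record is at least 22 bytes
        tail = f.read()
    i = tail.rfind(EOCD_SIG)
    if i < 0:
        raise ValueError("no EOCD record found; not a ZIP file?")
    (_disk, _cd_disk, _n_this_disk, n_total,
     _cd_size, cd_offset, _comment_len) = struct.unpack_from("<HHHHIIH", tail, i + 4)
    return n_total, cd_offset
```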

> Zip with no compression is a nice contender for a container format that shouldn't be slept on

SquashFS with zstd compression is used by various container runtimes, and is popular in HPC where filesystems often have high latency. It can be mounted natively or with FUSE, and the decompression overhead is not really felt.

  • Just make sure you mount the squashfs with --direct-io or else you will be double caching (caching the sqfs pages, and caching the uncompressed files within the sqfs). I have no idea why this isn't the default. Found this out the hard way.

Gzip will make most line protocols efficient enough that you can do away with needing to write a cryptic one that will just end up being friction every time someone has to triage a production issue. Zstd will do even better.

The real one-two punch is to make your parser faster and then spend the CPU cycles on better compression.
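A quick, hypothetical illustration of that trade-off in Python: a chatty JSON-lines protocol shrinks dramatically under gzip, which is often enough to stop a cryptic binary format from paying for itself.

```python
import gzip
import json

# 10,000 verbose, human-readable protocol lines.
lines = "".join(
    json.dumps({"ts": 1700000000 + i, "sensor": "temp", "value": 20.0 + i % 7}) + "\n"
    for i in range(10_000)
).encode()

print("raw bytes: ", len(lines))
print("gzip bytes:", len(gzip.compress(lines)))  # typically a small fraction of raw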

> It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.

JAR files generally do/did use compression, though. I imagine you could forgo it, but I didn't see it being done. (But maybe that was specific to the J2ME world where it was more necessary?)

  • Specifically the benefit is for the native libraries within the file as you can map the library directly to memory instead of having to make a decompressed copy and then mapping that copy to memory.

    • Yes, that's clear. I'm just not aware of people actually doing that, or having done it back in the era when Java was more dominant.

      1 reply →

Doesn’t ZIP have all the metadata at the end of the file, requiring some seeking still?

  • It has an index at the end of the file, yeah, but once you've read that bit, you learn where the contents are located and if compression is disabled, you can e.g. memory map them.

    With tar you need to scan the entire file start-to-finish before you know where the data is located, as it's literally a tape archiving format, designed for a storage medium with no random access reads.
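A rough Python sketch of that pattern (member name hypothetical; assumes a stored, unencrypted, non-ZIP64 entry): read the directory, then map the member's bytes in place.

```python
import mmap
import struct
import zipfile

def map_stored_member(path, member):
    """Memory-map an uncompressed ("stored") ZIP member without extracting it."""
    with zipfile.ZipFile(path) as zf:
        info = zf.getinfo(member)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed; it cannot be mapped in place")
    with open(path, "rb") as f:
        # The local file header is 30 fixed bytes plus a file name and an
        # extra field; their lengths live at offsets 26 and 28 of the header.
        f.seek(info.header_offset + 26)
        name_len, extra_len = struct.unpack("<HH", f.read(4))
        data_start = info.header_offset + 30 + name_len + extra_len
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # survives close()
    return memoryview(mm)[data_start:data_start + info.file_size]
```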

> I wouldn't expect this to be something that transfers between machines

Maybe non-UNIX machines I suppose.

But I 100% need executable files to be executable.

  • This seems like something that shouldn't be the container format's responsibility. You can record arbitrary metadata and put it in a file in the container, so it's trivial to layer on top.

    On the other hand, tie the container structure to your OS metadata structure, and your (hopefully good) container format is now stuck with portability issues between other OSes that don't have the same metadata layout, as well as your own OS in the past & future.

  • Honestly, sometimes I just want to mark all files on a Linux system as executable and see what would even break and why. Seriously, why is there a whole bit for something that's essentially a 'read permission, but you can also directly execute it from the shell'?

I thought Tar had an extension to add an index, but I can't find it in the Wikipedia article. Maybe I dreamt it.

  • You might be thinking of ar, the classic Unix ARchive that is used for static libraries?

    The format used by `ar` is quite simple, somewhat like tar: files glued together, with a short header in between and no index.

    Early Unix eventually introduced a program called `ranlib` that generates and appends an index for libraries (also containing extracted symbols) to speed up linking. The index is simply embedded as a file with a special name.

    The GNU version of `ar` as well as some later Unix descendants support doing that directly instead.
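For a sense of how simple the classic layout is, a rough Python walker (skips GNU long-name handling):

```python
def list_ar_members(path):
    """Walk a classic `ar` archive: a magic string, then for each member a
    60-byte text header followed by its data, padded to an even offset."""
    with open(path, "rb") as f:
        if f.read(8) != b"!<arch>\n":
            raise ValueError("not an ar archive")
        while True:
            header = f.read(60)
            if len(header) < 60:
                break
            name = header[0:16].decode("ascii").rstrip(" /")
            size = int(header[48:58])
            print(name, size)
            f.seek(size + (size % 2), 1)  # member data is 2-byte aligned
```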

  • Besides `ar`, as a sibling observed, you might also be thinking of pixz - https://github.com/vasi/pixz , but really any archive format (cpio, etc.) can, in principle, just put a stake in the ground to have its last file be any kind of binary/whatever index or directory file, like Zip. Or it could hog a special name like .__META_INF__ instead.

> It effectively reduces the I/O while, unlike TAR, allowing direct random access to the files without "extracting" them or seeking through the entire file

How do you access a particular file without seeking through the entire file? You can't know where anything is without first seeking through the whole file.

  • You look at the end of the file which tells you where the central directory is. The directory tells you where individual files are.

  • At the end of the ZIP file, there's a central directory of all files contained in that archive. Read the last block, seek to the block containing the file you want to access, done
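That same trick works over the network, which is what the "HTTP range queries" remark upthread is getting at. A rough Python sketch (URL hypothetical; assumes the server honors Range and the central directory fits in the tail):

```python
import io
import urllib.request
import zipfile

def list_remote_zip(url, tail_bytes=64 * 1024):
    """List a remote ZIP's members by fetching only the end of the file."""
    head = urllib.request.Request(url, method="HEAD")
    size = int(urllib.request.urlopen(head).headers["Content-Length"])
    start = max(0, size - tail_bytes)
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-"})
    tail = urllib.request.urlopen(req).read()
    # Pad the front so absolute offsets in the central directory still line up.
    # (Fine for a demo; a real client would wrap this in a lazy, seekable file.)
    buf = io.BytesIO(b"\x00" * start + tail)
    return zipfile.ZipFile(buf).namelist()
```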