Comment by nightpool
1 day ago
Really interesting post, but this part from the beginning stuck out to me:
Ruby Gems are tar files, and one of the files in the tar file is a YAML representation of the GemSpec. This YAML file declares all dependencies for the Gem, so RubyGems can know, without evaling anything, what dependencies it needs to install before it can install any particular Gem. Additionally, RubyGems.org provides an API for asking about dependency information, which is actually the normal way of getting dependency info (again, no eval required).
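To make the "no eval required" point concrete, here is a minimal Python sketch of reading the gemspec out of a `.gem` as plain data. It relies on the gem being an uncompressed tar archive whose `metadata.gz` member holds the gzipped YAML gemspec. The `build_fake_gem` helper and the sample YAML are made up for illustration; a real gem carries more members and a much richer gemspec.

```python
import gzip
import io
import tarfile


def read_gemspec_yaml(gem_bytes: bytes) -> str:
    """Return the YAML gemspec text stored in a .gem archive, evaluating nothing."""
    with tarfile.open(fileobj=io.BytesIO(gem_bytes), mode="r:") as outer:
        # extractfile raises KeyError if the member is missing
        member = outer.extractfile("metadata.gz")
        if member is None:
            raise ValueError("metadata.gz is not a regular file")
        return gzip.decompress(member.read()).decode("utf-8")


def build_fake_gem(gemspec_yaml: str) -> bytes:
    """Hypothetical helper: build a stand-in archive holding only metadata.gz."""
    payload = gzip.compress(gemspec_yaml.encode("utf-8"))
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as outer:
        info = tarfile.TarInfo(name="metadata.gz")
        info.size = len(payload)
        outer.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up, heavily simplified gemspec content
    sample = "name: example\nversion: 1.0.0\n"
    print(read_gemspec_yaml(build_fake_gem(sample)))
```

The returned text would still need a YAML parser to turn into dependency records, which is the step whose cost is in question here.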
It would be interesting to compare and contrast the parsing speed for a large representative set of Python dependencies compared to a large representative set of Ruby dependencies. YAML is famously not the most efficient format to parse. We might have been better than `pip`, but I would be surprised if there isn't any room left on the table to parse dependency information in a more efficient format (JSON, protobufs, whatever).
That said, the points at the end about not needing to parse gemspecs to install "most" dependencies would make this pretty moot (if the information is already returned from the gemserver).
YAML may be a dreadful thing, but given the context and the size of a normal gemspec, I would be very surprised if parsing it showed up in any significant capacity, since psych should manage throughput in the low single-digit MB/s range.
For a "YAML" lockfile, you could probably write a much simpler and much more performant parser that throws out much of what makes YAML complicated, in particular, anchors, data type tags, all the ways of doing multi-line strings, all the weird unexpected type conversions (like yes/no converting to a boolean)... If the lockfile is never meant to be edited by human hands, only reviewed by human eyes, you can build a much simpler parser for something like:
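As a rough sketch of how small such a parser can be, here is one in Python for a hypothetical lockfile subset: nested mappings only, two-space indentation, `#` comment lines, and every scalar kept as a string. The format and the `parse_lockfile` name are assumptions for illustration, not any real tool's lockfile.

```python
def parse_lockfile(text: str) -> dict:
    """Parse a tiny YAML-like subset: nested mappings, string scalars only."""
    root: dict = {}
    # Stack of (indent, mapping); the top is the mapping currently being filled.
    stack = [(-1, root)]
    for lineno, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and comment lines
        indent = len(raw) - len(raw.lstrip(" "))
        if indent % 2:
            raise ValueError(f"line {lineno}: odd indentation")
        key, sep, value = raw.strip().partition(":")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'key: value'")
        # Pop back to the mapping that owns this indentation level.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        key, value = key.strip(), value.strip()
        if value:
            # Scalars stay strings: no yes/no, number, or date coercion.
            parent[key] = value
        else:
            # A bare "key:" opens a nested mapping.
            child: dict = {}
            parent[key] = child
            stack.append((indent, child))
    return root


if __name__ == "__main__":
    sample = "gems:\n  rack:\n    version: 3.0.8\n    frozen: yes\nplatform: ruby\n"
    print(parse_lockfile(sample))
```

There are no anchors, tags, multi-line strings, or lists here, and `yes` comes back as the string `"yes"`. A real lockfile would likely need lists and stricter indentation checks, but those can be added without pulling in the rest of YAML.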
It mostly doesn't matter, because these metadata files are pulled into their respective package managers. When you publish to RubyGems, the file is read into their database and made available through their API, just as when you publish a Python package, the pyproject.toml is parsed into the PyPI database and made available.
This is a major reason why uv is faster than older Python package managers, as they were able to take advantage of the change in the PyPI registry that enabled this. Now these package managers can run their dependency calculations without needing to download the entire package, decompress the package files, and then parse them.
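To illustrate reading dependency metadata from the registry without downloading or unpacking a package, here is a small Python sketch against PyPI's JSON API, where `info.requires_dist` lists the declared dependencies. This is a stand-in for the idea only and not a claim about how uv itself fetches metadata; the function names are made up for this example.

```python
import json
import urllib.request


def requires_dist(payload: dict) -> list[str]:
    """Pull declared dependencies out of a PyPI JSON API response body."""
    # `requires_dist` is null when a release declares no dependencies.
    return list(payload["info"].get("requires_dist") or [])


def fetch_release_metadata(name: str, version: str) -> dict:
    """Fetch release metadata from PyPI's JSON API (needs network access)."""
    url = f"https://pypi.org/pypi/{name}/{version}/json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hand-written payload in the JSON API's shape, so this runs offline.
    sample = {"info": {"name": "demo", "requires_dist": ["idna>=2.5", "urllib3<3"]}}
    print(requires_dist(sample))
```

A resolver can work from these requirement strings alone and only download archives for the versions it finally selects.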