Comment by torginus

6 months ago

Because generic monomorphization generates a massive amount of machine code.

that can be the reason, but it's a very bad example

it's quite unlikely that it would be _that_ much smaller if it had been written in C or C++ with the _exact_ same goals, features etc. in mind.

like grep and ripgrep seem on the surface quite similar (grep something, have multiple different regex engines etc.) but if you go into the details they are quite different (not just because rg has file walking and resolution of gitignore logic built in, but also wrt. the goals and features of their regex engines, performance goals, terminal syntax highlighting etc.)
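
For anyone unfamiliar with the monomorphization point quoted above, here is a minimal sketch of what it means in Rust. The function names `describe_generic` and `describe_dyn` are made up for illustration and have nothing to do with ripgrep's code. A generic function is compiled once per concrete type it is used with, while a trait-object version is compiled once and dispatches through a vtable at runtime.

```rust
use std::fmt::Display;

// Generic version: the compiler typically emits a separate copy of this
// function for every concrete `T` it is called with (monomorphization).
fn describe_generic<T: Display>(value: T) -> String {
    format!("value = {}", value)
}

// Trait-object version: a single compiled copy; the call to the `Display`
// implementation goes through a vtable at runtime instead.
fn describe_dyn(value: &dyn Display) -> String {
    format!("value = {}", value)
}

fn main() {
    // Two different types here mean two instantiations of `describe_generic`.
    println!("{}", describe_generic(42u32));
    println!("{}", describe_generic("hello"));

    // Both calls below share the same `describe_dyn` machine code.
    println!("{}", describe_dyn(&42u32));
    println!("{}", describe_dyn(&"hello"));
}
```

How much this contributes to the size of a real binary depends on how many instantiations exist, how much gets inlined, and what the optimizer and linker deduplicate, so this only shows the mechanism and says nothing about ripgrep specifically.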

  • Responding narrowly:

    ripgrep doesn't do "terminal syntax highlighting." It has some basic support for colors, similar to GNU grep.

    GNU grep and ripgrep share a lot of similarities, even beyond superficial ones. There are also some major differences. But I've always said that the Venn diagram of GNU grep and ripgrep has a much bigger area in its intersection than in its symmetric difference.

  • I don't know the reason, but having worked in embedded, I worked on a project that had drivers, app logic, filesystem support, a TCP stack, and then some more, that fit in less than 64kb of ROM (written in C) without much trouble. 2MB for such a tool seems excessive. I would love to see a breakdown of what's in there and how much space it takes up.

    • I wrote an encrypted file exchange tool, 26kb. External dependencies are files, sockets, memcpy and malloc. It's client and server in one file, so it's two times bigger than it can be. It also has complex and almost useless features like traffic obfuscation, probing resistance and hertzbleed resistance because why not, so it's not a minimal implementation.

  • ugrep, which is C++ and similar in scope to ripgrep, is 0.9 MB on my machine, ripgrep is 4.4 MB and GNU grep is 0.2 MB. They all depend on libc and libpcre2.

    Ugrep however depends on libstdc++ and a bunch of libraries for compressed file support (libz,...).

    So yeah a bit bloated but we are not at Electron level yet.

    • It's not clear to me that you're accounting for the difference in size that results from static vs dynamic linking. For example, if I build `ugrep` with `./build.sh --enable-static --without-brotli --without-lzma --without-zstd --without-lz4 --without-bzlib`, then I get a `ugrep` binary that is 4.5MB. (I added all of those `--without-*` flags because I couldn't get the build to work otherwise.) If I add `--without-pcre2`, I get a 3.9MB binary.

      ripgrep is only a little bigger here when you do an apples to apples comparison. To get a static build without PCRE2, run `cargo build --profile release-lto --target x86_64-unknown-linux-musl`. That gets me a 4.6MB `rg` binary. Running `PCRE2_SYS_STATIC=1 cargo build --profile release-lto --target x86_64-unknown-linux-musl --features pcre2` gets a fully static binary with PCRE2 at a 5.4MB `rg` binary.

      Popping up a level, a fair criticism is that it is difficult to get ripgrep to dynamically link most of its dependencies. You can make it dynamically link libc and PCRE2 (that's just `cargo build --profile release-lto --features pcre2`) and get a 4.1MB binary, but getting it to dynamically link all of its Rust crate dependencies is an unsupported build configuration for ripgrep. But I don't know how much tools like ugrep or GNU grep rely on that level of granular dynamic linking anyway. GNU grep doesn't seem to do so on my system (only dynamically linking with libc and PCRE2).

      Additionally, the difference in binary size may be at least partially attributable to a difference in Unicode support:

          $ echo ♥ | rg '\p{Emoji}'
          ♥
          $ echo ♥ | ugrep-7.5.0 '\p{Emoji}'
          ugrep: error: error at position 6
          (?m)\p{Emoji}
                \___invalid character class
