Build a Database in Four Months with Rust and 647 Open-Source Dependencies

1 year ago (tisonkun.io)

“ With a team of three experienced developers, we have implemented ScopeDB from scratch”

“ with 100 direct dependencies and 647 dependencies in total”

Next up: watch me build numpy from scratch with only 150 dependencies, one of which is numpy.

  • You're not wrong, they depend on an external SQL database, which they access with sqlx.

    • In the linked article below, we talked about "If RDS has already been used, why is another database needed?" and "Why RDS?"

      Briefly, you need to manage metadata for the database. You can write your own Raft-based solution or leverage existing software like etcd or ZooKeeper that may not be "a relational database". Now you need to deploy them with EBS and reimplement data replication + multi-AZ fault tolerance, and performance is likely still worse than RDS, because first-class RDS can typically use internal storage APIs and advanced hardware. Such a scenario is not software driven.

      https://flex-ninja.medium.com/from-shared-nothing-to-shared-...

When it comes to understanding the risks involved with having this many dependencies, one thing that folks might not understand is that Rust's support for dependency resolution and lock files is fantastic.

Tools like `cargo audit` can tell you statically, based on the lockfile, which dependencies have security vulnerabilities reported against them (but you have to run it!). And GitHub's https://github.com/dependabot/ will do the same thing automatically, just based on the existence of the lockfile in your repo (and will also open PRs to bump deps for you).

And as mentioned elsewhere: Cargo's dependency resolver supports providing multiple versions of a dep in different dependency subgraphs, which all but eliminates the "dependency hell" that folks expect from ecosystems like Python or the JVM. Two copies of a dep at different versions? Totally fine.
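To make the "two copies of a dep at different versions" point concrete, here is a minimal sketch that lists crates appearing at more than one version in a lockfile. It assumes the usual `[[package]]` / `name = "..."` / `version = "..."` layout of `Cargo.lock` and parses it line by line rather than with a real TOML parser; the function names are hypothetical and chosen only for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Extracts the quoted value from a line such as `name = "foo"`.
/// Returns None for other keys or for unquoted values.
fn quoted_value<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    let rest = line.trim().strip_prefix(key)?.trim_start();
    let rest = rest.strip_prefix('=')?.trim();
    rest.strip_prefix('"')?.strip_suffix('"')
}

/// Maps each crate name to the set of versions listed for it.
/// Assumption: every package entry gives `name` before `version`.
fn versions_by_crate(lockfile: &str) -> BTreeMap<String, BTreeSet<String>> {
    let mut map: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    let mut current: Option<&str> = None;
    for line in lockfile.lines() {
        if line.trim() == "[[package]]" {
            // A new package entry starts; forget the previous name.
            current = None;
        } else if let Some(name) = quoted_value(line, "name") {
            current = Some(name);
        } else if let Some(version) = quoted_value(line, "version") {
            if let Some(name) = current {
                map.entry(name.to_string())
                    .or_default()
                    .insert(version.to_string());
            }
        }
    }
    map
}

/// Returns only the crates that appear at more than one version.
fn duplicated(lockfile: &str) -> Vec<(String, Vec<String>)> {
    versions_by_crate(lockfile)
        .into_iter()
        .filter(|(_, versions)| versions.len() > 1)
        .map(|(name, versions)| (name, versions.into_iter().collect()))
        .collect()
}
```

Calling `duplicated(&std::fs::read_to_string("Cargo.lock")?)` on a real project would give the list of duplicated crates. In practice `cargo tree --duplicates` reports the same information with full dependency paths, so this is only meant to show how little is involved.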

  • Doesn't Node's npm do something similar?

    • Yes. AFAIK, it evolved over time across 3+ package managers (`npm`, `yarn`, `pnpm`, etc), but the current state of that ecosystem is similar (including the behavior of dependabot).

    • Python's Poetry has poetry audit as well, and there are third-party tools such as Safety (Python), Nancy (Golang), etc. Lots of languages have something like this.


  • > Tools like `cargo audit` can tell you statically based on the lockfile which dependencies have security vulnerabilities reported against them

    known security vulnerabilities. If someone compromises your cargo repository (see npm for examples) all your safety is gone.

Isn't that something that should be posted April 1? I'm really not sure if the author is proud about the fact that his project has so many dependencies. Is that something modern coders aim for these days? I usually try to achieve the exact opposite in my projects.

  • > Is that something modern coders aim for these days?

    Yes. No dependencies is so 80's. Just run an ldd on your commonly used programs.

  • Even developers with "few" dependencies often lean on projects (languages, frameworks, etc) where there are hundreds of dependencies.

  • I also prefer to minimize dependencies, and it feels like this is why I can't find work.

    • So do I. I am writing a Perl script right now, and I could either use a non-core dependency, or implement my own. I went with my own. It is only a few lines of code. It works without the need to cpan i the module.

  • It's just really tongue-in-cheek about everything, which makes this article more fun to read imo

While acknowledging one does not "have to" have so many dependencies, the prevalence of this npm-esque type of practice is one of the two things that destroyed all of my interest in Rust.

  • Rust dependencies tend to be pretty high quality in my experience. They are maintained by experts and offer improvements over the state of the art.

    But if you compare to C/C++ at least with Rust you _can_ but aren't required to use dependencies. In C/C++ if you want to, it's a _massive_ pain.

    • I care less about the quality of the dependencies than about the burden of protecting against supply chain attacks when there are a lot of dependencies.


    • One malicious dependency is enough. When you have 600 dependencies "tend to be pretty high quality" does not cut it.

  • it's completely stupid to measure "number of dependencies" in absolute numbers.

    Lots of packages have a `-macros` or `-derive` transitive dependency, meaning a single dependency can end up counting as 3 additional dependencies.

    Rust makes it simple to split packages into workspaces - for example, regex[1] consists of `regex-automata` and `regex-syntax` packages.

    This composition and separation of concerns is a sign of good design, and not an npm-esque hellhole.

    1. https://crates.io/crates/regex/1.11.1/dependencies

    • I suppose you could say that the audit burden scales linearly with the number of module publishers, with a small additional amount on every release point to confirm that the publisher is still who they purport to be and hasn't been compromised.

      This is assuming that the audit consists of validating dependency authorship, and not the more labor-intensive approach of reviewing dependency code.


    • Indeed. It's actually been quite handy on a few occasions to be able to just pull in a smaller crate as opposed to the whole project. (In contrast to, say, boost in C++, which is a big mess of a dependency even though it's one that goes to at least a little bit of effort to let you split it up, but through an ad-hoc process as opposed to a standard package management system.)

      (I would genuinely be interested in an experiment which pushes this as far as possible: what if each function was a 'package'? There's already a decent understanding of how dependencies within a library work in the compiler, what if that extended to the package manager? You would know exactly what code you actually needed, and would only pull in exactly what was necessary)

    • that's kind of on rust for pushing crates front and center rather than groupings of crates that are developed / reviewed / released together as a single cohesive unit (typically a git repo).

      e.g. go dependencies are counted on modules (roughly git repos), rather than packages (directories, compilation units). java is counted in packages rather than classes.

    • The vulnerability to supply chain attacks gives me pause. It's not unique to rust and it bothers me with npm or Python as well.

  • What are you comparing this to? Do you have positive examples? This seems to be a general dependency management issue unrelated to Rust: the reason C++ has this is that C++ also lacks any concept of dependencies, so people kind of just make do with modifying what packages are already integrated into the build process. This certainly doesn't imply you should trust boost (or the standard library, or whatever people use this decade, or xz, or whatever).

  • How many transitive dependencies is the right number for a database?

    • Honestly, current best practice puts that number right around zero, which you see for ambitious implementations.

      A non-obvious issue is that database engines have peculiar requirements for how libraries are designed and implemented which almost no conventional library satisfies. To make matters worse, two different database implementations may have different requirements in this regard, so you can't even share libraries between databases. There are no black boxes in good database engines.


  • Just tried to look at what some macro was generating using cargo-expand. It requires a LOT of dependencies. Took like 5 minutes to compile it all (run `cargo install cargo-expand` if you want to try). I almost aborted because the description of the crate says "Wrapper around rustc -Zunpretty=expanded." so I had expected the simplest possible crate to do that.

    • > Took like 5 minutes to compile it all

      TBF this has nothing to do with dependency complexity and everything to do with semantic complexity. You could easily do this without using any dependencies at all.

      unless you're downloading dependencies during the build or something like that, of course.

I really chuckled about how the blog post opens with how great Rust's open-source ecosystem is, and ends with an "anyway, we made our software private and proprietary"

  • That's technically correct, but they listed several ways they contribute back to the OSS ecosystem: PRs, issues, creating new libraries...

    This comment makes it seem like all this company does is take, which feels unfair to me

    • "We keep ScopeDB private and proprietary, while we actively get involved and contribute back to the open-source dependencies, open source common libraries when it’s suitable"

      They say they do when suitable (never or rarely).

      But that's fine as the licenses allow it. It feels like another company blogging about how great open source is to get PR while closed-sourcing their product.

      The older I get, the more I understand why GPL variations are superior to BSD if you want to grow the software. BSD licenses are good for throwaway code or standards you want others to adopt.

    • >This comment makes it seem like all this company does is take, which feels unfair to me

      Profit isn't far removed from theft, so maybe this shouldn't feel so unfair.


  • Isn't that pretty much the modern stack? Open source language, framework, and libraries, and proprietary end product?

  • > I really chuckled about how the blog post opens with how great Rust's open-source ecosystem is, and ends with an "anyway, we made our software private and proprietary"

    I mean that's been the prevalent attitude for the entire history of open source. It's easy to laugh until someone replaces you.

I was hoping this would be a discussion of Rust build times and how they optimized them with that number of dependencies.

But I think it’s easy for people to criticize dependencies from afar without understanding what they’re used for. I’m sure the dependencies in my projects would look strange to others - for example, I use three HTTP libraries: one for 95% of cases and the others for very specific use-cases where I need control at a low level. But without that context it might seem excessive.

My main question is why observability data needs (or benefits from) a tailor-made database instead of a general purpose one. In 2025, anyone working on observability who told me they have to build their own database, I would be very suspicious!

  • Datadog always builds their own event store: https://www.datadoghq.com/blog/engineering/introducing-husky...

    It may not be named "database", but it actually takes the place of a database.

    Observability vendors will try to store logs with Elasticsearch and later find it overly expensive, with weak support for archiving cold data. A data warehouse solution requires a complex ETL pipeline and can be awkward when handling log data (semi-structured data).

    That said, if you're building an observability solution for a single company, I'd totally agree to start with single-node PG with backup, and only consider other solutions when the data and query workload grow.

    • In 2025 I'd consider starting with clickhouse instead, if you're going the DIY route

  • Not even limited to general purpose ones, there are existing tailor made databases for observability. Maybe somewhere on that page, they explain why this one is better.

57 of which written by DPRK Koding Forces, waiting for the right moment to push a glorious update, striking at the heart of The Biggest Enemy.

I'm having a real crisis trying to decide whether this system should be called a database or not. It's a system for managing data, so obviously it is.. but by that loose interpretation any CRUD webserver would count too.

Yet another thread where people go "Dependency number too big! Rust bad!" with the level of nuance of my dogs discussing dinner.

The full list is linked in the article https://gist.github.com/tisonkun/06550d2dcd9cf6551887ee6305e...

There isn't a single thing there that seems iffy to me. Rust projects split themselves into crates that are as small as possible to 1) ease their own development, 2) improve compile times by making their compilation trivially parallelizable, and 3) allow for reuse. Because of this, you can easily end up with a dozen crates all written by the same group of people, meant to be used together. Whether a project is a single big crate or a dozen small crates, you're in the exact same situation. If you wouldn't audit the small crates because there are a lot of them, you wouldn't audit the big crate thoroughly either.

But what about transitive dependencies? Similar thing: if you have a crate to check for the terminal width, I prefer to take the existing small crate than copy paste its code. I can do the latter, but then you end up with effectively a vendored library in your code that no tool can know about to warn you when a security vulnerability has happened.

  • > There isn't a single thing there that seems iffy to me.

    You mean like four versions of hashbrown (which is useful, but it's rare to have to use it directly instead of `std::collections::HashMap`, never mind pulling four versions of it into your project) or four versions of itertools (which is extremely situational, and even when it is useful it usually only saves you a couple of lines of code, so it's essentially never worth pulling it once, never mind four times)? Or maybe three different crates for random number generation (rand, nanorand, fastrand)?

    There's definitely a problem with how the Rust community approaches dependencies (and I say this as someone who loves Rust and has used it as their main language for 10+ years now). People are just way too trigger happy with external dependencies, and burying our heads in the sand is not helping.

    Inclusion of every external dependency should always be well motivated. How big is the dependency? How much of it do we use? How big of an effect will it have on compile times? How much effort would it be to write it yourself? Is it security sensitive? Is it a dependency which everyone uses and is maintained by well known community members, or some random guy from who knows where? And so on.

    For example, cryptography stuff? No, don't write that yourself if you're not an expert; you'll get it wrong and expose yourself to vulnerabilities. Removing leading whitespace from strings? ("unindent" crate, which is also on your list) Hell no! That's like a minute or two to write this yourself. Did we learn nothing from the left-pad incident?
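For a sense of scale, here is a rough sketch of what "write it yourself" could look like for removing common leading whitespace. This is not the `unindent` crate's implementation, and its edge-case handling may well differ; it assumes indentation made only of spaces and tabs, counts a tab as one character, and drops any trailing newline. The helper names are hypothetical.

```rust
/// True for the two characters this sketch treats as indentation.
fn is_indent(c: char) -> bool {
    c == ' ' || c == '\t'
}

/// Number of leading indentation bytes (spaces or tabs) on a line.
fn leading_indent(line: &str) -> usize {
    line.len() - line.trim_start_matches(is_indent).len()
}

/// Removes the indentation shared by all non-blank lines.
/// Blank lines (empty or indentation-only) do not affect the shared amount.
fn unindent(text: &str) -> String {
    let indent = text
        .lines()
        // Keep only lines that contain something besides indentation.
        .filter(|line| leading_indent(line) < line.len())
        .map(leading_indent)
        .min()
        .unwrap_or(0);

    text.lines()
        // Strip at most `indent` bytes; they are ASCII, so slicing is safe.
        .map(|line| &line[leading_indent(line).min(indent)..])
        .collect::<Vec<_>>()
        .join("\n")
}
```

Whether a dozen lines like this are preferable to a small, widely used crate is exactly the trade-off being argued in this thread: the copy is easy to write, but no tool will flag it if a bug is later found in the original.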

    • > You mean like four versions...

      The two options for cargo here are 1) fail to compile when there's more than one crate-version in the dep tree or 2) allow for there to be more than one and let the project continue compiling. The former would be more "principled" but in practice incredibly disruptive. I usually go "dep hunting" to unify the versions of duplicated deps. Most of the time that's just looking at `cargo tree` and modifying the `Cargo.toml` slightly. Other times it's not easy, and I have to either patch or (better) wait until the diverging dep updates their own `Cargo.toml`.

      > People are just way too trigger happy with external dependencies, and burying our heads in the sand is not helping.

      >> Inclusion of every external dependency should always be well motivated. How big is the dependency? How much of it do we use? How big of an effect will it have on compile times? How much effort would it be to write it yourself? Is it security sensitive? Is it a dependency which everyone uses and is maintained by well known community members, or some random guy from who knows where? And so on.

      We can have a nuanced discussion about dependencies. That's not what I was seeing. There are plenty of things that can be done to improve the situation, especially around supply chain security, but this idea that dependency count is the issue is misguided. It pushes projects towards copy-pasting and vendoring. That makes that code opaque to security tools, existing or proposed. Think of the shitshow it would be if you had an app and decided "more dependencies is bad, so I'm copying xz into my repo".

      > Removing leading whitespace from strings? ("unindent" crate, which is also on your list) Hell no! That's like a minute or two to write this yourself.

      I don't have access to the closed-source repo to run `cargo tree` to see where `unindent` is used from, but why do you feel this is an invalid crate to pull in? It is a proc-macro, that deindents string literals at compile time. Would I include it directly in a project of mine? Likely not, but if I were using `indoc` (written by dtolnay), which uses `unindent` (written by dtolnay) my reaction wouldn't be "oh, no! An additional useless dependency!".


  • Agreed, the dependency list looks extremely boring and completely auditable to me.

    The dependencies are modular, not diffuse.

    I think people saw the title, and got triggered into hate. When actually, this seems author-submitted, and they were probably just trying to be humble about their accomplishment. It's not even the title of the article.

    • > they were probably just trying to be humble about their accomplishment

      Thanks for your reply. To be honest, I simply recognize that depending on open-source software is a trivial choice. Any non-trivial Rust project can pull in hundreds of dependencies, and even when you audit a distributed system written in C++/Java, it's a common case.

      For example, Cloudflare's pingora has more than 400 dependencies. Other databases written in Rust, e.g., Databend and Materialize, have more than 1000 dependencies in the lockfile. TiKV has more than 700 dependencies.

      People seem to fixate on the number of dependencies or criticize closing the source code, ignoring my actual purpose: I'd like to show how you can organically contribute to the open-source ecosystem during your DAYJOB, and that this is a way to write open-source code sustainably.

  • You forgot 4: To break when somebody foolishly does a ``cargo install`` without passing ``--locked``.

  • lots of crates by different authors: you need to trust each one not to be compromised

    lots of crates by a cohesive group of authors: you "only" need to trust that the group reviews each other's work properly and they're not all compromised together (less likely).

The title of the submission is somewhat bait, unfortunately the Cargo.lock doesn't seem to be public. Since my current Rust side-project also has some kind of database (along with, well, a p2p system) and also totals 454 dependencies, I've decided to do a breakdown of my dependency graph (also because I was curious myself):

  - 85 are related to gix (a Rust reimplementation of git, 53 of those are gix itself, that project is unfortunately infamous for splitting things into crates that probably should've been modules)
  - 91 are related to pgp and all the complexity it involves (aes with various cipher modes, des, dsa, ecdsa, ed25519, p256, p384, p521, rsa, sha3, sha2, sha1, md5, blowfish, camellia, cast5, ripemd, pkcs8, pkcs1, pem, sec1, ...)
  - 71 are related to http/irc/tokio (this includes a memory-safe tls implementation, an http stack like percent-encoding, mime, chunked encoding, ...)
  - 26 are related to the winapi (which I don't use myself, but are still part of the resolved dependency graph)
  - 8 are related to web assembly (unused when compiling for Linux)
  - 2 are related to android (also unused when compiling for Linux)
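A first approximation of a breakdown like this can be had by bucketing crate names by prefix, which is roughly how a figure like "53 of those are gix itself" falls out. The sketch below is only that: the "related to" counts above also include transitive dependencies that do not share a prefix, which would need the dependency graph (for example from `cargo tree`). The function name and the sample prefixes are hypothetical.

```rust
use std::collections::BTreeMap;

/// Counts crate names per group, where a group is a name prefix.
/// The first matching prefix wins; names matching none go under "other".
/// Note that a plain prefix match is crude: "gix" also matches "gixfoo".
fn count_by_prefix(names: &[&str], prefixes: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for name in names {
        let group = prefixes
            .iter()
            .find(|prefix| name.starts_with(**prefix))
            .copied()
            .unwrap_or("other");
        *counts.entry(group.to_string()).or_insert(0) += 1;
    }
    counts
}
```

Fed with the package names from a `Cargo.lock` and a handful of prefixes such as `"gix"` or `"winapi"`, this gives a quick picture of which families of crates dominate the total.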

In some ways this is a reminder of how much complexity we're building on top of for the sake of compatibility.

Also keep in mind "reviewing 100 lines of code in 1 library" and "reviewing 100 lines of code split into 2 libraries" is still pretty much the same amount of code (if any of us actually reviewed all their dependencies). You might even have a better time reviewing the sha2 crate vs the entirety of libcrypto.so, if that's all you needed.

My project has been around for (almost) two years, I scanned every commit for vulnerable dependencies using this command:

    for commit in $(git log --all --pretty='%H'); do
      # Overwrites Cargo.lock in the working tree with each commit's version
      git show "$commit":Cargo.lock > Cargo.lock &&
        cargo audit -n --json |
        jq -r '.vulnerabilities.list[] | (.advisory.id + " - " + .package.name)'
    done | sort | uniq

I got a total of 25 advisories (basically what you would be exposed to if you ran all binaries from every single commit simultaneously today). Here's the list:

    RUSTSEC-2020-0071 - time
    RUSTSEC-2023-0018 - remove_dir_all
    RUSTSEC-2023-0034 - h2
    RUSTSEC-2023-0038 - sequoia-openpgp
    RUSTSEC-2023-0039 - buffered-reader
    RUSTSEC-2023-0052 - webpki
    RUSTSEC-2023-0053 - rustls-webpki
    RUSTSEC-2023-0071 - rsa
    RUSTSEC-2024-0003 - h2
    RUSTSEC-2024-0006 - shlex
    RUSTSEC-2024-0019 - mio
    RUSTSEC-2024-0332 - h2
    RUSTSEC-2024-0336 - rustls
    RUSTSEC-2024-0345 - sequoia-openpgp
    RUSTSEC-2024-0348 - gix-index
    RUSTSEC-2024-0349 - gix-worktree
    RUSTSEC-2024-0350 - gix-fs
    RUSTSEC-2024-0351 - gix-ref
    RUSTSEC-2024-0352 - gix-index
    RUSTSEC-2024-0353 - gix-worktree
    RUSTSEC-2024-0355 - gix-path
    RUSTSEC-2024-0367 - gix-path
    RUSTSEC-2024-0371 - gix-path
    RUSTSEC-2024-0373 - quinn-proto
    RUSTSEC-2024-0421 - idna

I guess I'm doing fine. Keep in mind, the binary is fully self-contained, there is no "look, my program has zero dependencies, but I need to ship an entire implementation of the gnu operating system along with it".

so, npm hell, or pip hell again?

to be fair, Python package dependencies are fine to me. There might still be a lot of pip packages, but not the few hundred that npm and cargo normally pull in.

golang also has a reasonable amount of dependencies. npm and cargo dependencies are just scary due to the huge number.

  • NPM and pip hell come about for several reasons, one of the biggest being that package versions are global.

    In Rust, project A can use dependencies B and C, which can both depend on different versions of D. Cargo/crates generally also solve some of the other metadata problems Python has.

    This means the developer experience is _significantly_ improved, at a potential cost of larger binaries. In practice, projects seem to have sufficiently liberal bounds that duplication isn't an issue.

I automatically don't want to use this database because the number of third-party dependencies is an unfixable, never-ending source of security vulnerabilities.

  • Yes, the amount of effort it takes to audit dependencies scales roughly linearly, so unless you're going to blindly install them, choosing to use a project with so many dependencies means taking on a tremendous amount of ongoing work.

    • > the amount of effort it takes to audit dependencies scales roughly linearly

      With the lines of code, not the number of dependencies. 10 dependencies of 100 lines of code are arguably easier, but certainly not harder than a single dependency of 1000 lines of code.


  • Nowadays this applies to everything that depends on modules that depend on more modules (e.g. NodeJS).

  • yeah, rust copied the dumpster fire that was npm, i shudder to think of the future of supply chain security when people say rewrite it in rust.

    • I'm pretty sure everybody just copied from Perl.

      Go did something nice, and it would be good if more people copied. But it was also fairly recent.


    • What would be a better model to manage dependencies, in your opinion? I do like that it is easy to add dependencies, but also don't like that a simple hello world Axum app IIRC is around 150 dependencies.


Is the dependency count supposed to be impressive?

  • Past a number of dependencies, actually getting anything to build deterministically, run reliably and then not get 0wnd to bits becomes an actual challenge, which many enthusiastic developers have a masochistic kink for.

    The thrill of complexity is real.

  • i think the implication is that it's precarious...how does one know all are bug free, for example?

    • Is it? You know for a fact that there are bugs in some of your dependencies. But how many bugs would the code you wrote from scratch instead of adding a dependency have?
