Build a Database in 3000 Lines with 0 Dependencies

1 year ago (build-your-own.org)

If you want 1 dependency rather than 0, there are various low-level libraries designed for building full-featured databases.

For one of these, RocksDB, one reference point is how CockroachDB was built on top of them for many years and many successful Jepsen tests (until they transitioned to an in-house solution).

https://news.ycombinator.com/item?id=37552085

A similar resource I recently discovered and it’s not that popular: https://github.com/pingcap/talent-plan/tree/master/courses/r...

Writing a Bitcask(KV wal) like db in Rust. Really cool and simple ideas. The white paper is like 5 pages.

  • Bitcask is great, it's such an elegant idea. I've built a few different toy implementations in different languages and learned something new each time. YMMV based on how many deps you do or don't want to use, how complete you want to go, but it's a totally doable small-ish project.

I looked into this at one point, I was typing out entire codebases for didactic purposes: SQLite 3 was 120,000 lines of code, but SQLite 2 was 12,000.

So for a bit more effort you get a battle tested real world thing!

  • Really puts the auto- in didact! Very curious to hear how this worked for you; it’s almost directly the opposite of the copilot approach.

    I learned assembler by typing in listings from magazines and hand dis-assembling and debugging on paper. Your approach seems similar in spirit, but who has the times these days?

    • I learned this from Zed Shaw's Learn X The Hard Way books. He says this approach is mainstream in other disciplines, like music, languages, or martial arts.

      I also heard the philosopher Ken Wilber spent a few years (in what kids today call Monk Mode) writing out great books by hand.

      The main effect I noticed is that I rapidly gain muscle memory in a new programming language, library or codebase.

      The other effect is that I'm forced to round-trip every token through my brain, which is very helpful as my eyes tend to glaze over — often I'll be looking right at an obvious bug without seeing it.

      1 reply →

    • I program in neovim with no plugins, no autocomplete and no syntax highlighting. I type everything myself (though I will use copy and paste from time to time). There is a discipline to it that I find very beneficial. As a language designer, it also makes me think very carefully about the syntactic burden of languages that I design. It keeps my languages tight. One of the nice things about typing all of my own code without suggestions is that it eliminates many distractions. I may get some things wrong from time to time, but then I only have myself to blame. And I never waste time messing around with broken plugin configs or irritating syntax highlighting nits.

      It's not for everyone but I love it.

      2 replies →

  • Wait you took a repo and started typing it into the IDE? Could you please expand on what benefits you noticed and how it affected your understanding of the language? It sounds like a fascinating way to force attention to the code simply reading it wouldn't.

    • Yeah I just open two panes in Sublime Text, with the source on the right and then I type it out verbatim on the right.

      I make an effort to keep the line numbers synced. Sometimes I skip long repetitive blocks or comments. But I do type out like 80% of the actual characters in the file.

      It's about 500 lines per hour fot me, so I can estimate reasonably well how long it'll take.

      It's not necessarily an efficient thing to do — you'd get way more bang for your buck just poking around, asking questions, trying to make small changes. But for reasonably small projects, you can type it out in a few hours, or a day or two. Then you've "round-tripped" every single token through your brain (though sadly not with a meaningful amount of conscious reflection) -- unless you pause and ask questions along the way.

      See also my other comment above.

      3 replies →

  • can you elaborate on typing out for didactic purposes, please?

    • It's a new iteration on the ancient form of learning by copying. I've only ever seen people copy stuff when writing by hand when wanting to memorize something though, I imagine with a keyboard the memory-enhancement effect of writing by hand is lost, but it's probably more effective than just reading alone.

Re: copy-on-write (CoW) B-tree vs append-only log + non-CoW B-tree, why not both?

I.e., just write one file (or several) as a B-tree + a log, appending to log, and once in a while merging log entries into the B-tree in a CoW manner. Essentially that's what ZFS does, except it's optional when it really shouldn't be. The whole point of the log is to amortize the cost of the copy-on-write B-tree updates because CoW B-tree updates incur a great deal of write magnification due to having to write all new interior blocks for all leaf node writes. If you wait to accumulate a bunch of transactions then when you finally merge them into the tree you will be able to share many new interior nodes for all those leaf nodes. So just make the log a first-class part of the database.

Also, the log can include small indices of log entries since the last B-tree merge, and then you can accumulate even more transactions in the log before having to merge into the B-tree, thus further amortizing all that write magnification. This approaches an LSM, but with a B-tree at the oldest layer.

I've read a similar series from Phil back in 2020: "Writing a SQL database from scratch in Go" https://notes.eatonphil.com/database-basics.html

The code is available on GitHub: https://github.com/eatonphil/gosql (it's specifically a PostgreSQL implementation in Go).

It's cool to build a database in 3000 lines, but for a real production-ready database you'll need testing. Would love to see some coverage on correctness and reliability tests. For example, SQLite has about 590 times more test code than the library itself. (https://www.sqlite.org/testing.html)

I've started learning Velox last year, and it's a staggering amount of code. Sure, it has a ton of dependency because it wants to support so many things, but I feel like the core itself is also very complex.

I'm not sold on complexity being a necessity in software engineering, as I'm sure a lot of you also aren't. Yet we see a lot of behemoth projects.

Man I got the first edition of this book and it was so bad. Hopefully this is better…

  • Can you elaborate on why it was so bad?

    • Check the reviews online, but generally random bits of Go code are presented with no build-up or explanation. B Trees are implemented immediately and without explanation. The book is incredibly short and seems just like a loose collection of notes, rather than an actual book. Very interested to check out the new version

I've recently started to think about covering some advanced topics like compilers, databases, interpreters. Do u know any good oss repos where I can learn and build them from scratch on my own