B-trees and database indexes (2024)

12 hours ago (planetscale.com)

41 comments

tosh

bddicken 10 hours ago

Oh hey, I wrote this! Happy to chat more about the article here. Databases are kinda my thing.

amarant 8 hours ago
Thanks for writing this! The visualisations really drive a better understanding than pure text does, and it's quite clear that you have a better understanding of what databases do under the hood than I do.
As such, I have a question for you: contrary to your article, I've always been taught that random primary keys are better than sequential ones. The reason for this, I was told, was to avoid "hotspots". I guess it only really applies once sharding comes into play, and perhaps also only if your primary key is your sharding key, but I think that's a pretty common setup.
I'm not really sure how to formulate a concrete question here; I guess I would like to hear your thoughts on the tradeoffs of sequential vs random keys in sharded setups. Is there a case where random keys are valid, or have I been taught nonsense?
- bddicken 8 hours ago
  
  B+trees combined with sequential IDs are great for writes. This is because we are essentially just appending new rows to the "linked list" at the bottom level of the tree. We can also keep a high fill % if we know there isn't a lot of data churn.
  If you're sharding based purely on sequential ID ranges, then yes, this is a problem. It's better practice to shard based on a hash of your ID, so sequential ID assignments turn into non-sequential shard keys, keeping things evenly distributed.
  
- traderj0e 8 hours ago
  
  Spanner in particular wants random primary keys. But there are sharded DBMSes that still use sequential PKs, like Citus. There are also some use cases for semi-sequential PKs like uuid7.
  
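The difference between range-based and hash-based sharding that bddicken describes above can be sketched in a few lines. This is only an illustration: the shard count, the rows-per-shard figure, and the function names are hypothetical, and real systems use their own hash functions and routing layers.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count


def range_shard(row_id: int, rows_per_shard: int = 1_000_000) -> int:
    """Range sharding: consecutive IDs land on the same shard."""
    return (row_id // rows_per_shard) % NUM_SHARDS


def hash_shard(row_id: int) -> int:
    """Hash sharding: hash the ID first so neighbouring IDs scatter."""
    digest = hashlib.sha256(str(row_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# 10,000 freshly assigned sequential IDs, as a burst of new inserts would produce.
new_ids = range(5_000_000, 5_010_000)
range_load = Counter(range_shard(i) for i in new_ids)
hash_load = Counter(hash_shard(i) for i in new_ids)

print("range:", dict(range_load))  # every new row hits a single shard
print("hash: ", dict(hash_load))   # new rows spread over all shards
```

Inside each shard the rows can still be stored under their sequential ID, so the B+tree on every shard keeps its append-friendly insert pattern.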
mamcx 9 hours ago

I remember this article from when I was researching for https://spacetimedb.com/. The interactivity is very cool, BTW!
One neat realization is that a database is in fact more about indexes than the actual raw tables (all the interesting things work under this assumption), to the point that when implementing the engine you get the impression that everything starts with "CREATE INDEX" rather than "CREATE TABLE". This includes sequential scans: as visualized in your article, laying the data out sequentially is in fact a form of index.
Now, I have the dream of making an engine that leans more into this vision...

game_the0ry 10 hours ago

This has been posted before, but PlanetScale also has a great SQL for developers course:

https://planetscale.com/learn/courses/mysql-for-developers

kuharich 2 hours ago

Past comments: https://news.ycombinator.com/item?id=41489832

traderj0e 8 hours ago

I've known for a long time that you usually want b-tree in Postgres/MySQL, but never understood too well how those actually work. This is the best explanation so far.

Also, for some reason there have been lots of HN articles incorrectly advising people to use uuid4 or v7 PKs with Postgres. Somehow this is the first time I've seen one say to just use serial.

evil-olive 5 hours ago
> incorrectly advising people to use uuid4 or v7 PKs with Postgres
random UUIDs vs time-based UUIDs vs sequential integers has too many trade-offs and subtleties to call one of the options "incorrect" like you're doing here.
just as one example, any "just use serial everywhere" recommendation should mention the German tank problem [0] and its possible modern-day implications.
for example, if you're running an online shopping website, sequential order IDs mean that anyone who places two orders is able to infer how many orders your website is processing over time. business people usually don't like leaking that information to competitors. telling them the technical justification of "it saves 8 bytes per order" is unlikely to sway them.
0: https://en.wikipedia.org/wiki/German_tank_problem
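To make the leak concrete, here is a rough sketch of the arithmetic an outsider could do. The order IDs and dates are made up, and the estimator assumes IDs start at 1, increase by 1, and are sampled uniformly.

```python
from datetime import date


def estimate_total(observed_ids):
    """German tank estimate of the highest ID issued so far: m + m/k - 1."""
    m, k = max(observed_ids), len(observed_ids)
    return m + m / k - 1


def orders_per_day(first, second):
    """Each argument is an (order_id, order_date) pair from an order you placed."""
    (id_a, day_a), (id_b, day_b) = first, second
    return (id_b - id_a) / (day_b - day_a).days


# Two orders placed a week apart (hypothetical IDs).
rate = orders_per_day((10_500, date(2024, 3, 1)), (12_600, date(2024, 3, 8)))
print(rate)  # average orders per day between the two purchases

# Four IDs seen in the wild, e.g. on receipts or in URLs.
print(estimate_total([19, 40, 42, 60]))
```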
- jim33442 33 minutes ago
  
  PK isn't the same as public ID, even though you could make them the same. Normally you have a uuid4 or whatever as the public one to look up, but all the internal joins etc use the serial PKs.
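The split jim33442 describes might look roughly like this. It is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for Postgres; the table names, column names, and helper functions are all hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id        INTEGER PRIMARY KEY,   -- internal sequential key, used for joins
    public_id TEXT NOT NULL UNIQUE   -- random uuid4, the only ID clients see
);
CREATE TABLE order_items (
    id       INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id),
    sku      TEXT NOT NULL
);
""")


def create_order(skus):
    """Insert an order and its items; return only the public ID."""
    public_id = str(uuid.uuid4())
    cur = conn.execute("INSERT INTO orders (public_id) VALUES (?)", (public_id,))
    conn.executemany(
        "INSERT INTO order_items (order_id, sku) VALUES (?, ?)",
        [(cur.lastrowid, sku) for sku in skus],
    )
    conn.commit()
    return public_id


def items_for(public_id):
    """Look up by public ID, then join on the internal integer key."""
    rows = conn.execute(
        "SELECT i.sku FROM orders o JOIN order_items i ON i.order_id = o.id "
        "WHERE o.public_id = ? ORDER BY i.id",
        (public_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The cost of this layout is a second unique index on `public_id`, which is itself a B-tree receiving random inserts.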
omcnoe 4 hours ago

DB perf considerations aside, a lot of software patterns around idempotency/safe retries/horiz-scaling/distributed systems are super awkward with a serial pk because you don’t have any kind of unambiguous unique record identifier until after the DB write succeeds.
The DB itself is “distributed” in that it’s running outside the service's own memory in 99% of cases; in complex systems the actual DB write may be buried under multiple layers of service indirection across multiple hosts. Trying to design that correctly while also dealing with the pre-write/post-write split on record id is a nightmare.
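A small sketch of the pattern being contrasted with serial keys here: the caller generates the record ID before the write, so a retry after a lost acknowledgement can reuse it safely. sqlite3 stands in for the real database, and the table and function names are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
)


def write_payment(payment_id, amount_cents):
    """Insert once; a retry with the same ID is a no-op.

    Returns True if this call created the row, False if it already existed.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO payments (id, amount_cents) VALUES (?, ?)",
        (payment_id, amount_cents),
    )
    conn.commit()
    return cur.rowcount == 1


# The ID exists before any database round trip, so it can be logged,
# passed to other services, and reused on retry.
payment_id = str(uuid.uuid4())
first = write_payment(payment_id, 1999)
retry = write_payment(payment_id, 1999)  # e.g. the first ack was lost to a timeout
```

With a serial key the caller would only learn the ID from the first response, so a retry after a timeout could not tell whether the original insert had already landed.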
bddicken 7 hours ago
Simple sequential IDs are great. If you want UUID, v7 is the way to go since it maintains sequential ordering.
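For readers wondering why v7 sorts by creation time: the standard layout puts a 48-bit millisecond timestamp in the most significant bits, followed by version, variant, and random bits. Below is a hand-rolled sketch of that layout for illustration only; `uuid7_sketch` is a hypothetical helper, not a library function, and production code should use a vetted implementation.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUID with the v7 bit layout: timestamp first, random bits last."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                 # top 12 random bits
    rand_b = rand & ((1 << 62) - 1)     # low 62 random bits

    value = (unix_ms & ((1 << 48) - 1)) << 80  # 48-bit Unix timestamp in ms
    value |= 0x7 << 76                         # 4-bit version field = 7
    value |= rand_a << 64
    value |= 0b10 << 62                        # 2-bit variant field
    value |= rand_b
    return uuid.UUID(int=value)


earlier = uuid7_sketch(1_700_000_000_000)
later = uuid7_sketch(1_700_000_000_001)
print(earlier < later)  # IDs from later milliseconds compare greater
```

In this sketch, two IDs generated within the same millisecond are ordered by their random bits, so the ordering is only approximately sequential.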
- omcnoe 4 hours ago
  
  There are subtle gotchas around sequential UUIDs compared to serial, depending on where you generate the UUIDs. You can kinda only get a hard sequential guarantee if you are generating them at write time on the DB host itself.
  But, for both Serial & db-gen’d sequential UUID you can still encounter transaction commit order surprises. I think software relying on sequential records should use some mechanism other than Id/PK to determine it. I’ve personally encountered extremely subtle bugs related to transaction commit order and sequential Id assumptions multiple times.
- jwpapi 6 hours ago
  
  Does all of that apply to PostgreSQL as well, or only MySQL?
  
sgarland 5 hours ago

> just use serial
Ideally you use IDENTITY with Postgres, but the end result is the same, yes.

daneel_w 9 hours ago

"The deeper the tree, the slower it is to look up elements. Thus, we want shallow trees for our databases!"

With composite indices in InnoDB it's even more important to keep the tree streamlined and let it fan out according to data cardinality: https://news.ycombinator.com/item?id=34404641
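As a back-of-the-envelope illustration of the quoted point, the number of levels grows with the logarithm of the key count, and the base of that logarithm is the fanout (keys per node). This sketch assumes an idealized, completely full tree where every node holds exactly `fanout` entries; real engines vary with page size, key width, and fill factor.

```python
def btree_levels(num_keys: int, fanout: int) -> int:
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


# One billion keys at different fanouts.
for fanout in (10, 100, 1000):
    print(fanout, btree_levels(1_000_000_000, fanout))
```

Wider keys mean fewer entries fit in a page, which lowers the fanout and adds levels, so each lookup touches more pages.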

whartung 9 hours ago

I keep hearing about the downside of B(+)-Trees for DBs, that they have issues for certain scenarios, but I've never seen a simple, detailed list about them, what they are, and the scenarios they perform badly in.

bddicken 9 hours ago

It's really just a matter of tradeoffs. B-trees are great, but are better suited for high read % and medium/low write volume. In the opposite case, things like LSMs are typically better suited.
If you want a comprehensive resource, I'd recommend reading either Designing Data-Intensive Applications (Kleppmann) or Database Internals (Petrov). Both have chapters on B-trees and LSMs.
faangguyindia 3 hours ago

If your application is write-intensive, an LSM tree is better than a B-tree.
But you'd rarely need it. We mostly have write-intensive counters. We just write to Redis first, then aggregate and write to Postgres.
This reduces the number of writes we need in Postgres a lot.
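The buffering approach described above can be sketched as follows. An in-memory `Counter` stands in for Redis and sqlite3 stands in for Postgres, purely so the example is self-contained; the table and function names are hypothetical.

```python
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page_views (page TEXT PRIMARY KEY, views INTEGER NOT NULL)")

pending = Counter()  # stand-in for the Redis counters


def record_view(page):
    """Hot path: a cheap in-memory increment, no database write."""
    pending[page] += 1


def flush():
    """Periodic job: one upsert per distinct page. Returns the statement count."""
    statements = 0
    for page, count in pending.items():
        db.execute(
            "INSERT INTO page_views (page, views) VALUES (?, ?) "
            "ON CONFLICT(page) DO UPDATE SET views = views + excluded.views",
            (page, count),
        )
        statements += 1
    db.commit()
    pending.clear()
    return statements


for _ in range(1000):
    record_view("/home")
for _ in range(500):
    record_view("/about")
writes = flush()  # 1,500 increments collapse into 2 database writes
```

The tradeoff is durability: increments that have not been flushed yet are lost if the buffer goes away.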
daneel_w 8 hours ago

See my comment in the main thread for an example. In a worst case scenario, some data is simply too "frizzy" to index/search efficiently and with good performance in a B-tree.
Retr0id 9 hours ago
For pure write throughput, LSM trees tend to beat btrees.
- bddicken 8 hours ago
  
  +1

threatofrain 9 hours ago

Also curious to hear what people think of Bf-tree.

  https://vldb.org/pvldb/vol17/p3442-hao.pdf
  https://github.com/microsoft/bf-tree

bddicken 9 hours ago

I've read this paper and it's a neat idea. It hasn't been introduced into popular OSS databases like Postgres and MySQL, and my understanding is it has some drawbacks for real prod use vs the simplistic benchmarks presented in the paper.
Would love to know if anyone's built something using it outside of academic testing.

photochemsyn 3 hours ago

Sqlite’s btree is available here:

https://github.com/sqlite/sqlite/blob/master/src/btree.c

I always thought this was too complicated to ever really understand how it worked, especially the lock policy, but now with LLMs (assisted with sqlite’s very comprehensive comment policy) even a relative neophyte can start to understand how it all works together. Also the intro to the file is worth reading today:

  * 2004 April 6
  *
  * The author disclaims copyright to this source code. In place of
  * a legal notice, here is a blessing:
  *
  * May you do good and not evil.
  * May you find forgiveness for yourself and forgive others.
  * May you share freely, never taking more than you give.
  *
  *************************************
  * This file implements an external (disk-based) database using BTrees.
  * See the header comment on "btreeInt.h" for additional information.
  * Including a description of file format and an overview of operation.
  */

viccis 4 hours ago

A B+ tree with deletion was one of the most difficult algorithms I had to do back in college. You'd hit edge cases after billions of insertions...

hybirdss 5 hours ago

interactive viz on this kind of topic is just unfair compared to text

jiveturkey 9 hours ago

> MySQL, arguably the world's most popular database management system,

bddicken 7 hours ago

It may not have the popularity it once did, but MySQL still powers a huge % of the internet.
traderj0e 8 hours ago
Is there a problem with that?
- shawn_w 8 hours ago
  
  Not the original commenter, but I thought sqlite had that title.
  
