← Back to context

Comment by sgarland

2 days ago

> 200ms for DB queries

No. Just no. There’s an entire generation of devs at this point who are convinced that a DB is something you throw JSON into, use UUIDs for everything, add indices when things are slower than you expected, and then upsize the DB when that doesn’t fix it.

RAM access on modern hardware has a latency of something like 10 nanoseconds. NVMe reads vary based on queue depth and block size, but sub-msec is easily attainable. Even if your disks are actually a SAN, you should still see 1-2 msec. The rest is up to the DB.

All that to say, a small point query on a well-designed schema should easily execute in sub-msec times if the pages are in the DB’s buffer pool. Even one with a small number of joins shouldn’t take more than 1-2 msec. If this is not the case for you, your schema, query, or DB parameters are sub-optimal, or you’re doing some kind of large aggregation query.

I took a query from 70 to 40 msec today just by rewriting it. Zero additional indexing or tuning, just unrolling several unnecessary nested subqueries, and adding a more selective predicate. I have no doubt that it could get into the single digits if better indexing was applied.

I beg of devs, please take the time to learn SQL, to read EXPLAIN plans, and to measure performance. Don’t accept 200 msec queries as “good enough” because you’re meeting your SLOs. They can be so much faster.

Beg all you want. They're still going to dump JSON strings (not even jsonb) and UUIDs in them anyway, because, "Move fast and break things."

I lament along with you.

  • “We’re disrupting!”

    “Yeah, you’re disrupting my sleep by breaking the DB.”

>RAM access on modern hardware has a latency of something like 10 nanoseconds

What modern hardware are you using that this is true? That's faster than L3 cache on many processors.

  • Correction: DRAM latency is ~10 - 20 nsec on most DDR4 and DDR5 sticks. The access time as seen by a running program is much more than that.

    As an actual example of RAM latency, DDR4-3200 with CL22 would be (22 cycles * 2E9 nsec/sec / 3200E6 cycles/sec) == 13.75 nsec.

"All that to say, a small point query on a well-designed schema should easily execute in sub-msec times if the pages are in the DB’s buffer pool"

Shopify is hosting a large number of webshops with billions of product descriptions, but each store only has a low visitor count. So we are talking about a very large and, hence, uncacheable dataset with sparse access. That means almost every DB query to fetch a product description will hit the disk. I'd even assume a RAID of spinning HDDs for price reasons.

  • Shopify runs a heavily sharded MySQL backend. Their Shop app uses Vitess; last I knew the main Shopify backend wasn’t on Vitess (still sharded, just in-house), but I could be wrong.

    I would be very surprised if “almost every query” was hitting disk, and I’d be even more surprised to learn that they used spinners.

> just unrolling several unnecessary nested subqueries, and adding a more selective predicate

And state of the art query optimizers can even do all this automatically!

  • Sometimes, yes. Sometimes not. This was on MySQL 5.7, and I wound up needing to trace the optimizer path to figure out why it was slower than expected.

    While I do very much appreciate things like WHERE foo IN —> WHERE EXISTS being automatically done, I also would love it if devs would just write the latter form. Planners are fickle, and if statistics get borked, query plans can flip. It’s much harder to diagnose when all along, the planner has been silently rewriting your query, and only now is actually running it as written.

    • Explicit query plan pinning helps a lot, alongside strong profiling and monitoring tools.

I think 500ms P75 is good for an app that hits network in a hot path (mobile networks are no joke), but I agree that 200ms is very very bad for hitting the DB on the backend. I've managed apps with tables in the many, many billions of rows in MySQL and would typically expect single digit millisecond responses. If you use EXPLAIN you can quickly learn to index appropriately and adjust queries when necessary.

  • 500ms p75 is not good for the (low) complexity of the shopify app.

    Also reporting p75 latency instead of p99+ just screams to me that their p99 is embarrassing and they chose p75 to make it seem reasonable.