Comment by daneel_w

3 years ago

My source is my experience from having worked with MariaDB and MySQL for about 15 years now. The two articles you cited are 10 and 13 years old, so let's run a fresh test with two identical test tables that have their columns indexed in opposite order of eachother:

  CREATE TABLE a/b
  (
    x int,
    y int,
    z int,
  )
  
  CREATE INDEX a ON a (x, y, z)
  CREATE INDEX b ON b (z, y, x) # index in reverse order

Some pseudocode to illustrate how I filled the tables:

  for x = 1 to 10:
      for y = 1 to 1000:
          for z = 1 to 10000:
              INSERT INTO a/b VALUES (x, y, z)

Table a took 25 minutes and 2 seconds to populate with 100 million rows.

Table b took 37 hours and 40 minutes to populate. Something's up. I built table a a second time, again taking pretty much the same 25 minutes, and I tried building table b again as well but gave up after four hours. It's definitely not the computer acting up, nor the storage.

  +--------+------------+
  | Table  | Checksum   |
  +--------+------------+
  | test.a | 4243132544 |
  | test.b | 4243132544 |
  +--------+------------+

Not the data either. Both tables checksum identically. Here's what the storage looks like:

  -rw-rw---- 1 mysql mysql 6618611712 Jan 15 15:42 a.ibd
  -rw-rw---- 1 mysql mysql 8715763712 Jan 16 17:42 b.ibd

  +------------+-----------+-----------+
  | table_name | data_mb   | index_mb  |
  +------------+-----------+-----------+
  | a          | 3478.0000 | 2132.0000 |
  | b          | 3769.0000 | 4290.7344 |
  +------------+-----------+-----------+

So the index of table b is obviously a horrendous crow's nest. It's twice as large as the index of table a, and it's even occupying more space than the actual rows themselves. I ran 100k reproducible queries on each of the tables to see how the composite indices perform. More wonky pseudocode:

  srand ( 1234567 )
  for i = 1 to 100000:
      # this includes 0 which is intentional
      xr = prng ( 0 to 10 )
      yr = prng ( 0 to 1000 )
      zr = prng ( 0 to 10000 )
      SELECT * FROM a/b WHERE x = xr AND y = yr AND z = zr

  table a
  =======
  42 seconds
  48 seconds
  35 seconds

  table b
  =======
  87 seconds
  71 seconds
  69 seconds

Some caching is of course involved. I queried the tables one after another just to see if rattling the cage a bit would have any effect:

  table a: 48 seconds
  table b: 83 seconds
  table a: 61 seconds
  table b: 72 seconds

A little bit, but nowhere near enough to outperform the lean and correctly built index on table a. And this is just a simple three-column composite index - the problem I'm illustrating grows exponentially with every column you add to the composite.