Comment by SigmundA

13 hours ago

Would be nice if PG supported clustered indexes (Index Organized Tables in Oracle speak) as an option if you have a table thats accessed mostly the same way you can get a index without the write amplification because the table is the index.

Clustered indexes only save up to 2x write amplification in the very rare case where you're indexing the entire table (e.g. if it has very few columns).

However, that is usually the least of your concerns with write amplification. If you don't batch your writes, you can easily get 100x write amplification. For any primary key or any other index not strongly correlated with your INSERTs, you can get perhaps another 100x write amplification even if you batch you writes.

  • For inserts, you cannot escape writing into the base table and all indexes. However, my understanding is that for updates PostgreSQL has a write amplification problem due to the fact that each time a row is updated this creates a new row (to implement MVCC), and a new physical location in the heap, so all indexes need to be updated to point to the new location, even those not containing the updated columns.

    OTOH, with a heap-less (aka. clustered, aka. index organized) table, you would only have to update the indexes containing the columns that are actually being updated. You don't need to touch any other index. Furthermore, only if you are updating a key column would you physically "move" the entry into a different part of the B-tree. If you update an included column (PK columns are automatically "included" in all secondary indexes, even if not explicitly mentioned in the index definition), you can do that in-place, without moving the entry.

    Here is how this works in SQL Server - consider the following example:

        CREATE TABLE T (
    
            ID int,
            NAME nvarchar(255) NOT NULL,
            AMOUNT int NOT NULL,
    
            CONSTRAINT T_PK PRIMARY KEY (ID)
    
        );
    
        GO
    
        CREATE INDEX T_I1 ON T (NAME);
    
        GO
    
        CREATE INDEX T_I2 ON T (AMOUNT);
    

    Now, doing this...

        UPDATE T SET AMOUNT = 42 WHERE ID = 100;
    

    ...will only write to T_PK and T_I2, but not T_I1. Furthermore T_PK's entry will not need to be moved to a different place in the B-tree. SQL Server uses row versioning similar to PostgreSQL, so it's conceivable that PostgreSQL could behave similarly to SQL Server if it supported clustered (index-organized) tables.

  • >in the very rare case where you're indexing the entire table (e.g. if it has very few columns).

    Not sure I follow most tables are accessed primarily in one way (primary key) while maybe sometimes in others for analysis. Having the PK written twice because it's almost always indexed is normally a waste and good candidate for a clustered index. So much so that many DB's like SQLite and MySql always do clustered indexes on primary key because their storage engine is built such that tables are a b-tree anyway vs PG that has separate b-tree indexes and heap tables. MSSQL and Oracle give you a choice whether the table is a index structure or a heap.

    If you have very specific use case tables they can typically have a clustered index and no secondary indexes, you can still scan them for ad-hoc analysis but you get better insert performance and space usage because you aren't double writing to the heap and a PK index like you would in PG.

    As far as batch writes that is a separate issue and has to due with whether that even makes sense for durability, if you need to commit a single random row due to something occurring you can't batch that up and maintain consistency, if your bulk loading data sure and is common practice to do commit batches there, clustered indexes could still be a 100 vs 200x write amplification if you have to insert both an index row and heap row vs just a single clustered index row.

  • Clustered indexes aren't just about write amplification. They also reduce the reads needed to get the data. Sometimes by quite a bit.

    • That's true for seeks into the clustered (primary) index because that index includes all fields, so you don't need to "jump" to the heap to get them.

      However, seeking into a secondary index, and then reading a column not included in that index incurs an additional index seek (into the clustered index), which may be somewhat slower than what would happen in a heap-based table.

      So there are pros and cons, as usual...

It does support index organized tables with the CLUSTER command, or you meant something else?

  • CLUSTER command is not the same as index organized tables, it's a one-time "physical sort" operation. New data is not organized until you run CLUSTER again. Index organized tables are maintained automatically by Oracle/SQL Server.

Another option would be a good way of placing indexes on a different physical disk. You could use fast, ephemeral storage like you can for a WAL without amplifying the writes to the same device that is your expensive bottleneck. You could rebuild on data loss.

But it would add complexity to detect out-of-sync indexes and tables.