Comment by didgetmaster
1 month ago
This article points out some of the pain associated with index maintenance. It should also point out that ALL indexes on a table suffer from the same issue. If your 20 column table has 7 indexes, then the suggestions should be applied 7x.
It is conventional wisdom that indexes are absolutely essential for any relational table that is of reasonable size (e.g. thousands of rows) and is accessed more often than daily. Indexes can be a pain to create and maintain, but they can greatly speed up queries and primary key validations. The pain mostly comes from figuring out which indexes to create and how often to maintain them, rather than from the work itself.
Indexes also impose a performance penalty on any table update. Creating new rows, updating existing rows, or deleting rows all require updates to each index.
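To make that penalty concrete, here is a small sqlite3 sketch (the table and column names are made up for illustration) that times the same bulk insert with and without secondary indexes; every extra index is extra work on every write:

```python
import sqlite3
import time

def timed_insert(with_indexes: bool) -> float:
    """Insert 200k rows into a fresh in-memory table; return elapsed seconds."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INT, b INT, c TEXT)")
    if with_indexes:
        # Each of these must be updated on every INSERT/UPDATE/DELETE.
        con.execute("CREATE INDEX idx_a ON t(a)")
        con.execute("CREATE INDEX idx_b ON t(b)")
        con.execute("CREATE INDEX idx_c ON t(c)")
    rows = [(i, i % 1000, i % 97, f"row-{i}") for i in range(200_000)]
    start = time.perf_counter()
    with con:  # single transaction
        con.executemany("INSERT INTO t (id, a, b, c) VALUES (?, ?, ?, ?)", rows)
    return time.perf_counter() - start

print("no secondary indexes:", timed_insert(False))
print("3 secondary indexes: ", timed_insert(True))
```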
But are indexes really required? I am creating a new kind of general purpose data management system (a kind of object store) called Didgets. The tagging mechanism I invented to attach tags to each data object is built on key-value stores that essentially form a set of columnar stores.
I found that these columnar stores could also be used to create regular relational database tables. The data is structured such that indexes are not needed. All the tests that I have run (up to a thousand columns with over 100 million rows) show that query speeds are equal to, or better than, other database systems that are well indexed.
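To illustrate the general idea (a toy sketch with invented column names and data, not how Didgets actually stores things internally): when each column lives in its own store, a filter query only has to read the columns it actually touches, regardless of how wide the table is.

```python
# Each column is stored separately, positionally keyed by row id;
# together the columns make up the "table".
table = {
    "name":    ["alice", "bob", "carol", "dave"],
    "country": ["US", "DE", "US", "FR"],
    "age":     [34, 29, 41, 53],
}

def select_rows(column: str, predicate) -> list[int]:
    """Scan a single column and return the matching row ids."""
    return [rid for rid, value in enumerate(table[column]) if predicate(value)]

# "SELECT name WHERE country = 'US'" reads only two of the columns,
# no matter how many other columns exist.
matches = select_rows("country", lambda v: v == "US")
print([table["name"][rid] for rid in matches])  # ['alice', 'carol']
```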
The system is still under development, so it is still missing some key features that would make it a drop-in replacement for other databases; but it proves that it is possible to structure relational data such that query speeds can be optimal without needing separate indexing structures that have to be maintained.
I’m not clear on how you’re deviating from a normal columnar/OLAP database?
> I found that these columnar stores could also be used to create regular relational database tables.
Doesn’t every columnar store do this? Redshift, IQ, Snowflake, ClickHouse, DuckDB etc
> but it proves that it is possible to structure relational data such that query speeds can be optimal without needing separate indexing structures that have to be maintained.
Doesn’t every columnar database already prove this?
I am not an expert on all the other columnar stores out there; but it is my understanding that they are used almost exclusively for OLAP workloads. By 'regular database tables', I meant those that handle transaction processing (inserts, updates, deletes) along with queries.
My system does analytics well, but it is also very fast with changing data.
I also think that some of those systems (e.g. DuckDB) use indexes.
They’re used for OLAP workloads because the columnar properties fit better: storing data column-wise obviously makes row-wise operations more expensive and column-wise operations cheaper, which usually corresponds to point look-ups vs aggregations. That cascades into things like constraint maintenance being more expensive, row-level triggers becoming a psychotic pattern, etc. Column-wise (de)compression also doubles down on this.
They still do all the regular CRUD operations and maintain transactional semantics; they just naturally prefer bulk operations.
Redshift is the most pure take on this I’ve seen, to the point that they simply don’t support most constraints or triggers, and data is allocated in 2MB immutable chunks such that non-bulk operations undergo ridiculous amounts of write amplification and slow to a crawl. Afaik other OLAP databases are not this extreme, and support reasonable throughput on point operations (and triggers, constraints, etc); it’s definitely slower, but not comically slower. (Aside: Aurora is also a pure take on transactional workloads, such that bulk aggregations are comically slow.)
> I also think that some of those systems (e.g. DuckDB) use indexes.
I’m pretty sure they all use indexes, in the same fashion I expect you to (I’m guessing your system doesn’t do table scans for every single query). Columnar databases just get indexes like zone maps for “free”, in the sense that they can simply be applied on top of the actual dataset without having to maintain a separate copy of the data the way row-wise databases do. So it’s an implicit index automatically generated on every column, not user-maintained or specified. I expect your system does exactly the same (because it would be unreasonable not to)
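For anyone unfamiliar with zone maps, here is a rough sketch of the idea (tiny block size and invented data; real systems keep these min/max stats per storage block): a query can skip every block whose range cannot contain the value it is looking for, without any user-defined index.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4  # tiny on purpose; real blocks hold thousands of values

@dataclass
class Block:
    values: list[int]
    lo: int  # minimum value in the block
    hi: int  # maximum value in the block

def build_blocks(column: list[int]) -> list[Block]:
    """Chop a column into blocks and record min/max for each one.
    This is the 'free' implicit index: no separate copy of the data."""
    chunks = (column[i:i + BLOCK_SIZE] for i in range(0, len(column), BLOCK_SIZE))
    return [Block(c, min(c), max(c)) for c in chunks]

def scan_equal(blocks: list[Block], target: int) -> list[int]:
    """Read only the blocks whose [lo, hi] range could contain the target."""
    hits = []
    for b in blocks:
        if b.lo <= target <= b.hi:  # zone-map check: everything else is skipped
            hits.extend(v for v in b.values if v == target)
    return hits

blocks = build_blocks([1, 2, 3, 4, 10, 11, 12, 13, 50, 51, 52, 53])
print(scan_equal(blocks, 51))  # only the last block is actually scanned
```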
> My system does analytics well, but it is also very fast with changing data.
Talk more, please & thank you. I expect everything above to be inherent properties/outcomes of the data layout so I’m quite curious what you’ve done
Interesting ideas. I'm very interested in database ideas that bring new capabilities or better ways to accomplish old ones.
W.r.t. query speeds on your columnar storage engine, you will obviously have much better writes than row-oriented storage engines. This limits your write capabilities though. Any effort you put into restoring write speeds necessitates an extra step to maintain the columnar stores, which puts you back into the group of databases maintaining indices that you criticize above.
I think modern databases are bringing new ideas on how to accelerate both write and query speeds simultaneously with tradeoffs around CAP.
Although I have done much more benchmark testing against other databases for query speeds, I haven't noticed any significant speed degradation on writes.
Could you clarify what you mean by 'this limits your write capabilities'?
> W.r.t. query speeds on your columnar storage engine, you will obviously have much better writes than row-oriented storage engines.
This should have said reads, not writes. Columnar storage takes significantly more effort to handle writes because it must do many more IOs across the different columns, potentially more de/compression cycles, etc.
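Roughly, and ignoring compression and buffering (the layouts below are invented for illustration): a single-row insert into a columnar layout has to touch every column's storage, while a row-wise layout appends the row in one place.

```python
# Row-wise layout: one append per inserted row.
row_store: list[tuple] = []

def insert_row_wise(row: tuple) -> None:
    row_store.append(row)  # one write location

# Column-wise layout: one append per column for every inserted row.
column_store = {"id": [], "name": [], "country": [], "age": []}

def insert_column_wise(row: dict) -> None:
    for col, value in row.items():
        column_store[col].append(value)  # N write locations for N columns

insert_row_wise((1, "alice", "US", 34))
insert_column_wise({"id": 1, "name": "alice", "country": "US", "age": 34})
```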