Comment by theanonymousone
4 days ago
The way I understood it, you can do your inserts with SQLite "proper", and simultaneously use DuckDB for analytics (i.e., read-only).
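A minimal sketch of that split, assuming DuckDB's sqlite extension and a made-up app.db file with an events table: SQLite takes the writes, and a separate DuckDB connection attaches the same file and only ever runs SELECTs against it.

```python
import sqlite3
import duckdb

# Writer side: ordinary SQLite inserts (file name and schema are illustrative).
rw = sqlite3.connect("app.db")
rw.execute("PRAGMA journal_mode=WAL")  # WAL lets readers run alongside the writer
rw.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, user_id INT, amount REAL, ts TEXT)")
rw.execute("INSERT INTO events (user_id, amount, ts) VALUES (1, 9.99, '2024-01-01 00:00:00')")
rw.commit()

# Reader side: DuckDB attaches the same SQLite file via its sqlite extension
# and is used purely for read-only analytical queries.
duck = duckdb.connect()
duck.execute("INSTALL sqlite")
duck.execute("LOAD sqlite")
duck.execute("ATTACH 'app.db' AS src (TYPE sqlite)")
print(duck.execute("SELECT user_id, sum(amount) FROM src.events GROUP BY user_id").fetchall())
```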
Aha! That makes so much sense. Thank you for this.
Edit: Ah, right, the downside is that this is not going to have good OLAP query performance when interacting directly with the SQLite tables. So it's still necessary to copy out to DuckDB tables (probably in batches) if that matters. Still seems very useful to me, though.
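For the copy-out step, one hedged option (same made-up names as above) is a periodic batch rebuild of a native, column-ordered DuckDB table from the SQLite file:

```python
import duckdb

duck = duckdb.connect("analytics.duckdb")  # hypothetical DuckDB database for analytics
duck.execute("INSTALL sqlite")
duck.execute("LOAD sqlite")

# Rebuild the columnar copy in one batch; rerun on whatever cadence fits.
duck.execute("""
    CREATE OR REPLACE TABLE events AS
    SELECT * FROM sqlite_scan('app.db', 'events')
""")

# OLAP queries now hit DuckDB's own columnar storage, not the SQLite pages.
print(duck.execute("SELECT count(*), sum(amount) FROM events").fetchone())
```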
Analytics is done in "batches" (daily, weekly) anyway, right?
We know you can't get both row and column order at the same time, and that continuously maintaining both means duplication and ensuring you get the worst case from both worlds.
Local, row-wise writing is the way to go for write performance. Column-oriented reads are the way to do analytics at scale. It seems alright to have a sync process that does the order rearrangement (maybe with extra precomputed statistics, and sharding to allow many workers if necessary) to let queries of now-historical data run fast.
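One possible shape for that sync process, sketched with the same hypothetical schema: a monotonically increasing id as the high-water mark, plus one precomputed daily rollup. Sharding by id range or by date could be layered on top if many workers are needed.

```python
import duckdb

duck = duckdb.connect("analytics.duckdb")
duck.execute("INSTALL sqlite")
duck.execute("LOAD sqlite")
duck.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT, user_id INT, amount DOUBLE, ts TIMESTAMP)")

def sync_once():
    # High-water mark: the largest id already copied into the columnar store.
    last_id = duck.execute("SELECT coalesce(max(id), 0) FROM events").fetchone()[0]

    # Pull only the new rows out of the row-ordered SQLite file.
    duck.execute("""
        INSERT INTO events
        SELECT id, user_id, amount, CAST(ts AS TIMESTAMP)
        FROM sqlite_scan('app.db', 'events')
        WHERE id > ?
    """, [last_id])

    # Precomputed statistic so common reports don't rescan the raw rows.
    duck.execute("""
        CREATE OR REPLACE TABLE daily_totals AS
        SELECT date_trunc('day', ts) AS day, count(*) AS n, sum(amount) AS total
        FROM events
        GROUP BY 1
    """)

sync_once()  # schedule on whatever delay the workload tolerates
```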
It's not just about row versus column. OLAP schemas are potentially denormalised as well, and sometimes pre-aggregated, such as rolling up by day or by customer.
If you really need performance, you'll be building a star schema.
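A toy version of what that might look like (all names invented): a narrow fact table surrounded by small dimension tables, with the kind of rollup mentioned above precomputed so reports don't re-join and re-scan the facts every time.

```python
import duckdb

con = duckdb.connect()

# The "star": a narrow fact table keyed to small dimension tables.
con.execute("CREATE TABLE dim_customer (customer_id INT, name VARCHAR, region VARCHAR)")
con.execute("CREATE TABLE dim_date (date_id DATE, year INT, month INT)")
con.execute("CREATE TABLE fact_sales (date_id DATE, customer_id INT, amount DOUBLE)")

# Pre-aggregated rollup (by month and region here); reporting queries read
# this table instead of joining and scanning fact_sales on every run.
con.execute("""
    CREATE TABLE rollup_month_region AS
    SELECT d.year, d.month, c.region, sum(f.amount) AS total
    FROM fact_sales f
    JOIN dim_date d USING (date_id)
    JOIN dim_customer c USING (customer_id)
    GROUP BY d.year, d.month, c.region
""")
```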
Not all OLAP-like queries are for daily reporting.
I agree that the basic architecture should be row order -> delay -> column order, but the question (in my mind) is balancing the length of that delay with the usefulness of column-order queries for a given workload. I seem to keep running into workloads that do inserts very quickly and then do batch reads on a slower cadence (either in lockstep with the writes, or concurrently), but not on the extremely slow cadence seen in the typical OLAP reporting flow. Essentially, building up state and then querying the results.
I'm not so sure about "continuously maintaining both means duplication and ensuring you get the worst case from both worlds". Maybe you're right; I'm just not sure. I agree that it duplicates storage requirements, but is that such a big deal? And if fast writes and lookups and fast batch reads are both possible at the cost of storage duplication, wouldn't that actually be the best case from both worlds?
I mean, this isn't that different conceptually from the architecture of log-structured merge trees, which have this same kind of "duplication" but to good purpose. (Indeed, RocksDB has been the closest thing to what I want for this workload that I've found; I just think it would be neat if I could use SQLite+DuckDB instead, accepting some tradeoffs.)