Comment by wgjordan

4 days ago

Nice set of experiments! I appreciate that you're running benchmarks on real object storage setups to validate rapid design variations. (Meta-note: I love how agents have recently made this kind of experimental-research work possible with much less human time investment.)

I've been doing some experiments of my own in a relatively similar space, also focusing on S3/Tigris-backed SQLite on ephemeral compute, also with B-tree aware prefetching (see https://github.com/wjordan/sqlite-prefetch).

I think the idea of storing grouped pages together to optimize read locality is interesting. Note that it steers in the opposite direction of the temporal locality that a format like LTX/Litestream uses to provide transaction-aware features like point-in-time restore. The tradeoff also involves significantly greater write amplification (re-uploading the entire page group every time a single page dirties), heavily favoring cold-read-heavy workloads over mixed-write or warm-read workloads.

The query-plan frontrunning is a very novel experiment as well; discovering in advance that SQLite is about to run a full-table scan seems like a very useful optimization hint to work with. I'd love to see experiments validating how much of an improvement it offers compared to simple reactive prefetch (which takes at least a couple of page faults to get up to speed).

Fantastic comment, thanks for jumping in. I have well-tuned replies here, lol, given we're working on exactly the same problems!

First: your prefetch solution is precisely what I have on the roadmap for the next level of optimization: B-tree introspection at the page-child level, not the table level. I haven't launched it yet, as I only realized the potential in the past couple of days and wanted to focus on stability before showing this to folks. I am 100% going to try the sqlite-prefetch approach in that experiment. I'm curious what kind of results you're seeing.
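For context on what I mean by page-child-level introspection: in the documented SQLite file format, a table-interior b-tree page carries its child page numbers inline, so a prefetcher can fan out to all children the moment the interior page lands. A minimal sketch (my own illustration, not turbolite or sqlite-prefetch code):

```python
import struct

def interior_children(page: bytes, is_page1: bool = False) -> list[int]:
    """Extract child page numbers from a SQLite table-interior b-tree page.

    Per the SQLite file-format docs: interior pages have a 12-byte header;
    byte 0 is the page type (0x05 = table interior), bytes 3-4 the cell
    count, bytes 8-11 the right-most child pointer. Each cell begins with
    a 4-byte big-endian child page number.
    """
    base = 100 if is_page1 else 0  # page 1 carries the 100-byte db header
    if page[base] != 0x05:
        return []  # not a table-interior page
    ncells = struct.unpack_from(">H", page, base + 3)[0]
    rightmost = struct.unpack_from(">I", page, base + 8)[0]
    children = []
    for i in range(ncells):
        # cell-pointer array starts right after the 12-byte header;
        # cell offsets are relative to the start of the page, not `base`
        cell_off = struct.unpack_from(">H", page, base + 12 + 2 * i)[0]
        children.append(struct.unpack_from(">I", page, cell_off)[0])
    children.append(rightmost)
    return children
```

Feeding those page numbers straight into the prefetch queue is the whole trick: one interior-page read turns into a parallel fan-out instead of a sequential pointer chase.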

On write amplification: you're right, re-uploading an entire page group when one page is dirty feels inefficient. This tradeoff is intentional: turbolite treats all PUTs as equal, and it is aimed at optimizing queries on cold databases with bursty reads, not write-heavy workloads (though writes work too). Checkpoints are ideally infrequent due to upload latency, and the page group sizes are tunable (default 256 64KB pages = 16MB uncompressed, a few MB compressed on S3). For the use case I'm targeting, the read-locality wins dominate. That said, turbolite does let the user choose when to checkpoint locally (durable on disk) vs. to S3 (atomic commit in case of disk crash or new compute).
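Back-of-envelope worst case using the defaults above (illustration only, not turbolite internals):

```python
# Worst-case write amplification for grouped pages: dirtying a single
# page forces re-upload of its entire group on the next S3 checkpoint.
page_size = 64 * 1024    # 64 KiB pages (default, per above)
group_pages = 256        # pages per group (default, per above)
group_bytes = page_size * group_pages   # bytes re-uploaded per dirty group

write_amplification = group_bytes // page_size

print(group_bytes // 2**20, "MiB per group,", write_amplification, "x write amp")
```

In practice the amplification is lower (a checkpoint usually dirties many pages in the same group, and the group compresses before upload), but this is the number the tradeoff is priced against.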

On LTX temporal locality vs page-group spatial locality: agreed, they're different design goals. LTX optimizes for transaction-aware features like PITR. turbolite optimizes for cold query speed from object storage. You could imagine a system that does both (WAL shipping for durability + grouped pages for reads), which is roughly where my roadmap goes. In fact, WAL + immutable page groups on checkpoints gives a much faster restore time than Litestream's snapshots+WAL if only because uploading the entire snapshot on each commit is slow. I'm starting to explore embedded WAL shipping in my walrust project https://github.com/russellromney/walrust.

On frontrun vs reactive: I have very specific benchmarks for this that I think you'll be interested in. The tiered-bench binary has a --plan-aware flag that toggles between a reactive prefetch schedule (similar in spirit to your sibling detection) and frontrun (EQP-based). See benchmark/README.md for details. On 100K rows, local to Tigris, EQP is:

- who-liked (one-to-many JOIN): 4.4x faster cold, 1.4x with interior cached

- scan + filter: 2.9x faster cold

- indexed filter: 2.9x faster cold

- point lookup + JOIN: 1.8x cold

- simple queries (mutual friends): 1.3x, less to gain

The wins are largest on cold index lookups and SCANs where reactive prefetch has to discover the scan through sequential misses. Frontrun knows the full plan upfront and fires everything in parallel. For single-table scans, reactive catches up quickly after 1-2 cache misses, so the gap is smaller.
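If anyone wants to see the shape of the EQP approach, here's a stripped-down sketch using Python's stdlib sqlite3; `prefetch_table` is a hypothetical callback standing in for the real prefetcher, and the parsing handles both the modern `SCAN t` and older `SCAN TABLE t` detail strings:

```python
import sqlite3

def frontrun(con, sql, prefetch_table):
    """Ask SQLite for its plan first; on a full-table SCAN, fire a bulk
    prefetch before the query ever faults a page."""
    for _, _, _, detail in con.execute("EXPLAIN QUERY PLAN " + sql):
        tokens = detail.split()  # e.g. "SCAN likes" or "SCAN TABLE likes"
        if tokens and tokens[0] == "SCAN":
            name = tokens[2] if tokens[1] == "TABLE" else tokens[1]
            prefetch_table(name)  # hypothetical hook into the page cache
    return con.execute(sql)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE likes (user_id INTEGER, post_id INTEGER)")
scanned = []
frontrun(con, "SELECT * FROM likes", scanned.append)
print(scanned)  # ['likes']
```

The real version schedules the prefetch of all the table's page groups in parallel rather than naming the table, but the control flow is the same: plan first, fetch second, execute third.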

Finally, I included a query-tuner CLI in the repo that lets you compare different prefetch aggression levels for a given query on given data. You may find it useful.