Comment by simonw
5 years ago
Pop open the network pane in the browser tools and try running this SQL query for a demo of how clever this is:
select country_code, long_name from wdi_country
order by rowid desc limit 100
It fetches just 54.2KB of new data (across 49 small HTTP requests) to return 100 results - from a statically hosted database file that's 668.8MB!
I have an animated GIF demo of this here: https://twitter.com/simonw/status/1388933800445452290
Huh, that's actually kind of a worst case I didn't think about: since you're doing a reverse table scan, my "sequential access" detection doesn't kick in. If you do the same query with a forward scan, it should fetch roughly the same amount of data but in only around 5 HTTP requests, since the request size doubles for every sequential access.
e.g.:
select country_code, long_name from wdi_country where rowid >= 164 order by rowid asc limit 100;
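Roughly, the kind of read-ahead heuristic being described might look like this (a sketch, not the actual sql.js-httpvfs code; names and thresholds are made up):

  // Hypothetical read-ahead heuristic for one virtual read head: if reads keep
  // arriving exactly where the previous request ended, double the next request
  // size (up to a cap); anything else falls back to a single page.
  class ReadHead {
    private expectedOffset = -1;
    private chunkSize = 1024;                 // start at one 1 KiB page
    private readonly maxChunk = 1024 * 1024;  // cap request growth at 1 MiB

    nextRequest(offset: number): { start: number; length: number } {
      if (offset === this.expectedOffset) {
        this.chunkSize = Math.min(this.chunkSize * 2, this.maxChunk);
      } else {
        this.chunkSize = 1024;
      }
      this.expectedOffset = offset + this.chunkSize;
      return { start: offset, length: this.chunkSize };
    }
  }

  const head = new ReadHead();
  console.log(head.nextRequest(0));     // { start: 0, length: 1024 }
  console.log(head.nextRequest(1024));  // sequential -> { start: 1024, length: 2048 }

A reverse scan reads pages at decreasing offsets, so a heuristic like this never sees the "next expected offset" and keeps issuing one small request per page - which would explain the many small requests in the reverse-scan query above.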
I solved a similar problem recently: given a stream of data, how should you choose packet sizes in an online way to minimize regret (a linear combination of the spare capacity of the last packet and the total number of packets used).
It turns out doubling isn't the best strategy; the optimal solution is actually to add a constant increment to the packet size. How large that increment should be depends on the relative cost of the two terms in the regret function.
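Purely as an illustration of that trade-off (hypothetical weights and sizes, not the commenter's actual model), both growth rules can be simulated against the same regret function:

  // regret = a * (spare capacity of the last packet) + b * (packets used)
  function regret(streamBytes: number, next: (prev: number) => number, a = 1, b = 10_000): number {
    let sent = 0, packets = 0, size = 1024;
    while (sent < streamBytes) {
      sent += size;       // emit one packet of the current size
      packets += 1;
      size = next(size);  // grow the packet size for the next round
    }
    return a * (sent - streamBytes) + b * packets;
  }

  const stream = 10_000_000; // bytes
  console.log("doubling:          ", regret(stream, s => s * 2));
  console.log("constant increment:", regret(stream, s => s + 64 * 1024));

How the two compare depends on the weights a and b - the relative cost of the terms mentioned above.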
> I’ve set the page size to 1 KiB for this database.
> That’s because I implemented a pre-fetching system that tries to detect access patterns through three separate virtual read heads and exponentially increases the request size for sequential reads.
> Since you're doing a reverse table scan my "sequential access" detection doesn't kick in.
You know, starting off with the default 4kB page size naturally adds some resistance to these kinds of failure cases. If the VFS isn't issuing many requests in parallel, I would think that setting the page size near target_bandwidth * round_trip_time would be a better initial guess; 1kB would be appropriate for a pretty low latency-bandwidth product.
That's true, but it also means that random access will always fetch at least that much data, even if it only needs a tiny amount. I ran a few (non-scientific) benchmarks on different queries and 1kB seemed like an OK compromise.
And note that the request chunk size is bound to the SQLite page size, and to change that page size you have to rewrite the whole DB. So it can't be set on the fly unless you have multiple copies of the database.
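For anyone building such a database themselves: the page size has to be picked when the file is written. A sketch using plain sql.js (assuming its standard initSqlJs / Database API; the bandwidth and latency figures are only example numbers for the rule of thumb mentioned above):

  import initSqlJs from "sql.js";

  // Rule-of-thumb initial guess from the comment above: page size near
  // bandwidth * round-trip time, e.g. 50 Mbit/s * 100 ms ~= 625 KB, which is why
  // 1 KiB only makes sense for a fairly low latency-bandwidth product.
  const SQL = await initSqlJs();
  const db = new SQL.Database();

  // The page size is baked into the file: set it before creating any tables,
  // or change it later with VACUUM, which rewrites the whole database.
  db.run("PRAGMA page_size = 1024;");
  db.run("VACUUM;");

  // ... CREATE TABLE / INSERT the actual data here ...

  const bytes: Uint8Array = db.export(); // serialize this and host it as a static file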
3 replies →
Hah, I thought "in reverse order by ID" might be a stress test but I was still very impressed by how it performed!
Would it be possible to use a datasette frontend with this as a "backend" and statically host the whole thing?
Not easily - Datasette is written in Python, so you could try running it in WebAssembly (like Mozilla did for Jupyter with https://hacks.mozilla.org/2019/03/iodide-an-experimental-too... ) but it would be enough work that it might be easier to reimplement a subset of Datasette directly in JavaScript.
I've got no experience with Datasette, but if it's all in pure python, then you can use Pyodide[0] to run it within a wasm module. I was surprised how easy it was - took about 15 minutes and I had a Python package working perfectly in the browser. The Python package is about 3x faster than the JS/wasm version, but I'm guessing that performance gap will narrow when we get wasm-gc.
[0] https://github.com/pyodide/pyodide
Thank you!
Amazing use of SQLite! Thanks!
mind blown. how is this possible???
It uses HTTP Range requests (normally used for pausing and resuming large file downloads) to ask for specific byte ranges of the file, so it pulls only what it needs. With SQL indexes the amount fetched is very small, since the lookup is optimized. Of course, if you select *, you're still going to pull the entire database locally.
TL;DR: HTTP, properly implemented, supports a ton more stuff than even many "web developers" are aware of, like… range requests, which are exactly what you'd think they'd be.
Also, SQLite stores data in pages, and the page size can be tweaked. Combined with range requests, any part of the database can be requested.
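For anyone who hasn't used them: a range request is just an ordinary GET with a Range header, and a compliant server answers 206 Partial Content with only those bytes. A minimal sketch (placeholder URL and a hypothetical helper, not the library's actual code):

  // Fetch a single hypothetical 1 KiB page of a statically hosted SQLite file.
  async function fetchPage(url: string, pageNumber: number, pageSize = 1024): Promise<Uint8Array> {
    const start = pageNumber * pageSize;
    const end = start + pageSize - 1; // Range end is inclusive
    const res = await fetch(url, { headers: { Range: `bytes=${start}-${end}` } });
    if (res.status !== 206) {
      throw new Error(`server ignored the Range header (HTTP ${res.status})`);
    }
    return new Uint8Array(await res.arrayBuffer());
  }

  // e.g. bytes 4096-5119 of the example database:
  const page = await fetchPage("https://example.com/wdi.sqlite3", 4);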
The most recent update to the W3C's own research webserver, Jigsaw (written in Java), seems to date from 2007. I used it for a lot of purposes until 2002, but I don't know why I stopped working with Jigsaw - only that by the time F# emerged in 2004 I was absorbed in a new direction:
https://jigsaw.w3.org/
IIRC Jigsaw was used to develop and validate the WebDAV protocols and XQuery. At the time I remember thinking XQuery was sure to be the future, since advanced data management, manipulation, querying, distribution and declaration looked to be the whole point of webservers. The incredible distractions caused by "rich media", as opposed to multimedia as it was understood then, are really worth thinking about. That said, the BBC is doing excellent work on restoring the balance of necessary powers to the network standards engineers: https://www.bbc.co.uk/sounds/help/questions/about-bbc-sounds...
https://www.bbc.co.uk/rd/blog/2014-03-media-source-extension...
https://www.bbc.co.uk/rd/projects/nearly-live-production
2 replies →
BTW, thank you havernator - I have just realised what I can do with the setup I'm almost ready to pull the trigger on, which will give me a surfeit of online capacity (at least a baseload can be maintained while the rest is used for work instead of cloud time): I am definitely going to investigate the possibility of providing a highly standards-compliant setup for simple web serving. If the W3C Jigsaw project had been maintained, I'd simply put it up and invite interested users to persuade me to send them a shell login. OK, obviously that's far too naive today, but I would love to run an especially standards-compliant host for a negligible nominal charge - or even no charge - so people could maybe get a view of better ways to present the WWW.
Frankly, I think that unless we do things like this, the Internet is simply going to become a closed shop to anyone not wielding enterprise budgets and legal-department capabilities.
3 replies →
So assuming no country has a name longer than 98 characters and that all country codes are 2 characters, that is over 500% overhead? Are you missing a /s in your post?
Since random accesses across the internet are really slow, for this kind of fairly small table (where SQLite stores the row data inline within the table's B-Tree) it basically fetches the whole content of each row - so even if you query only the long_name and country_code columns, it will in fact fetch the data of all 29 columns in that table.
If you want it to fetch less data when querying a subset of columns, you could create an index on those columns - then SQLite will do a COVERING INDEX scan and thus read only the necessary data (with the B-Tree itself and the start/end page alignment being the only overhead).
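A sketch of what that could look like for this table (index name and queries are illustrative, not from the article; whether SQLite actually chooses the covering index can be checked with EXPLAIN QUERY PLAN):

  import initSqlJs from "sql.js";

  const SQL = await initSqlJs();
  const db = new SQL.Database(/* ...the bytes of wdi.sqlite3... */);

  // An index containing exactly the queried columns lets SQLite answer a query
  // that touches only those columns from the index pages alone.
  db.run("CREATE INDEX IF NOT EXISTS idx_code_name ON wdi_country(country_code, long_name);");

  // Should report something like: SCAN wdi_country USING COVERING INDEX idx_code_name
  const plan = db.exec(
    "EXPLAIN QUERY PLAN SELECT country_code, long_name FROM wdi_country ORDER BY country_code LIMIT 100;"
  );
  console.log(plan[0].values);

(On a statically hosted file the index would of course have to be added when the database is built, not at query time.)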
Nothing to add to the conversation, just wanted to say I absolutely adore this. Years ago I used to think it'd be nice if search engines simply exposed all their shards to clients and let clients do all the results merging and suchlike. Of course that's probably a terrible idea in practice for performance, but this library is definitely implementing the spirit in a very practical way!
It also reminded me of a vague inverse of this hack. In old versions of Qemu (possibly it is still implemented, but I have vague memories it got ripped out), you could point Qemu at a directory on disk and it'd produce an emulated floppy disk drive with a virtual FAT12 image containing the directory contents. AFAIK it didn't keep the actual data in memory; I guess all it needed was the file sizes to know how to build a virtual memory mapping that contained the filesystem metadata + proxied reads from the underlying files for data sectors. I look forward to seeing your implementation of this concept in a virtualized SQLite file <-> GraphQL proxy ;)
edit: insane, it still exists and apparently supports write mode?! https://en.wikibooks.org/wiki/QEMU/Devices/Storage#Virtual_F...
Have you actually read the article? SQLite is unmodified, and thinks it runs on a virtual file system, which fetches file chunks via HTTP range headers.
It's REALLY impressive that you only need to read 54 KB out of 700 MB, to fetch the records.
> It's REALLY impressive that you only need to read 54 KB out of 700 MB, to fetch the records.
The harsh reality is that writing sensible queries that only reference and return the data actually needed always makes things faster, even with a server DBMS. Oh, how many times have I lamented a naive "select *" for forcing a read of all the row contents even when an index covered the data that was actually needed.
Do most static site hosters support range requests?
5 replies →
It's impressive on one hand.
On the other it's still a lot of overhead.
6 replies →
> sql.js only allows you to create and read from databases that are fully in memory though - so I implemented a virtual file system that fetches chunks of the database with HTTP Range requests when SQLite tries to read from the filesystem: sql.js-httpvfs. From SQLite’s perspective, it just looks like it’s living on a normal computer with an empty filesystem except for a file called /wdi.sqlite3 that it can read from.
From this paragraph it should be pretty clear that it's actually a great result. The database will obviously need to read more data than it presents, so more is fetched.
This might be true.
But this approach lets you actually work out what the optimal size is:
gives: 6307
Or on average:
gives: 23
(Note that it doesn't seem possible to use aggregation functions with a limit clause)
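The exact queries behind those numbers aren't shown above. A guess at the kind of measurement meant - aggregating over a LIMITed subquery, which also sidesteps the limitation noted in the parenthetical - might be (hypothetical, not necessarily the commenter's queries):

  // Total and average size of just the two requested columns, for the 100 rows
  // returned by the original query. `db` is an sql.js Database handle opened on
  // the same file, as in the earlier sketches.
  const sizes = db.exec(`
    SELECT sum(length(country_code) + length(long_name)) AS total_bytes,
           avg(length(country_code) + length(long_name)) AS avg_bytes
    FROM (SELECT country_code, long_name FROM wdi_country ORDER BY rowid DESC LIMIT 100);
  `);
  console.log(sizes[0]);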
The clever part here is only fetching 50KB out of 680MB.