Comment by WJW

5 years ago

So assuming no country has a name longer than 98 characters and that all country codes are 2 characters, that is over 500% overhead? Are you missing a /s in your post?

Since random accesses across the internet are really slow, for this kind of fairly small table (where SQLite stores the row data inline within the B-Tree of the table) it basically fetches the whole content for each row - so even if you query only the long_name and country_code columns, it will in fact fetch the data of all 29 columns in that table.

If you want it to fetch less data for querying a subset of columns, you could create an index on those columns - then SQLite will do a COVERING INDEX scan and thus read only the necessary data (with the B-Tree itself and the start/end page alignment being the only overhead).
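
For the table from the article, such a covering index might look like this (a sketch - the index name is made up, and the exact EXPLAIN QUERY PLAN wording varies by SQLite version):

  -- index covering exactly the two queried columns
  create index idx_country_code_name on wdi_country(country_code, long_name);

  -- the plan should now report a covering index scan, e.g.
  -- "SCAN wdi_country USING COVERING INDEX idx_country_code_name"
  explain query plan select country_code, long_name from wdi_country;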

  • Nothing to add to the conversation, just wanted to say I absolutely adore this. Years ago I used to think it'd be nice if search engines simply exposed all their shards to clients and let clients do all the results merging and suchlike. Of course that's probably a terrible idea in practice for performance, but this library is definitely implementing the spirit in a very practical way!

    It also reminded me of a vague inverse of this hack. In old versions of Qemu (possibly it is still implemented, but I have vague memories it got ripped out), you could point Qemu at a directory on disk and it'd produce an emulated floppy disk drive with a virtual FAT12 image containing the directory contents. AFAIK it didn't keep the actual data in memory, I guess all it needed was file sizes to know how to build a virtual memory mapping that contained the filesystem metadata + proxied reads from the underlying files for data sectors. I look forward to seeing your implementation of this concept in a virtualized SQLite file <-> GraphQL proxy ;)

    edit: insane, it still exists and apparently supports write mode?! https://en.wikibooks.org/wiki/QEMU/Devices/Storage#Virtual_F...

Have you actually read the article? SQLite is unmodified, and thinks it runs on a virtual file system, which fetches file chunks via HTTP range headers.

It's REALLY impressive that you only need to read 54 KB out of 700 MB to fetch the records.

  • > It's REALLY impressive that you only need to read 54 KB out of 700 MB to fetch the records.

    The harsh reality is that doing sensible queries that only reference and return the data actually needed always makes things faster, even with a server DBMS. Oh, how many times have I lamented a naive "select *" for forcing reads of the full row contents even when there was index coverage for the data actually needed.
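
    Assuming a covering index on those two columns exists, the column list alone decides whether SQLite can stay inside the index - a sketch (plan wording varies by version):

      -- served entirely from the index B-Tree:
      explain query plan select country_code, long_name from wdi_country;

      -- forces reading every full row from the table B-Tree:
      explain query plan select * from wdi_country;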

  • Do most static site hosters support range requests?

    • Most web servers do out of the box, so I would assume most do. Basically all, unless they have some reason to turn range processing off or are running a custom or experimental server that hasn't implemented the feature (yet).

      Not supporting range requests would be a disadvantage for any service hosting large files. Resuming failed long downloads wouldn't work, so users might not be happy, and there would be more load on your bandwidth and other resources as the user agent falls back to performing a full download.
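
      An easy way to check a given host (hypothetical URL; a 206 status means the Range header was honored, while a 200 means the server ignored it and sent the whole file):

        curl -s -o /dev/null -w "%{http_code}\n" \
          -H "Range: bytes=0-0" https://example.com/wdi.sqlite3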

    • Generally yes. Because not having range support means you can't resume file downloads. Which is a pretty essential feature for a static file host.

    • More interestingly, do reverse-proxies like Varnish / CDNs like Cloudflare support range requests? If so, do they fetch the whole content on the back, and then allow arbitrary range requests within the cached content on the front?


    • I was wondering that too. Support was spotty in general ~20 years ago but I assume things have improved since then.

  • It's impressive on one hand.

    On the other, it's still a lot of overhead.

    • I would say it's less overhead than downloading the entire db to query it locally...? What is your suggestion for accessing a static database with less overhead?


    • For a casual or personal use case though, the alternative of running a client-server database on something like a VPS is probably more overhead than this. It's unlikely to be a very scalable option, but for use cases as described by the author it seems like a good fit.


> sql.js only allows you to create and read from databases that are fully in memory though - so I implemented a virtual file system that fetches chunks of the database with HTTP Range requests when SQLite tries to read from the filesystem: sql.js-httpvfs. From SQLite’s perspective, it just looks like it’s living on a normal computer with an empty filesystem except for a file called /wdi.sqlite3 that it can read from.

From this paragraph it should be pretty clear that it's actually a great result. The database engine obviously needs to read more data than it ultimately returns - B-Tree pages, row data for columns that weren't asked for - so more is fetched than the bare result size.

This might be true.

But this approach lets you actually work out how small the truly needed data is:

  select sum(length(country_code) + length(long_name)) from wdi_country;

gives: 6307

Or on average:

  select sum(length(country_code) + length(long_name))/count(*) from wdi_country;

gives: 23

(Note that a limit clause doesn't restrict the rows an aggregate function sees - it applies after aggregation - so it can't be used to cap the aggregate's input directly)
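
If you do want the figure for just the first N rows, wrapping the limit in a subquery works (a sketch with N=10):

  select sum(length(country_code) + length(long_name))
  from (select country_code, long_name from wdi_country limit 10);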