Comment by crabmusket

2 days ago

Something I have been considering is an ETL pipeline that, for each customer in our system, writes only that customer's data to a SQLite file, then exposes a full SQLite query facility on that file.

This only works when your customers are of a reasonable size (e.g. small businesses or individuals) but could provide arbitrary analytics power.

It's also a safe target for AIs to write SQL against, if you're into that sort of thing.

We do this with DuckDB, leaving the connection in read-only mode and exposing the database over HTTP. SQL injection by design?

  • Cool to hear I'm not off the deep end. Have you written anything up on this, or were you inspired by anything else? How has it worked out for you?

    • Haven’t written anything up on it, but I wouldn’t use my anecdote as an indication we aren’t off the deep end together!

      At a high level, we use DuckDB as an in-memory OLAP cache that’s invalidated via Postgres logical replication.

      We compute a lot of derived data to offer a read-heavy workload as a product.

      Possibly the most dubious choice I tried out recently was letting the front end just execute SQL directly against DuckDB with a thin serialization layer in the backend.

      The best part about it is the hot-reload feedback loop: the backend doesn’t have to rebuild to iterate quickly.

Zuora (a billing and payments SaaS) has a user-facing SQL interface backed by Trino that works pretty well.

Alternatively, why not use database views? That way you can make sure that customers can only see their own data. Permissions can be handled with normal database users by granting them SELECT rights only on the views.
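A sketch of the row-scoping part of that idea, using stdlib sqlite3 (table and view names are invented). SQLite has no per-user grants, so the actual permission step only works on a server database like Postgres or MySQL, noted in the comment below:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, tenant_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 2, 99.0)])
# One view per tenant; on Postgres you would then run something like
#   GRANT SELECT ON tenant_1_invoices TO tenant_1_user;
# so each customer's DB user can read only their own rows.
conn.execute("""
    CREATE VIEW tenant_1_invoices AS
    SELECT id, amount FROM invoices WHERE tenant_id = 1
""")
rows = conn.execute("SELECT id, amount FROM tenant_1_invoices").fetchall()
```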

If you're afraid of users tanking performance, use read replicas. It's as instantaneous as it gets, and no customer can tank the others.

  • I think my gut would feel safer having a serialisation boundary between the customer's queries and the primary data store. But yes in principle I don't see why this approach wouldn't work just as well. Food for thought.

    • The boundary is the read replica: it cannot write back to the primary DB cluster. But yeah, if you require absolute isolation of tenant data, i.e. even a compromise or local privilege escalation on the read-replica node shouldn't allow an attacker to read other tenants' data, there's no way around it.

      Or... if your database layout allows it, you might be able to achieve that with a per-tenant read-replica server and MySQL replication filters [1] or Postgres row filters [2].

      A SQLite db is effectively the safest option because there is no way to bypass the export step... but the export might also seriously mangle your data (e.g. timezone-aware timestamp columns, since SQLite has no native datetime type) or lack features like Postgres's spatial extensions.

      [1] https://dev.mysql.com/doc/refman/8.4/en/change-replication-f...

      [2] https://www.postgresql.org/docs/current/logical-replication-...