Comment by bosky101

8 months ago

The website and example, and usage looks clean. Kudos! I have some questions around what's happening under the hood that werent evident from an initial read of both your website as well as GitHub.

1. Does subscribe listen for new changes on a transient server(just a queue). Or from a more persistent store?

2. Where do the events persist? I didn't see a connector to postgres. I did see one for s3.

3. What is the default persistence layer you are advocating?

4. Let's say you run 3 instances of the self hosted server. And a random one of them gets a teacher. And 2 random other students gets load balanced to two other servers. How does the teacher get all messages? What's the thread in a distributed setting

5. How do you filter only messages. Eg: only since time T.

6. Pagination / limits to avoid any avalanche?

7. Auth? Custom auth/jwt?

8. REST API to produce?

9. Are consumers restricted to browsers? What about one in node?

10. BONUS: Have you tested if this works embedded as an iframe or embedded in an native/react native mobile app?

4 comments

bosky101

jchanimal 8 months ago

I answered 1-5 in my other reply, hit save and so here's more:

6. There is a limit parameter on the query API, and the underlying data structures utilize async iterator patterns so we have a go-forward path to a streaming (data / query larger than memory) implementation. But for now the decrypt implementation is eager instead of lazy, so that's the first place we'd want to focus to make data > memory workloads no problem.

7. Other embedded databases don't have auth, but we are network-aware so it's a different ballgame. Our next step is read/write access control on a per-ledger basis.

UCAN capability delegation allows us to keep an embedded mindset here, in that authorization becomes a matter of data validity, not something that has to be fetched from a centralized resources. How it works: client device agents generate non-extractable keypairs (like Passkeys) and can link them to account principals via any signing endpoint Fireproof trusts (for starters just the one we run, to an end user it looks like clicking a validation link in an email.) Agents create a new cloud database clock register by locally generating an ephemeral keypair that signs itself over to the principal. Our centralized clock register endpoint only allows updates to the resource identified by the clock's public key ID, from agents which have a valid signed delegation chain to the ephemeral key.

To a developer it will look something like `db.share("bob@example.com")` and now Fireproof Cloud will let Bob read and/or write the db also.

What's cool about this is that access control changes are just data manipulations, so they can happen offline. And the valid delegations can be safely delivered over any channel. In fact there are no secrets in this system except for the non-extractable keypairs.

If you are thinking to yourself "what about revocation?" -- we are hiring.

8. The sync endpoint has the minimal blob k/v (no list) and register APIs. And can all be floated on top of any raw kv with check-and-set semantics if needed.

We have plans for a REST API in Fireproof Cloud, where if you allow the cloud to decrypt and process your data, we can give you raw queries instead of you replicating and then querying locally. I am thinking a CSV output here would be a good place to start.

9. Runs great anywhere JS runs. We have examples (like CatBot linked above) that subscribe to the ledger on the backend and operate locally, often responding the user events. So the DB is acting as an RPC bus... this is a common pattern in CouchDB so I made sure Fireproof works great like that.

To run in an edge function, you usually aren't gonna replicate to local filesystem, instead you can configure the database to read and write directly with the cloud store. Because of the eager decrypt we do, this is actually pretty fast and not that chatty.

10. The CodePen demo on our homepage is an iframe, works great. We have a contributor (I think I see in the thread here) who is working on React Native -- most of the heavy lift is done, but our gateway interface is only now settling down to where it makes sense to finalize the integration. I have also done Socket Supply for mobile and that works great.

jchanimal 8 months ago

These are awesome questions, I'll try to fold the answers into the docs also.

1. The embedded database subscribes to the remote sync endpoint when it is connected. This subscription might be polling, websocket, or anything else. The local embedded database will try to keep up with changes anyone pushes to the remote endpoint. This is more a backend mechanical thing than an API you'll see.

Your code can subscribe to the local database -- this is a JavaScript event loop, and any updates, local or remote, will cause your callback to run. The upshot is all you have to do is connect your database to the sync endpoint and it will stay up to date, and you can also connect your UI to the database via `db.subscribe()`

2. Updates are written to local storage (indexed db or the filesystem) as encrypted blobs. These are then replicated to the cloud (without being parsed by the cloud). We have SQL connectors also, but we haven't done the Postgres specific stuff (just started designing it). That is the data side. There is also the clock register, which the client updates to point to the most recent blob. This register is multi-writer safe, and can occasionally point to more than one "head" blob, in which case the client does the deterministic merge on read.

3. In my experience most people use the defaults, so we have Fireproof Cloud which uses R2 and durable objects. We also have a SAM template for AWS, and a connector for Netlify, in addition things that are more like parts for building your own backend (file and http endpoints).

4. Each ledger replicates 100% when it syncs, so all hosts have the same data (no sharding within a ledger.) Typically you have one centralized endpoint to sync via. (p2p is possible but you'd end up contributing some plumbing to the project I bet). So in this case the class would have a URL that is the sync point, and everyone would pull from it periodically or via streaming.

Merges are idempotent, deterministic, associative, and commutative, so it doesn't matter what order the teacher and students apply updates to their local instance, once all updates are applied, they have the same state.

5. The e2e encryption means you'd have to give the keys to the server to allow it to create subsets for sync, so we haven't done that yet. Our next optimization is to sync the readonly current dataset first, then any extra data needed for writing, and only when necessary, the historical log. This still doesn't solve the subset sync issue, but will benefit all use cases immediately.

There is some cool research we might use for subset sync: https://g-trees.github.io/g_trees/

But more practical is probably to finish the Postgres backend and then build subsetting at the global (multi-ledger) dataset level.

bosky101 8 months ago
wrt (3) Being able to self-host is extremely important. I noticed a lot of focus on the docs on the Quickstart/client usage. But things like default storage engine as a ENV, path for storage as an ENV. These are very important.
hmm. Replicate state to all clients. Ok.
Seems like an opinionated but well thought through project. Godspeed!
- jchanimal 8 months ago
  
  Thanks, and thanks for the encouragement to fully document the gateway interface. We have been flux-ing it lately but as soon as it settles down we’ll do that.
  The vision is many small ledgers, so the full replication per ledger makes sense, but we have work to do on cross-ledger queries