Comment by derefr
20 days ago
> That's not how training works - adjusting model weights to memorize a single data item is not going to fly.
Apologies; I think I got us all kind of off-track in this comment thread by stretching the definition of the term "fine-tuning" in my ancestor comment above.
Actual fine-tuning of the base model's weights (as one would do to customize a base model into a domain-specific model) works the way you're talking about, yes. The backprop from an individual training document would be a drop in the ocean; a "memory" so weak that, unless it touched some bizarre part of the latent vector-space that no other training document has so far affected (and so is until then all-zero), would be extremely unlikely to affect output, let alone create specific recall of the input.
And a shared, global incremental fine-tune of the model to "add memories" would be a hare-brained idea, anyway. Not even just that it wouldn't work, but that if it did work, it would be a security catastrophe, because now the model would be able to recall all this information gleaned from random tenant users' private chat transcripts, with nothing to differentiate that info from any other info to enable the model (or its inference framework) to compartmentalize it / prevent cross-tenant info leaks.
But let me rephrase what I was saying before:
> there's a way to take many transcripts of inference over a period, and convert/distil them together into an incremental-update training dataset (for memory, not for RLHF), that a model can be fine-tuned on as an offline batch process every day/week, such that a new version of the model can come out daily/weekly that hard-remembers everything you told it
As:
> for a given tenant user, there's a way to take all of their inference transcripts over a given period, and convert/distil them together into an incremental-update training dataset (for memory, not for RLHF), that a LoRA can be rebuilt (or itself fine-tuned) on. And that the work of all of these per-tenant LoRA rebuilds can occur asynchronously / "offline", on a batch-processing training cluster, gradually over the course of the day/week; such that at least once per day/week (presuming the tenant-user has any updated data to ingest), each tenant-user will get the effect of their own memory-LoRA being swapped out for a newer one.
---
Note how this is essentially what Apple claimed they would be doing with Apple Intelligence, re: "personal context."
The idea (that I don't think has ever come to fruition as stated—correct me if I'm wrong?) is that Apple would:
1. have your macOS and iOS devices spend some of their idle-on-charge CPU power to extract and normalize training fulltexts from whatever would be considered the user's "documents" — notes, emails, photos, maybe random text files on disk, etc.; and shove these fulltexts into some kind of iCloud-persisted database, where the fulltexts are PKI-encrypted such that only Apple's Private Compute Cloud (PCC) can decode them;
2. have the PCC produce a new/updated memory LoRA (or rather, six of them, because they need to separately imbue each of their domain-specific model "adapter" LoRAs with your personal-context memories);
3. and, once ready, have all your iCloud-account-synced devices to download the new versions of these memory-imbued adapter LoRAs.
---
And this is actually unnecessarily complex/circuitous for a cloud-hosted chat model. The ChatGPT/Claude/etc version of this architecture could be far simpler.
For a cloud-hosted chat model, you don't need a local agent to extract context from your devices; the context is just "past cloud-persisted chat transcripts." (But if you want "personal context" in the model, you could still get it, via an OpenClaw-style "personal agent"; such agents already essentially eat your files and spit them out external memories/RAGs/etc; the only change would be spitting them out into plain-old hidden-session chat transcripts instead, so as to influence the memories of the model they're running on.)
And you don't need a special securely-oblivious cluster to process that data, since unlike "Apple looking at the data on your computer" (which would upset literally everybody), nobody has any kind of expectation that e.g. OpenAI staff can't look at your ChatGPT conversation transcripts.
And cloud-hosted chat models don't really "do" domain-specific adapters (thus the whole "GPT" thing); so you only need to train one memory-LoRA per model. (Though I suppose that might still lead to training several LoRAs per user, if you're relying on smart routing to different models within a model family to save costs.)
And you don't need to distribute the memory-LoRAs back to client devices; as they can just live in an object store and get just-in-time loaded by the inference framework on a given node at the moment it begins an inference token-emission loop for a specific user. (Which might thus cause the inference cluster's routing to benefit from sticky sessions in a way it didn't before—but you don't need it; the LoRAs would likely be small enough to fetch and load within the ~second of delay it takes these cloud-hosted models to allocate you a node.)
This is a fine thought, I'm reluctant about it. It could work, I don't think it's obvious. It's very, very hard to know what to train for and not, and this still leaves the 'fact v. skill' problem - even LORA won't enable a model to remember your favourite lunch place.
This is kind of an existential problem with context I think. Maybe we need new architectures.