Comment by dataflow
18 hours ago
Thanks for sharing this, it's super helpful. I have a question if you don't mind: I want a model that I can feed, say, my entire email mailbox to, so that I can ask it questions later. (Just the text content, which I can clean and preprocess offline for its use.) Have any offline models you've dealt with seemed suitable for that sort of use case, with that volume of content?
If your inbox is as big as mine, you won't be able to load all the text content into a single prompt, even with SOTA cloud-hosted models.
Instead, you should give it tools to search the mailbox by terms, labels, addresses, etc., so that the model can apply fine-grained filters based on the query, read the relevant emails it finds, and then answer the question.
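For what it's worth, here is a rough sketch of what one such search tool could look like. The `Email` record and the `search_mailbox` function are hypothetical names made up for illustration, and the sketch assumes the mailbox has already been parsed into memory; how you expose it to the model depends on whichever tool-calling setup you use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Email:
    """Hypothetical, already-parsed email record."""
    sender: str
    subject: str
    body: str
    sent: datetime


def search_mailbox(emails, term=None, sender=None, limit=20):
    """Hypothetical tool: return emails matching every filter that was supplied."""
    hits = []
    for e in emails:
        # Case-insensitive substring match over subject and body.
        if term and term.lower() not in (e.subject + "\n" + e.body).lower():
            continue
        # Case-insensitive substring match on the sender address.
        if sender and sender.lower() not in e.sender.lower():
            continue
        hits.append(e)
    # Newest first, so recent mail survives when the limit cuts the list.
    hits.sort(key=lambda e: e.sent, reverse=True)
    return hits[:limit]
```

The `limit` is there so that a broad query cannot flood the context window; the model can always narrow the filters and call the tool again.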
Thanks, yeah. I think strong prefiltering is pretty much always doable because, if nothing else, I usually know the time range of the relevant emails and probably the sender/recipient or some keywords, plus I know how to filter out a big chunk of the irrelevant emails (like mailing lists, etc.), so I'm hoping it's not actually that much data for each search. What I don't know is which models would be most suitable even in the case where I can fit the data.
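A minimal sketch of that kind of prefilter, using only Python's standard `email` module. The function name `keep_message` is made up for illustration, and it assumes raw RFC 822 message text, timezone-aware `start`/`end` values, and that mailing-list mail can be recognized by its `List-Id` or `List-Unsubscribe` header (which many lists set, but not all).

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def keep_message(raw, start, end, keywords=()):
    """Hypothetical prefilter: True if the raw message is worth showing the model."""
    msg = message_from_string(raw)
    # Assumption: mailing-list traffic carries one of these headers.
    if msg.get("List-Id") or msg.get("List-Unsubscribe"):
        return False
    date_header = msg.get("Date")
    if not date_header:
        return False
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable date, skip the message
    if sent.tzinfo is None:
        # Treat dates without an offset as UTC so comparisons work.
        sent = sent.replace(tzinfo=timezone.utc)
    if not (start <= sent <= end):
        return False
    if keywords:
        # Simplification: multipart messages are matched on their string form.
        text = (msg.get("Subject", "") + "\n" + str(msg.get_payload())).lower()
        return any(k.lower() in text for k in keywords)
    return True
```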
As an example of the kind of query I'm interested in, I want a model that can tell me all the flights I took within a given time range (so that means it'd have to filter out cancellations). Or, for a given flight, the arrival and departure times and time zones (or the city and country so I can look up the time zone). Stuff like that. (Travel is just an example obviously, I have other topics to ask about.) It's not a terribly large number of emails to search through in each query, but the email structures are too heterogeneous across senders to write custom tooling for each case.
Prompt injection is a problem if your agent has access to anything.
Local models are quite weak at resisting it.
Security is not a concern for the purpose of my question here, please ignore that for now. I'm just looking for text summary and search functionality here, not looking to give it full system access and let it loose on my computer or network. I can easily set up VM/sandboxing/airgapping/etc. as needed.
My question is really just about what can handle that volume of data (ideally, with the quoted sections/duplications/etc. that come with email chains) and still produce useful (textual) output.
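On the quoted sections and duplication specifically, some of that volume can probably be removed before any model sees it. A rough sketch, assuming plain-text bodies where replies are quoted with a leading ">" and introduced by an "On ... wrote:" line; `strip_quoted` and `dedupe` are hypothetical helper names, and real mail clients vary enough that this will miss some cases.

```python
import hashlib
import re

# Assumption: reply attribution lines look like "On <date>, <name> wrote:".
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_quoted(body):
    """Drop '>'-quoted lines and reply attribution lines from a plain-text body."""
    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue
        if ATTRIBUTION.match(line.strip()):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


def dedupe(bodies):
    """Return cleaned bodies, keeping the first copy of each distinct text."""
    seen = set()
    unique = []
    for body in bodies:
        cleaned = strip_quoted(body)
        # Hash a whitespace- and case-normalized form so trivial variants collide.
        normalized = " ".join(cleaned.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if cleaned and digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique
```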
> I'm just looking for text summary and search functionality here
Couldn't someone just send you an email with instructions to "jailbreak" your local model?