Comment by akssassin907
3 days ago
The pattern I found that works: use a small local model (Llama 3B via Ollama, only about 2GB) for heartbeat checks. It just needs to answer "is there anything urgent?", which is a yes/no classification task, not a frontier reasoning task. Reserve the expensive model for actual work. Done right, this can cut token spend by roughly 75% in practice without meaningfully degrading heartbeat quality. The tricky part is the routing logic: deciding which calls go to the cheap model and which actually need the real one. It can be a doozy. I've done this with three lobsters, let me know if you have any questions.
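A minimal sketch of that split, with both model calls stubbed out as plain callables (in practice the local one would hit Ollama's HTTP API and the other your hosted provider; names and prompt are illustrative, not a real library):

```python
# Sketch: cheap local model answers the yes/no heartbeat question;
# the expensive model is only invoked when something is actually urgent.
HEARTBEAT_PROMPT = "Given this context, is there anything urgent? Answer YES or NO.\n\n{ctx}"

def heartbeat(context: str, ask_local, ask_frontier) -> str:
    """ask_local/ask_frontier are callables: prompt str -> response str."""
    answer = ask_local(HEARTBEAT_PROMPT.format(ctx=context)).strip().upper()
    if answer.startswith("YES"):
        # Something needs attention: now spend tokens on the real model.
        return ask_frontier(f"Handle the urgent items in:\n{context}")
    return "nothing urgent"
```

The win comes from call frequency: heartbeats fire constantly, real work is rare, so almost every call stays on the 2GB model.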
Maybe I’m out of touch but why do you need an LLM to decide if there’s any work to be done? Can’t it just queue or schedule tasks? We already have technology for that that doesn’t require an LLM.
Tasks might have prerequisites or conditions.
Like "if it's raining, remind me to grab my umbrella before I leave for work"
-> "is it raining?" requires a tool call to a weather service
-> "before I leave for work" needs access to the user's calendar, plus knowledge of when they usually leave relative to when their work day starts
-> "remind me" needs an efficient way to reach the user: Telegram, iMessage, or WhatsApp, for example.
Totally valid for fixed, well-defined tasks — a cron job is cheaper and more reliable there. The LLM earns its keep when the heartbeat involves contextual judgment: not just "is there a task in the queue" but "given everything happening right now, what actually matters?" If the agent needs to reason about priority, relevance, or context before deciding what to surface — that's where the local model pulls its weight. If your agents only do fixed tasks, you're totally right, you don't need it!
It seems to me like it would be a rather useful exercise to have the smaller model make the routing decision, and below certain confidence thresholds, it sends it to a larger model anyways. Then have the larger model evaluate that choice and perhaps refine instructions.
That's a cleaner implementation than what I described. Small model as meta-router: classify locally, escalate only when confidence is low. The self-evaluation loop you're suggesting would add a quality layer without much overhead — the large model's judgment of its own routing is itself a useful signal. Haven't shipped that yet but it's on the list.
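As a sketch of that meta-router (the threshold and both model callables are assumptions, not shipped code): the small model returns a label with a confidence score, and anything below the threshold escalates to the large model, whose verdict could also be logged as feedback for refining the small model's instructions.

```python
def route(query: str, small, large, threshold: float = 0.8):
    """small/large are callables: query -> (label, confidence).
    Returns (label, which_model_decided)."""
    label, conf = small(query)
    if conf >= threshold:
        return label, "small"
    # Low confidence: escalate. A fuller version would record the
    # disagreement so the large model can critique the routing choice.
    label, _ = large(query)
    return label, "large"
```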