Comment by 2001zhaozhao

10 hours ago

> We switched to the "triager" pattern: a Haiku agent with a very specific and narrow job. Is this issue already tracked or not? If it is, stop right there. If not, escalate to Opus.

I'm planning to self host qwen3.6 27b basically for this purpose

2 comments

2001zhaozhao

shad42 10 hours ago

Nice, it's on our todo list to use oss models too. What are you building?

2001zhaozhao 8 hours ago

The basic idea is to run a 24/7 SWE agent loop on local hardware to maximize the cost effectiveness. The agents just keep running and refining development tasks in a project board. When the human is in the loop they do just enough to complete the tasks with semi-frequent human review. However, whenever human is absent or human feedback becomes the bottleneck they start autonomously debating, delegating to cloud LLMs, etc. to try to clear bottlenecks autonomously instead. So essentially the system will try to do useful things to fully utilize the hardware, which is a specific optimization for local models (you'd never need to do this if you use cloud models).
The local models would also be queryable on-demand (which overrules the 24/7 tasks in terms of priority) as cheap inference. The idea is that in user-queried interactive tasks, the main Claude agent primarily only gets summaries from other agents and makes decisions based on it, thus saving a ton of tokens compared to giving it access to the codebase. These small-model calls would preferentially route to my local model to save costs but overflow to a cloud provider if demand is momentarily too high.