Comment by jryio
8 days ago
This is interesting for offloading "tiered" workloads / running a priority queue with coding agents.
If 60% of the work is "edit this file with this content" or "refactor according to this abstraction", then low-latency, high-throughput inference seems like a needed improvement.
Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1].
Also, I expect both Nvidia and Google to deploy custom silicon for inference [2].
1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...
2: https://www.tomshardware.com/tech-industry/semiconductors/nv...
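(For reference, queueing low-priority work on the Anthropic Batch API looks roughly like the sketch below, using the official anthropic Python SDK; the model id and prompt are placeholders, and the linked plugin's actual implementation may differ.)

```python
# Rough sketch: queue a low-priority task on the Anthropic Batch API,
# then poll for results later. Model id and prompt are placeholders.
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "refactor-task-1",
            "params": {
                "model": "claude-sonnet-4-20250514",  # placeholder model id
                "max_tokens": 4096,
                "messages": [
                    {"role": "user", "content": "Refactor foo.py to use the new abstraction"}
                ],
            },
        }
    ]
)

# Later: check whether the batch has finished and read the results.
status = client.messages.batches.retrieve(batch.id)
if status.processing_status == "ended":
    for result in client.messages.batches.results(batch.id):
        print(result.custom_id, result.result)
```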
Note that batch APIs have significantly higher latency than normal AI agent use; they're mostly intended for bulk work where turnaround time isn't critical. Also, the GPT "Codex" models (and most of the "Pro" models, for that matter) are currently not available under OpenAI's own Batch API, so you would have to use non-agentic models for these tasks, and it's not clear how well they would cope.
(Overall, batches do have quite a bit of potential for agentic work as-is, but you have to cope with them potentially taking up to 24h for just a single roundtrip with your local agent harness.)
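(To make the roundtrip concrete, one OpenAI batch cycle looks roughly like this, assuming the official openai Python SDK; the agent harness has to sit in the polling loop before it can act on the output.)

```python
# Rough sketch of a single OpenAI Batch API roundtrip: upload the requests,
# create a batch with the 24h completion window, then poll until it finishes.
import time
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one request per line: {"custom_id": ..., "method": "POST",
# "url": "/v1/chat/completions", "body": {...}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the whole roundtrip can take up to this long
)

while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

if batch.status == "completed":
    output_jsonl = client.files.content(batch.output_file_id).text
```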
OpenAI has a "flex" processing tier, which works like the normal API but where you accept higher latency and higher error rates in exchange for 50% off (same as batch pricing). It also supports prompt caching for further savings.
For me, it works quite well for low-priority things, without the hassle of using the batch API. Usually the added latency is just a few seconds, so it would still work in an agent loop (and you can retry requests that fail at the "normal" priority tier).
https://developers.openai.com/api/docs/guides/flex-processin...
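(In code it's a single extra parameter on the request; a minimal sketch with the openai Python SDK follows. The model and timeout are illustrative, and flex is only offered on certain models.)

```python
# Minimal sketch: request the flex tier, then fall back to normal priority
# if the flex request is rejected or times out.
import openai
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    try:
        resp = client.chat.completions.create(
            model="o4-mini",          # illustrative; flex supports only certain models
            messages=[{"role": "user", "content": prompt}],
            service_tier="flex",      # ~50% cheaper, higher latency and error rate
            timeout=900,              # flex requests can queue for a while
        )
    except (openai.RateLimitError, openai.APITimeoutError):
        resp = client.chat.completions.create(
            model="o4-mini",
            messages=[{"role": "user", "content": prompt}],
            service_tier="default",   # retry at the normal priority tier
        )
    return resp.choices[0].message.content
```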
That's interesting, but it's a beta feature, so it could go away at any time. It's also not available for the Codex agentic models (or the Pro models, for that matter).
I built something similar using an MCP server that allows Claude to "outsource" development to GLM 4.7 on Cerebras (or a different model, but GLM is what I use). The tool lets Claude set the system prompt and instructions, specify the output file to write to, and, crucially, list which additional files (or subsections of files) should be included as context for the prompt.
I've had great success with it, and it rapidly speeds up development time at fairly minimal cost.
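(Not the parent's actual code, but the shape of such a tool is roughly the sketch below, using the official mcp Python SDK's FastMCP and Cerebras's OpenAI-compatible endpoint; the tool name, signature, and model id are made up for illustration.)

```python
# Sketch of an MCP server exposing an "outsource" tool: Claude supplies the
# system prompt, instructions, output file, and context files; a cheaper/faster
# model does the work and the result is written to disk.
import os
from pathlib import Path

from mcp.server.fastmcp import FastMCP
from openai import OpenAI

mcp = FastMCP("outsource")
# Cerebras exposes an OpenAI-compatible API; the model id below is illustrative.
worker = OpenAI(base_url="https://api.cerebras.ai/v1",
                api_key=os.environ["CEREBRAS_API_KEY"])

@mcp.tool()
def outsource(system_prompt: str, instructions: str,
              output_file: str, context_files: list[str]) -> str:
    """Delegate a coding task to a faster model and write the result to output_file."""
    context = "\n\n".join(f"# {p}\n{Path(p).read_text()}" for p in context_files)
    resp = worker.chat.completions.create(
        model="glm-4.7",  # or whichever model the endpoint serves
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"{instructions}\n\nContext:\n{context}"},
        ],
    )
    Path(output_file).write_text(resp.choices[0].message.content)
    return f"wrote {output_file}"

if __name__ == "__main__":
    mcp.run()  # serve over stdio
```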
Why use MCP instead of an agent skill for something like this, when MCP is typically context-inefficient?
MCP is fine if your tool definition is small. For something like a sub-agent harness that's used very often, it's probably more context-efficient, because the tool is already loaded in context and the model doesn't have to spend a few turns deciding to load the skill, thinking about it, and then invoking another tool/script to run the subagent.
Models haven't been trained enough on using skills yet, so they typically ignore them.