Comment by vinceguidry
17 hours ago
Has anyone come up with a decent harness for small local models, say, gemma4 e4b? I'm trying to roll my own but man, the capability gap is real.
17 hours ago
Has anyone come up with a decent harness for small local models, say, gemma4 e4b? I'm trying to roll my own but man, the capability gap is real.
This is precisely what I've been working on targeting with https://dirge-code.github.io/
I've written up an explanation of what trips small models ups and how the harness can address that here https://yogthos.net/posts/2026-06-08-dirge-code.html
Very interesting work! I put some effort into getting it to work with models my hardware can actually run well and they just fall over immediately. gemma4 12b runs like molasses on my 2080 super but it was the only model able to, with your harness, actually do anything useful. It was the only useful thing I've gotten any model runnable with my hardware with any harness I've tried, very impressive!
I suspect smaller models need more work than is practical to fit harnesses around. The smaller the model, the more work, and it doesn't carry over to other small models.
Deepseek r1 7b could not emit tool calls to save its life, gemma4 e4b couldn't get the names of files right, qwen3.5 4b gets stuck in dumb rabbit holes, I pointed it at a ruby script and asked it to run it, it tried running it with bash then got caught in a loop investigating.
Noble effort though! I guess I'll keep working on my barebones ruby_llm harness, with very tempered expectations. Each of these failure modes can be worked around, but there's too many of them to work around in the general sense.
Thanks, glad to hear the harness is actually doing its job with smaller models on your end. There definitely seems to be a limit of how small a model can get before it can't do any practical work.
I find I tend to view agentic coding similarly to a genetic algorithm. The model is the mutator function, and the harness along with the tests acts as the selection function. Each round the model generates some plausible code, it gets tested against the constraints, the model gets feedback and iterates on it until it converges on something that's workable. So, the real trick is to make sure the environment is producing correct pressures to guide the model in the needed direction.
Another interesting project in this space I can recommend checking out is ATLAS https://github.com/itigges22/ATLAS
Do you have benchmarks comparing against Pi? The blog post doesn't include any hard numbers.
For example, so far I haven't seen any evidence that LSP integration improves performance for small models vs using grep via a bash tool.
I haven't really seen anybody come up with a good test to show hard numbers on comparing agentic harnesses. It's a bit tricky to set up a definitive test given the whole non deterministic nature of LLMs. What I've been focusing on is watching the loop and seeing where model does things that it shouldn't have to. For example, I notice models doing stuff like writing python scripts to match parens for Clojure all the time using editors like Pi. So, having a mechanical way to repair parens, and when that fails, to give the model clear error regarding where syntax is broken removes that whole cycle.
As it stands, it's kind of subjective, you just have to try the harness and see if the model seems to be have better than with the other ones you've been using.
2 replies →
This really resonates. Thanks for mentioning.
This is very impressive!
Thanks, it's been a fun and educational experience working on the project.