Comment by vinceguidry
16 hours ago
Very interesting work! I put some effort into getting it to work with models my hardware can actually run well, and they just fall over immediately. gemma4 12b runs like molasses on my 2080 super, but it was the only model able to do anything useful with your harness. It's also the only useful result I've gotten from any model my hardware can run, with any harness I've tried. Very impressive!
I suspect fitting a harness around smaller models takes more work than is practical. The smaller the model, the more work it takes, and that work doesn't carry over to other small models.
Deepseek r1 7b could not emit tool calls to save its life, gemma4 e4b couldn't get the names of files right, and qwen3.5 4b gets stuck in dumb rabbit holes. I pointed it at a ruby script and asked it to run it; it tried running it with bash, then got caught in a loop investigating.
Noble effort though! I guess I'll keep working on my barebones ruby_llm harness, with very tempered expectations. Each of these failure modes can be worked around, but there are too many of them to work around in the general sense.
Thanks, glad to hear the harness is actually doing its job with smaller models on your end. There definitely seems to be a limit to how small a model can get before it can't do any practical work.
I tend to view agentic coding as something like a genetic algorithm. The model is the mutation function, and the harness together with the tests acts as the selection function. Each round the model generates some plausible code, it gets tested against the constraints, and the model gets feedback and iterates until it converges on something workable. So the real trick is to make sure the environment produces the right pressures to guide the model in the needed direction.
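To make the analogy concrete, here's a minimal sketch of that loop. It's an illustration, not how any particular harness is implemented: `evolve`, `toy_model`, and `toy_tests` are made-up names, and the "model" is a stub that returns a buggy function first and a fixed one once it sees feedback.

```python
from typing import Callable, Optional, Tuple


def evolve(mutate: Callable[[str, str], str],
           select: Callable[[str], Tuple[bool, str]],
           max_rounds: int = 5) -> Optional[str]:
    """Run mutate -> select rounds until a candidate passes or rounds run out."""
    candidate, feedback = "", ""
    for _ in range(max_rounds):
        candidate = mutate(candidate, feedback)  # the model proposes code
        passed, feedback = select(candidate)     # harness + tests judge it
        if passed:
            return candidate
    return None  # never converged within the budget


def toy_model(previous: str, feedback: str) -> str:
    """Hypothetical stand-in for an LLM call: buggy first, fixed after feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_tests(candidate: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for the harness: run the code, check one constraint."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"error: {exc!r}"
    if result != 5:
        return False, f"add(2, 3) returned {result}, expected 5"
    return True, ""


result = evolve(toy_model, toy_tests)
```

The interesting part is `select`: the feedback string it returns is the only pressure the model ever sees, so vague or wrong feedback means the loop just wanders.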
Another interesting project in this space that I can recommend checking out is ATLAS: https://github.com/itigges22/ATLAS