Comment by zambelli
6 hours ago
Yes it does! I haven't published those evals yet, but I'm actually running 24-35B class models on a custom coding harness built on forge (even 120B class recently).
I just need more GPU wall clock time to get more evals done. ETA is...a few weeks? Got distracted by the coding harness.
But the results are the same. Reforged models do better than bare, even at those sizes. As for published results, I ran forge on Anthropic models and reforged did better than bare for them as well :)
>But the results are the same. Reforged models do better than bare, even at those sizes
>I haven't published those evals yet
Don't forget to post the complete settings for those evals, please, because local LLMs' failure modes are often caused by incorrect setups (bad quants, bad chat templates, non-recommended temperatures, ridiculously small context, not enabling "preserve thinking" etc.). In my setup I've never seen Qwen3.6-27b get truly stuck so far. What it usually gets wrong are poor architectural decisions or forgetting to update something.
Good call! The latest forge version has per-model-parameter configs sourced from official sources (can be overridden), that's what I'll use for evals and each eval set will be paired with a commit hash. But I'll make sure to call out the location of the params and maybe highlight some for the popular models.
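A minimal sketch of the idea described above: per-model default parameters that can be overridden, with each eval run tied to a commit hash. Every name here (`DEFAULT_PARAMS`, `resolve_params`, `EvalRun`, `make_run`, the model names, the parameter values) is hypothetical and for illustration only; none of it is forge's actual config format or API.

```python
from dataclasses import dataclass

# Hypothetical per-model defaults. The model names and values are placeholders,
# not official recommendations and not taken from forge.
DEFAULT_PARAMS = {
    "example-model-a": {"temperature": 0.7, "top_p": 0.95},
    "example-model-b": {"temperature": 0.6, "top_p": 0.9},
}


def resolve_params(model, overrides=None):
    """Return the defaults for a model with any user overrides applied on top."""
    params = dict(DEFAULT_PARAMS.get(model, {}))  # copy so defaults stay untouched
    params.update(overrides or {})
    return params


@dataclass(frozen=True)
class EvalRun:
    """One eval run: which model, which code revision, which effective params."""
    model: str
    commit: str
    params: dict


def make_run(model, commit, overrides=None):
    """Pair the resolved params with the commit hash the eval was run at."""
    return EvalRun(model=model, commit=commit, params=resolve_params(model, overrides))
```

Recording the resolved parameters next to the commit hash means a reader can see both the settings that were actually used and the exact revision that produced them.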
For the paper - more academic in nature - I wanted to isolate the model performance variable from guardrail lift. The delta is what mattered more than final score. For the paper, everyone got temp=0.7 - that was intentional.
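To make the "delta over final score" point concrete, here is a small sketch of how a per-model lift could be computed when every model runs at the same temperature. The function names and the score layout are assumptions for illustration, not the paper's actual method, and no real results are included.

```python
# The only setting stated in the thread: every model ran at the same temperature.
FIXED_TEMPERATURE = 0.7


def guardrail_lift(bare_score, reforged_score):
    """Lift is the reforged score minus the bare score for the same model."""
    return reforged_score - bare_score


def lifts(scores):
    """Map each model to its lift.

    `scores` is a hypothetical layout: {model_name: {"bare": x, "reforged": y}}.
    """
    return {
        model: guardrail_lift(result["bare"], result["reforged"])
        for model, result in scores.items()
    }
```

Comparing lifts rather than final scores keeps a strong base model from hiding how much the guardrails themselves contribute.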
As for Qwen3.6, it's really solid. It'll do really well on forge, I can call that now. When I pushed it into agentic coding specifically, with the eval suite I use there (separate from forge), even it needed help on long-running tasks - but it's definitely a top model right now.
However, it's entirely possible there are better settings than the "official recommendations" I found - which would be a neat finding in itself.
If it's worth it to you, you could try running it on DeepSeek V4 Flash, which is very cheap right now...
Exactly what I was thinking - even on frontier or near-frontier models I still see my agents get stuck in these pointless loops where it's very obvious to me what they need to do to get "unstuck".
Yeah, it's a useful framework even with frontier. And it definitely lifts "cheap" frontier models like Haiku into more solid territory. I haven't done a ton of forge integrations into frontier (like pointing claude code into proxy mode) yet, but if you run into any issues let me know!
And we're off! It's working great with DeepSeek V4. DeepSeek V4 Pro tends not to really run into problems anyway, being near-frontier, but I definitely see improvement with Flash.
I'm attempting to make a replica of your Anthropic method that will do the same for DeepSeek. I'll let you know how it goes.
For our local Qwen, your setup works great out of the box!