Comment by vineyardmike
11 hours ago
I recently switched to OpenCode to try out many of the Non-US-Frontier-Labs models. My unexpected favorite model to use was Mercury (a diffusion model). Not because it was “smart” but because it was stupid fast. It was more of a pair-programming experience than the SOTA agentic experience of prompting and waiting. Honestly, it was also way more fun and brought back some of the pre-AI coding experience while still getting some benefits of AI. It felt less like a slot machine where you prompt, wait, and hope it went in the right direction. It even made me use the tiny models like Gemini Flash Lite and GPT Mini/Nano more too.
Anyways, so excited for an open-weight model and I hope it performs well. I’ll be testing this ASAP.
If you can run your tests fast and cheaply, and have metrics for bad/sloppy code that are also cheap & fast to compute, a worse but fast model can outperform a far better but far slower model if you value time...
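To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes attempts are independent, so the number of tries until the first passing one is geometric with mean 1/p. The success rates and timings are made up for illustration, not measurements of any real model.

```python
def expected_seconds_to_pass(p_success: float,
                             seconds_per_attempt: float,
                             seconds_per_check: float) -> float:
    """Expected wall-clock time until the first attempt that passes the checks.

    Assumes independent attempts with the same success probability, so the
    expected number of attempts is 1 / p_success.
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return (seconds_per_attempt + seconds_per_check) / p_success


# Hypothetical numbers: a fast model that is right 40% of the time versus
# a slow model that is right 80% of the time, both with 10 s of tests/metrics.
fast = expected_seconds_to_pass(p_success=0.4, seconds_per_attempt=5, seconds_per_check=10)
slow = expected_seconds_to_pass(p_success=0.8, seconds_per_attempt=90, seconds_per_check=10)
print(f"fast: {fast:.1f}s, slow: {slow:.1f}s")  # about 37.5 s versus 125.0 s
```

Note how the check time sits in the numerator for both: if the tests themselves are slow, the fast model's advantage shrinks, which is why the cheap-and-fast-metrics condition matters.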
I've had pretty good success with LLMs after putting in place metrics to measure true complexity (not cyclomatic), and automatically pushing back everything until the added complexity is within reason for the feature.
How do you measure “true” complexity? Cyclomatic seems a bit… I dunno, artificial? Blunt? But it has the benefit of being defined.
There's a ton of research on this in the 80s... and interestingly, I haven't seen a lot of recent research.
Surprisingly, it seems most languages don't have a standard package to do a lot of these detections.
Ruby has Flay to detect structural similarity, i.e. near-duplicate code (something LLMs are prone to produce). Basically, they re-write a huge function with only a couple of minor differences that should probably be params...
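For a feel of what this kind of similarity check does, here is a minimal sketch in Python using the standard ast module. It is not Flay and not Flay's algorithm: it simply erases variable names and literal values and then groups functions whose remaining shape is identical. The sample functions are hypothetical.

```python
import ast
import copy
from collections import defaultdict
from typing import List


class _Normalize(ast.NodeTransformer):
    """Erase identifiers and literal values so only the code's shape remains."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)


def structural_fingerprint(func) -> str:
    # Only the body is compared, so function and parameter names do not matter.
    body = ast.Module(body=copy.deepcopy(func.body), type_ignores=[])
    # ast.dump leaves out line/column attributes by default.
    return ast.dump(_Normalize().visit(body))


def find_structural_duplicates(source: str) -> List[List[str]]:
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            groups[structural_fingerprint(node)].append(node.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


SAMPLE = '''
def price_with_tax(items):
    total = 0
    for item in items:
        total += item.price * 1.2
    return total

def price_with_discount(products):
    result = 0
    for product in products:
        result += product.price * 0.9
    return result

def count_items(items):
    return len(items)
'''

duplicates = find_structural_duplicates(SAMPLE)
print(duplicates)  # [['price_with_discount', 'price_with_tax']]
```

The two flagged functions differ only in a multiplier, which is exactly the "should probably be a param" case. Exact-shape matching is deliberately strict; a real tool would also score partial matches.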
One of the things I rely on most is "pressure" -> which conditions are causing the most checks throughout the code-base. Those are things you should Type away.
Dynamically typed languages like Ruby create a huge surface area for type slop from LLMs, which is why I would not recommend using a dynamically typed language for vibe coding.
You can have type "pressure" and nil "pressure" -> where you set a value to nil somewhere (that you probably shouldn't have) -> and that has ripple effects all throughout your codebase. Similarly, you can do this for values -> in one place it's a string (where it shouldn't be), everywhere else a symbol (what it should be) -> but now you've got hundreds of to_sym or to_s casts in your codebase.
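A very crude way to get a first number for this kind of pressure is to count the defensive idioms in the source text. The sketch below is an assumption about how one might approximate it, not the tooling described above: it is regex-based, scans raw Ruby text from Python, and will also count matches inside comments and strings. The pattern names and the sample method are hypothetical.

```python
import re
from collections import Counter

# Hypothetical pattern set: each label maps to Ruby idioms that usually
# signal a value being defended against rather than typed away.
PRESSURE_PATTERNS = {
    "nil_check": re.compile(r"\.nil\?|&\.|[=!]=\s*nil\b"),
    "symbol_string_cast": re.compile(r"\.to_sym\b|\.to_s\b"),
}


def count_pressure(ruby_source: str) -> Counter:
    """Count occurrences of each pressure pattern in raw Ruby source text."""
    counts = Counter()
    for label, pattern in PRESSURE_PATTERNS.items():
        counts[label] = len(pattern.findall(ruby_source))
    return counts


SAMPLE_RUBY = '''
def label(user)
  return "" if user.nil?
  role = user.role.to_sym
  name = user&.name.to_s
  role == :admin ? name.to_s.upcase : name
end
'''

pressure = count_pressure(SAMPLE_RUBY)
print(dict(pressure))  # {'nil_check': 2, 'symbol_string_cast': 3}
```

Tracking these counts per file or per commit is what makes them useful: a change that adds twenty new casts is a hint that a value has the wrong type at its source.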
There's also state drift & reification misses. State drift -> you constantly update two states (that should probably just be one new value or a function) and sometimes you forget to update one (more of a bug possibility than complexity). Reification misses -> you constantly check for multiple conditions that should probably be one value or a function, and similarly you may sometimes miss one (again, a bug risk).
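As a small illustration of reifying a repeated multi-part check, here is a sketch with a hypothetical Order type. Instead of every caller re-testing three flags (and occasionally forgetting one), the rule gets a single name and a single definition.

```python
from dataclasses import dataclass


@dataclass
class Order:
    paid: bool
    shipped: bool
    cancelled: bool

    @property
    def refundable(self) -> bool:
        # The one place that defines the rule; call sites no longer repeat
        # "paid and not shipped and not cancelled" by hand.
        return self.paid and not self.shipped and not self.cancelled


orders = [
    Order(paid=True, shipped=False, cancelled=False),
    Order(paid=True, shipped=True, cancelled=False),
    Order(paid=False, shipped=False, cancelled=False),
]
refundable_flags = [order.refundable for order in orders]
print(refundable_flags)  # [True, False, False]
```

The same move addresses state drift: a value derived from other state on demand cannot fall out of sync with it, unlike a second field that has to be updated by hand.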
Complexity comes down to state and control flow -> so you want to check what's causing you to make the most decisions (especially state/time based), and where it's coming from. Where do you have the most state and why...
I'm hoping to release everything in the next few weeks, but it takes a while to polish things, especially when it's a side-quest of a side project...
What metrics have you found useful?
Mercury-2 is amazing. I am using it frequently as the arbiter in llm-consortium. The context window is relatively small, so to make it work with larger consortiums I can construct a recursive sort-of meta consortium like this:
Now when I prompt cns-meta-glm-kimi it will pick the best of five from kimi and glm before creating a synthesis from the two winners.
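The shape of that recursion can be sketched in plain Python. This is not llm-consortium's configuration or API; models and arbiters are stand-in functions, and every name below is hypothetical. The point is only that a consortium is itself callable like a model, so it can be a member of another consortium, which keeps each arbiter's input small.

```python
import itertools
from typing import Callable, List

Model = Callable[[str], str]
Arbiter = Callable[[str, List[str]], str]  # judges or merges candidate answers


def consortium(members: List[Model], arbiter: Arbiter, samples_per_member: int = 1) -> Model:
    """Wrap members and an arbiter into something that behaves like one model."""
    def run(prompt: str) -> str:
        candidates = [m(prompt) for m in members for _ in range(samples_per_member)]
        return arbiter(prompt, candidates)
    return run


def make_stub(name: str) -> Model:
    # Stand-in for a real model call: returns a numbered draft each time.
    calls = itertools.count(1)
    return lambda prompt: f"{name} draft {next(calls)}"


def pick_last(prompt: str, candidates: List[str]) -> str:
    return candidates[-1]  # placeholder for an arbiter choosing the best draft


def synthesize(prompt: str, candidates: List[str]) -> str:
    return " + ".join(candidates)  # placeholder for an arbiter writing a synthesis


kimi_best_of_five = consortium([make_stub("kimi")], pick_last, samples_per_member=5)
glm_best_of_five = consortium([make_stub("glm")], pick_last, samples_per_member=5)
meta = consortium([kimi_best_of_five, glm_best_of_five], synthesize)

answer = meta("explain the bug")
print(answer)  # kimi draft 5 + glm draft 5
```

Each inner arbiter only ever sees five drafts from one model, and the outer one only sees two winners, which is how a small-context arbiter can still sit on top of a large pool of candidates.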
I've found the average output of many suboptimal models is still suboptimal, especially when it comes to judging the accuracy/correctness of the work of other models.
I did some benchmarks recently of how well various models find security vulnerabilities, and then follow up testing of the judging process of whether the models found the right bug and whether other bugs it reported were false positives or legitimate other bugs. A committee of good-not-great models (DeepSeek, MiMo, Gemma 4) cannot replicate the accuracy of Opus by itself. Even when all three of the other models disagreed with Opus, Opus was almost always the one that was actually right.
It's an interesting area for research. And, a model that's very fast can make a lot more attempts at a solution, and in cases where there is an unambiguous "right" solution that can be proven by some sort of static rule, "very fast" may be a useful characteristic. Small classification problems, where you need to make thousands of decisions about some specific aspect of a large corpus of data, seems like a sweet spot for a model like Mercury.
I have had a better experience with my own use. I use it every day and it rarely fails to improve tasks. Perhaps the prompts and rubrics make a difference. And finding bugs is one of the better use cases because it is essentially a search problem. As long as models are non-deterministic and there is some diversity in training data, an ensemble that iterates on the problem is more likely to cover the ground needed to solve a problem.
Some tasks benefit from this approach more than others. There was a paper from Google on a version they made which was very similar and achieved SOTA at the time on planning and pathfinding benchmarks.
edit:
Mind Evolution paper https://deepmind.google/research/publications/122391/
(That was a month after I published llm-consortium :) https://xcancel.com/karpathy/status/1870692546969735361
I wonder how much this will impact locally used models for coding. I can imagine using diffusion models that are x-times faster than Qwen or Gemma 4 - where I have to do more "pre-ai" work which is a good thing and can have a very fast, very cheap model to work with locally. I assume since it doesn't do heavy computing for a long time that it's cheaper to run on local hardware as well?
I get exactly what you mean. After getting frustrated with how slow Claude was on my personal projects, I switched to Google Antigravity with Flash models and the speed difference is huge. I feel more in the flow and just more focused on the task. I did not realize how much of a difference speed can make.
Claude is better for extremely complicated, large codebases where its slower response time might be a good trade-off for the complexity of the task. Antigravity and other fast models work so much better for smaller projects where you want a "flowy" code, run, debug cycle.
YESSSS!!! speed is THE way! I like my boilerplate POJOs/data classes generated at breakneck pace of 300+ tok/s, Flash-Lite is more useful than GPT-5.5 for me this way. if it's too slow, you just stay in that goddamn async death loop
> I like my boilerplate POJOs/data classes generated at breakneck pace of 300+ tok/s
Regardless of speed, use the LLM to eliminate the need for boilerplate rather than just creating more code faster.
> if it's too slow, you just stay in that goddamn async death loop
Things get slow when you're ballooning the size of your code, files, design and architecture, and things get more involved and complicated, piling fast hacks on top of fast hacks until everything gets brittle.
Slow is fast, longer-term anyways.
For boilerplate, yeah. But when asking research or exploratory questions, or weighing whether a feature is well designed, or asking "can I implement _x_ feature using these libraries without introducing unnecessary complexity", then GPT-5.5 medium is still fast enough.
10-20 seconds times a couple turns on a new feature isn't bad. Kimi is also similarly fast if not faster.
I do agree with smaller models for more constrained/routine tasks though.
Could you say more about how you use it? What does your workflow look like?
Imagine you’re entirely pre-AI… to do some work, you read code, think, then write some code across a number of files. Usually then a small dance with compilation/unit tests to address anything broken. Along the way, you use your human judgement on style and quality, and midway through your change you might refactor something based on learned best practices (eg, when to use a static method, or helper class).
Today, even the dumbest AI agents can trivially loop through the final dance to get compilation passing, and often unit tests too (depending on scope of failure). Big SOTA agents have OK code quality, but if left unattended or unchecked will still generate pretty sloppy repos after a while. That’s true even when using models like Opus, which is ridiculously expensive in comparison.
When using the models in this fast “pair programming” style, I find that I (the human) mostly do all the “plan and think” work, and usually point the smaller agent towards specific files/directories, with specific targeted changes. It’s slower than 1-shot prompting an entire feature, but slightly faster than doing it manually, and I find the code is less “slop” because the changes are smaller and more human. It retains the agentic benefits of handling imports, compilation iteration, etc and can do basic cross-file plumbing. It’s also cheap and fast to do iterations like “wait make that method static” or “let’s update this to use <other util class>” and things like that. When the agent is slow to make localized edits, I find I’m less likely to push for minor nit-picks and style updates.
So you're making smaller edits?
Wow... I forgot about that. Mercury is brutal. I had it review lint errors and the speed is just insane.