
Comment by kethinov

5 hours ago

Can someone explain what the current state of model benchmarking is? If you try to look up what the best locally runnable model is, you get a bunch of random blog posts using idiosyncratic criteria to rank things seemingly based on one dude's opinion.

Ideally I would love to see a leaderboard with relatively objective ranking criteria that 1. lets you filter by open weight / locally runnable, 2. lets you filter by date of release (nothing older than x), and 3. is agnostic to hardware requirements. I just want to know what the best model is. Let me worry about how I will afford to run it.
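To make the wish list concrete, here is a rough Python sketch of the three filters. Everything in it is an assumption for illustration: the `Entry` fields, the `rank` function, and the rows in `ENTRIES` are made up, and no real leaderboard or scoring method is implied.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    name: str          # hypothetical model name
    score: float       # hypothetical aggregate score, higher is better
    open_weight: bool  # True if the weights can be downloaded and run locally
    released: date     # release date


def rank(entries, open_only=True, not_before=None):
    """Return entries sorted by score, best first, after applying the filters.

    Hardware requirements are deliberately not a field, so the ranking
    stays hardware-agnostic.
    """
    kept = [
        e for e in entries
        if (e.open_weight or not open_only)
        and (not_before is None or e.released >= not_before)
    ]
    return sorted(kept, key=lambda e: e.score, reverse=True)


# Made-up rows, purely to exercise the filters.
ENTRIES = [
    Entry("model-a", 71.0, True, date(2025, 1, 10)),
    Entry("model-b", 83.5, False, date(2025, 6, 1)),
    Entry("model-c", 78.2, True, date(2025, 5, 20)),
    Entry("model-d", 80.0, True, date(2023, 3, 1)),
]

if __name__ == "__main__":
    # Open-weight models released in 2025 or later, best first.
    for e in rank(ENTRIES, open_only=True, not_before=date(2025, 1, 1)):
        print(e.name, e.score)
```

The filtering is the easy part. The hard part is where a trustworthy `score` would come from.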

I love the llmfit project for seeing what will run on your hardware, but it would be nice to know what I'm missing out on by not having better hardware, which is why objective hardware-agnostic ratings would be helpful.

That would be nice, but it's not going to be possible.

Any open benchmark has a very short life, since it will be pulled in and DPO / RL trained quickly for benchmaxxing purposes. So, you'll need a private test to have a hope of something fair. (These also get leaked over time, btw, so even then there's a window of usability).

These are expensive to run.

Now consider that there might be 15-20 viable quants for a given open model release; someone would have to be willing to pay for these private evals to be run on each of them. Even then, a good read through Unsloth's commits and blog posts will remind you that there's quite a lot of engineering work to be done to get model inference working properly, even for models released by frontier or near-frontier labs. So, you'd want to make sure that you have a replicable 'best engineered' deployment to evaluate, or at least one that's closest to your hardware and fits the bill.

Upshot: it's much faster to download and try out a model, and possibly cheaper too. Well, cheaper since Hugging Face is paying the bandwidth bills.

>I just want to know what the best model is. Let me worry about how I will afford to run it.

This is a very typical manager question, and I suppose many people ask it because they fail to see the simple truth: there is no "best" model. There are only best models for certain use cases. Sometimes you'll find these in custom community leaderboards on platforms like Hugging Face, but for most business applications you'll probably have to come up with your own benchmark. Most common benchmarks are pretty worthless by now because all the usual ones are being gamed hard by model providers, to the point that models with very similar scores on common benchmarks can now differ drastically in actual use.

The best thing I have come up with is to just make a bunch of prompts / tasks that I personally care about and need a model to know how to do. As an example, when qwen3.6 27B dropped, I ran it, kimi, claude and glm 5/5.1 on a bunch of LLM-architecture specific tasks (stuff like 'implement an incremental KV-cache for autoregressive transformer inference' or 'implement flash Attention backward pass with D-optimization') and analyzed the results: which models wrote tests, whether those tests were valid, whether the implementation actually worked or the model only claimed it did, that sort of thing.

It is a day's or a weekend's worth of work, but I think this is the best way to determine whether the model fits your needs specifically. This is what led me to find out that qwen 27b outperformed even kimi on those tasks, and that opus tries gaslighting me when I give it a spec for something that has been proven possible but has no published solution online. All the other models gave their best shot at solving it; opus just said it's not possible (even when I gave it the finished product that obviously works).

Especially for small models (but also big ones), I think the only way to know if a model will improve your workflow is this: personal benchmarks, expanded over time and run in private.
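For anyone who wants a starting point, a personal benchmark like this can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the harness used above: `Task`, `run_benchmark`, `pass_rates`, and `save` are hypothetical names, each model is assumed to be any callable that takes a prompt and returns text, and the two stub models stand in for real local or API calls. The `check` functions here are toy string checks. In practice a check would do what is described above, such as running the generated code and its tests.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the answer is acceptable


def run_benchmark(models: Dict[str, Callable[[str], str]],
                  tasks: List[Task]) -> Dict[str, Dict[str, bool]]:
    """Run every task against every model and record pass/fail per task."""
    results: Dict[str, Dict[str, bool]] = {}
    for model_name, generate in models.items():
        results[model_name] = {}
        for task in tasks:
            try:
                passed = bool(task.check(generate(task.prompt)))
            except Exception:
                # A crash in the model call or in the check counts as a fail.
                passed = False
            results[model_name][task.name] = passed
    return results


def pass_rates(results: Dict[str, Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of tasks passed per model."""
    return {m: sum(r.values()) / len(r) for m, r in results.items() if r}


def save(results: Dict[str, Dict[str, bool]], path: str) -> None:
    """Append one run to a local JSONL file so the history stays private."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(results) + "\n")


# Toy tasks and stub models, purely illustrative.
TASKS = [
    Task("mentions-cache",
         "Implement an incremental KV-cache for autoregressive transformer inference.",
         lambda answer: "cache" in answer.lower()),
    Task("non-empty", "Say anything.", lambda answer: bool(answer.strip())),
]
MODELS = {
    "stub-good": lambda prompt: "Here is a KV cache implementation.",
    "stub-refuses": lambda prompt: "That is not possible.",
}

if __name__ == "__main__":
    results = run_benchmark(MODELS, TASKS)
    print(pass_rates(results))
```

Keeping the tasks and results in local files is what makes it "expanded over time, run in private": new tasks get appended to the list, and nothing gets published where it can end up in training data.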