Comment by teiferer
5 days ago
> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.
And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.
Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.
If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.
You can’t benchmaxx an eval that comes after your model release.
Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models is directly correlated with how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.
Remember the famous case of asserted benchmaxxing from Llama 4? The entire org was gutted and the CEO spent billions hiring better people. Every lab takes evaluations extremely seriously.
> You can’t benchmaxx an eval that comes after your model release
Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.
Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.
ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.
throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
You are literally describing a benchmark
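For what it's worth, that benchmark fits in a few lines. A rough sketch only, assuming each model is wrapped in a plain function: `pick_daily_task`, `compare`, the task pool, and the toy "models" are all made-up stand-ins, not any vendor's API, and exact string matching is a deliberately crude grader.

```python
import datetime
import hashlib
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # hypothetical: wraps whatever API call you use
Task = Tuple[str, str]        # (prompt, expected answer)


def pick_daily_task(tasks: List[Task], day: datetime.date) -> Task:
    """Rotate the task by date so no single prompt stays a fixed target."""
    digest = hashlib.sha256(day.isoformat().encode("utf-8")).digest()
    return tasks[int.from_bytes(digest[:4], "big") % len(tasks)]


def compare(models: Dict[str, Model], task: Task) -> Dict[str, bool]:
    """Send the same prompt to every model and record pass/fail per model."""
    prompt, expected = task
    return {name: model(prompt).strip() == expected for name, model in models.items()}


# Toy stand-ins so the sketch runs without any network access.
TASKS: List[Task] = [("Reverse the string 'abc'.", "cba"), ("What is 17 * 3?", "51")]
MODELS: Dict[str, Model] = {
    "always_cba": lambda prompt: "cba",
    "always_51": lambda prompt: " 51\n",
}

results = compare(MODELS, pick_daily_task(TASKS, datetime.date.today()))
print(results)
```

The point is only that every model sees the identical prompt on the same day, which is the apples-to-apples part. The grading step is where a real harness gets hard.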
100% agree on this! These new models' best performance is always experienced in the first hour of communicating with them. If you have a specific problem with a clear goal in mind, then you have one hour to get the best out of any AI model. Personally, every time I took an AI suggestion, I walked through a wall sideways. AI is hands down a smart technology that throws dictionary vibes!
Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.
That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.
> students are evaluated by teachers with more knowledge and experience than them
This starts to break down in college, when the professors are often at best only slightly ahead (they have more knowledge and experience, but in a slightly different area, so it isn't relevant to the depth of whatever is under consideration). Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.
> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.
How is this remotely true? You can have verifiable tasks that you can’t do. Where does this idea come from??
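A toy illustration of that asymmetry, as a sketch rather than a claim about how any eval actually works: checking a claimed factorization is a handful of multiplications, even when finding the factors yourself would be hard. The function name `verify_factorization` is made up for this example.

```python
from typing import List


def verify_factorization(n: int, factors: List[int]) -> bool:
    """Accept a claimed nontrivial factorization: all factors > 1, product equals n."""
    product = 1
    for f in factors:
        if f <= 1:
            return False
        product *= f
    return product == n


# Two known Mersenne primes; the checker never needs to find them itself.
p, q = 2**61 - 1, 2**31 - 1
n = p * q

print(verify_factorization(n, [p, q]))
print(verify_factorization(n, [3, n // 3]))
```

The verifier only multiplies. It needs no ability to solve the problem, which is the sense in which a weaker checker can grade a stronger solver on tasks like this.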
I've been testing some models that score higher than Opus 4.6.
They:
- hallucinate constantly
- can't follow basic instructions
- think they're Claude for some reason ;)
The only one I see that thinks it is claude other than claude itself is the GLM series.
Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.
> real life resists those kinds of measurements
no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.
Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.
"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.
zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.
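To make the "multiple definitions, different weights" point concrete, here is a small sketch. Everything in it is invented for illustration: the two languages are anonymous, the scores are not measurements, and the criteria names are arbitrary.

```python
from typing import Dict, List

# Made-up scores on a 0-10 scale for two hypothetical languages.
SCORES: Dict[str, Dict[str, float]] = {
    "lang_a": {"compile_speed": 9, "memory_safety": 6, "ecosystem_fit": 7},
    "lang_b": {"compile_speed": 4, "memory_safety": 10, "ecosystem_fit": 6},
}


def rank(scores: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Order candidates by weighted sum; criteria missing from weights count as 0."""
    totals = {
        name: sum(weights.get(criterion, 0.0) * value for criterion, value in metrics.items())
        for name, metrics in scores.items()
    }
    return sorted(totals, key=lambda name: totals[name], reverse=True)


# Two people measuring the same things but caring about them differently.
fast_iteration = {"compile_speed": 0.6, "memory_safety": 0.2, "ecosystem_fit": 0.2}
safety_first = {"compile_speed": 0.1, "memory_safety": 0.7, "ecosystem_fit": 0.2}

print(rank(SCORES, fast_iteration))
print(rank(SCORES, safety_first))
```

Same measurements, opposite rankings. The disagreement lives in the weights, not in whether the thing can be measured.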
Don't you think this applies to LLMs too?
> determine quantitatively forever whether Rust is a superior programming language to Go
Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.
So .. where can we read about the results?
There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.
So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...
Yes, these are gut feelings. That said, I have lots of experiences with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose project matters to them. :P
Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should be specific about what you want rather than giving an instruction as broad as "do not make mistakes". It just does not work that way.
"Check your work for mistakes after the first draft" maybe :)
Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.
No, relative performance between Python and Java can absolutely be measured.
Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.
I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.
I know how stupid that sounds but it's true.
Well what do they say... "If it sounds stupid but it works, then it's not stupid!"
How do you measure the performance of people? This is subjective and biased every time.
I have a couple projects that have completely stalled because none of the frontier models could advance any further with them - I'm going to give fable a try at them this coming weekend.
I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.
ymmv
Yes, words matter.
My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".
AV professionals always say "timecode" - timestamp is a programming term.
Using the right word pushes the model closer to the correct spot in the cloud of vectors that is its "brain".
fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)
Addendum: Interestingly, it ended up taking me about the same amount of time - 8 hours or so - to hit the "vibe limit" with Fable. But in that amount of time I made about 5-10x as much progress. So my feelings are:
1. It's exponentially better
2. yet, somehow, hand coding still isn't dead, at least for me
How many $ do you guys spend when your session runs for 30min? What's the total budget?
I just have a regular Claude subscription and keep within its usage limits
Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.
That’s what evals are for.
And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.
I think (related to the threads below) properly running evals on state-of-the-art models is likely outside the budget for most individuals. It's undoubtedly the right thing to do, though.
It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.
IMO comparing different models is like comparing songs or paintings or modern art.
There is no true objective measure. Can you mathematically determine which song is the best for everyone, for example? Or which painting different people feel is the nicest to look at, or what emotion it gives them?
Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.
You can also do benchmarks but how do you measure the output of those?
The easiest way is just to use them all and get a feel for which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than the others, but I wouldn't write backend code with it.
> IMO comparing different models is like comparing songs or paintings or modern art.
I don't think this is that subjective or vague.
There are a couple of crisp metrics that can be used to evaluate a model:
- given a prompt, does it finish a task (times X tasks)
- how much did it cost to finish the task
- how long did it take?
If all models are able to handle a class of tasks, they perform equally well.
If a model costs much more to finish a task, it is worse than other models.
If a model takes longer to finish a task, it is worse than other models.
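As a sketch of how those three numbers could be tallied (the `Run` record and `summarize` helper are hypothetical names, and the sample figures are invented, not results from any model):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    model: str       # model label, e.g. "a" or "b" (placeholders)
    finished: bool   # did it complete the task?
    cost_usd: float  # what the attempt cost
    seconds: float   # wall-clock time for the attempt


def summarize(runs: List[Run]) -> Dict[str, Dict[str, float]]:
    """Per model: completion rate, mean cost, and mean duration over all runs."""
    by_model: Dict[str, List[Run]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)
    summary: Dict[str, Dict[str, float]] = {}
    for model, model_runs in by_model.items():
        count = len(model_runs)
        summary[model] = {
            "completion_rate": sum(r.finished for r in model_runs) / count,
            "mean_cost_usd": sum(r.cost_usd for r in model_runs) / count,
            "mean_seconds": sum(r.seconds for r in model_runs) / count,
        }
    return summary


# Invented sample data, only to show the shape of the output.
RUNS = [
    Run("a", True, 1.0, 10.0),
    Run("a", False, 3.0, 30.0),
    Run("b", True, 2.0, 20.0),
]
print(summarize(RUNS))
```

Note that the means here include failed attempts. Whether to count the cost and time of failures is a design choice that changes the comparison.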
The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows were increased, and reasoning steps help improve the usefulness of a user's prompt. That's it. Even those are UX improvements rather than huge breakthroughs.
"Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?
Or just that it's so much cheaper that the cost/benefit ratio is better?
"Finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?
The first thing in the release page is benchmark results...
https://www.anthropic.com/news/claude-fable-5-mythos-5
The benchmarks are now the equivalents of SAT/ACT/other standardized exams for humans. They are directionally quite predictive, but with plenty of outcome variance on the margins
Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed
It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ
I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.
"Don't make mistakes" does seem dumb. It's not guidance.
> These comparisons are all gut feelings.
https://simonwillison.net/about/#disclosures
"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."
But I'm totally unbiased on my gut-feeling posts, trust me bro.
-- AI influencers.
Anthropic didn't give me early access to this model, shouldn't that bias me against it?
You kinda proved the point...
This isn't some random dipshit, this is Simon Willison[1]. He has a bit more cred than some "AI influencer".
[1]https://en.wikipedia.org/wiki/Simon_Willison
Check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person who gets advance notice of and access to models.
Fable just got announced and I rushed out an article because people are curious. I released the post mere hours afterwards, and it takes time to create the output, slice it into videos, and make a WordPress article, on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.
If you check the links, my previous articles have all the juicy stuff you are criticising me for leaving out of this one, which I wrote with little preparation.
How is a side by side direct comparison NOT precise?
[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.
[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours' advance warning.
This is NOT a misplaced rant, this is a very good description of what I feel as well. You've put it very well.
How is this meaningfully different than simonw's pelicans riding a bicycle? If anything, this seems to be of a higher caliber?
It feels like hand written software will now be "bespoke"