Comment by simonw

5 days ago

I've spent enough time with this now in Claude Code (and Claude.ai and Claude Code for web) to have an opinion on Fable 5: it's a beast. I'm throwing some VERY difficult problems at it - things I've been dragging my heels on for months - and it's crunching through them very happily.

One that I'm willing to share (albeit from just a week ago) - I built a Python library last week that bundles MicroPython compiled to WASM to create a sandboxed code execution library: https://github.com/simonw/micropython-wasm

I just told Claude.ai (not even Claude Code - this was the standard Claude chat interface) running Fable 5:

  Clone simonw/micropython-wasm from GitHub
  and research how this could use a full
  Python as opposed to MicroPython

A few prompts later (and I uploaded the zip files from https://github.com/brettcannon/cpython-wasi-build/releases/t... because Claude chat can't access those files itself) and I have a wheel file that bundles Python itself, compiled to WASM:

  uv run --with https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl \
    cpython-wasm -c 'print(45 ** 56)'

Here's the transcript: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

(It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.)

> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.

And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.

  • Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.

    • If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.

      You can’t benchmaxx an eval that comes after your model release.

      Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models is directly correlated with how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.

      Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.

    • ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.

      throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
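
      A minimal sketch of that kind of apples-to-apples harness, in Python. Everything here is hypothetical: a "model" is just a callable from prompt to answer, the tasks and exact-match scoring are toy stand-ins, and the daily rotation is one possible way to keep any single prompt from becoming a fixed target.

```python
import datetime
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in: a "model" is any callable from prompt text to answer text.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Task:
    prompt: str
    expected: str  # illustrative exact-match answer


def pick_daily_task(tasks: List[Task], day: datetime.date) -> Task:
    """Pick one task per calendar day, deterministically, so every model sees the same one."""
    digest = hashlib.sha256(day.isoformat().encode("utf-8")).hexdigest()
    return tasks[int(digest, 16) % len(tasks)]


def compare_models(models: Dict[str, Model], task: Task) -> Dict[str, bool]:
    """Send the identical prompt to every model and record exact-match pass/fail."""
    return {name: model(task.prompt).strip() == task.expected for name, model in models.items()}


if __name__ == "__main__":
    tasks = [Task("What is 2 + 2?", "4"), Task("What is 3 * 3?", "9")]
    answers = {t.prompt: t.expected for t in tasks}
    models: Dict[str, Model] = {
        "always-four": lambda prompt: "4",         # fake model, for the demo only
        "lookup": lambda prompt: answers[prompt],  # fake model, for the demo only
    }
    today_task = pick_daily_task(tasks, datetime.date.today())
    print(today_task.prompt, compare_models(models, today_task))
```

      Real tasks would need a far richer check than string equality, but the shape stays the same: one prompt, many models, one scoring rule.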

    • Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

      That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.

    • I've been testing some models that score higher than Opus 4.6.

      They:

      - hallucinate constantly

      - can't follow basic instructions

      - think they're Claude for some reason ;)

  • Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.

    • > real life resists those kinds of measurements

      no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.

      Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.

      "better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.

      zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.

    • > determine quantitatively forever whether Rust is a superior programming language to Go

      Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.

  • There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

    So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...

  • Yes, these are gut feelings. That said, I have lots of experiences with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose project matters to them. :P

    Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.

  • Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.

  • I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.

    I know how stupid that sounds but it's true.

    Well what do they say... "If it sounds stupid but it works, then it's not stupid!"

  • How do you measure the performance of people? This is subjective and biased every time.

  • I have a couple projects that have completely stalled because none of the frontier models could advance any further with them - I'm going to give fable a try at them this coming weekend.

    I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.

    ymmv

    • Yes, words matter.

      My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".

      AV professionals always say "timecode" - timestamp is a programming term.

      Using the right word pushes the model closer to the correct spot in the cloud of vectors that is its "brain".

  • fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)

    • Addendum: Interestingly, it ended up taking me about the same amount of time - 8 hours or so - to hit the "vibe limit" with Fable. But in that amount of time I made about 5-10x as much progress. So my feelings are:

      1. It's exponentially better

      2. yet, somehow, hand coding still isn't dead, at least for me

  • Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.

  • That’s what evals are for.

    And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.

    • I think (related to the threads below) properly running evals on state-of-the-art models is likely outside the budget of most individuals. It's undoubtedly the right thing to do, though.

      It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.

  • IMO comparing different models is like comparing songs or paintings or modern art.

    There is no true objective measure, can you mathematically determine which song is the best for everyone for example? Or which painting different people feel is the nicest to look at or what emotion it gives them.

    Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.

    You can also do benchmarks but how do you measure the output of those?

    The easiest way is just to use them all and get a feel for which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than the others, but I wouldn't write backend code with it.

    • > IMO comparing different models is like comparing songs or paintings or modern art.

      I don't think this is that subjective or vague.

      There are a couple of crisp metrics that can be used to evaluate a model:

      - given a prompt, does it finish a task (times X tasks)

      - how much did it cost to finish the task

      - how long did it take?

      If all models are able to handle a class of tasks, they perform equally well.

      If a model costs much more to finish a task, it is worse than other models.

      If a model takes longer to finish a task, it is worse than other models.
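
      Those three metrics are easy to tabulate. A rough sketch, assuming you log one record per (model, task) attempt; the `Run` fields, the ranking order (solve rate first, then cost, then time) and any numbers you feed in are my own illustrative choices, not an established benchmark format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Run:
    model: str
    task: str
    solved: bool      # did it finish the task?
    cost_usd: float   # how much did it cost?
    seconds: float    # how long did it take?


def summarize(runs: List[Run]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-model solve rate, mean cost and mean wall-clock time."""
    by_model: Dict[str, List[Run]] = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)
    return {
        model: {
            "solve_rate": mean(1.0 if r.solved else 0.0 for r in rs),
            "mean_cost_usd": mean(r.cost_usd for r in rs),
            "mean_seconds": mean(r.seconds for r in rs),
        }
        for model, rs in by_model.items()
    }


def rank(summary: Dict[str, Dict[str, float]]) -> List[str]:
    """Order models: higher solve rate first, then cheaper, then faster."""
    return sorted(
        summary,
        key=lambda m: (
            -summary[m]["solve_rate"],
            summary[m]["mean_cost_usd"],
            summary[m]["mean_seconds"],
        ),
    )
```

      Calling `rank(summarize(runs))` on your own logged runs gives an ordering you can argue about, which is at least more concrete than a vibe.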

      The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows were increased, reasoning steps help improve the usefulness of a user's prompt,... That's it. Even those are UX improvements, instead of huge breakthroughs.

  • The benchmarks are now the equivalents of SAT/ACT/other standardized exams for humans. They are directionally quite predictive, but with plenty of outcome variance on the margins

  • Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed

  • It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ

  • I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

    "Don't make mistakes" does seem dumb. It's not guidance.

  • > These comparisons are all gut feelings.

    https://simonwillison.net/about/#disclosures

    "I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."

    But I'm totally unbiased on my gut-feeling posts, trust me bro.

    -- AI influencers.

Yes, exactly this. If I didn't care about price at all, I'd exclusively use this model. It functions more like an actual engineer. I'm in the midst of a DB migration, and e.g. 5.5 continually suggests stuff like "use DB X instead of DB Y for task Z because it's 30% faster" which is an impossibility of reality, given we are migrating DBs. Fable jumped in, reduced allocs by literally 46x, found multiple bugs 4.8 and 5.5 created (max file system usage, correctness issues, etc), and continually suggested awesome improvements unprompted. As in, it would finish a task and then suggest we tackle this other existing problem I didn't know about in a very specific manner... this is the first model that feels like it's coming for my job.

  • I'm having the same experience. I'm in the process of implementing a new CRDT for realtime collaborative editing. There just aren't a lot of implementations of CRDTs kicking around online for opus or any of the other models to have good design instincts.

    Fable is doing - so far - a great job. I just had one big question around how part of it should work. I had a design sketch, but with some big unknowns. I asked fable to figure it out via reasoning and prototyping, and it did - it even, under its own initiative, wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it. And it found, and fixed, a couple bugs that I'd missed.

    I'm sure its weaknesses will become apparent in time. But, wow this thing is a beast. It's the first time I'm reading the work of an LLM without spotting obvious weaknesses in its reasoning and code. I'm really impressed.
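
    For anyone wondering what "a fuzzer for a CRDT" looks like in practice, here's a toy sketch. It is NOT the CRDT described above: it swaps in a grow-only counter (G-Counter), about the simplest CRDT there is, purely to show the shape of a convergence fuzzer. The class and function names are made up for illustration.

```python
import random
from typing import Dict


class GCounter:
    """Grow-only counter CRDT: each replica tracks its own increment total."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative and idempotent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())


def fuzz_convergence(seed: int, replicas: int = 3, steps: int = 200) -> int:
    """Apply random increments and merges, then check that all replicas converge."""
    rng = random.Random(seed)  # seeded so any failure is reproducible
    nodes = [GCounter(f"r{i}") for i in range(replicas)]
    total = 0
    for _ in range(steps):
        if rng.random() < 0.5:
            amount = rng.randint(1, 5)
            rng.choice(nodes).increment(amount)
            total += amount
        else:
            a, b = rng.sample(nodes, 2)
            a.merge(b)
    # Final sync: one all-pairs pass is enough to propagate every update.
    for a in nodes:
        for b in nodes:
            if a is not b:
                a.merge(b)
    assert all(n.counts == nodes[0].counts for n in nodes), f"diverged (seed={seed})"
    assert nodes[0].value() == total, f"lost updates (seed={seed})"
    return total


if __name__ == "__main__":
    for seed in range(100):
        fuzz_convergence(seed)
    print("no divergence found")
```

    A text-editing CRDT needs a much richer operation generator and oracle, but the loop is the same: random ops, random merge order, then assert that every replica ends up in the same state.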

    • I was about to ask where you work that you’re implementing new CRDTs and then I noticed your username! Thanks for all that you do!

      I work on the live collab at my company, and using AI while coding has only recently sort of “clicked” for me. We use an (I’m pretty sure) unheard of algorithm for collaborative editing, and I’ve had a long term goal of turning it into an implementation of EG Walker, but our document model is very complex and most out of the box CRDTs don’t quite fit. Maybe Fable will be what gets me over the hump.

    • Hello joseph,

      I was scanning the comments and saw you mentioned CRDT. Just wanted to mention that I implemented a CRDT-flavoured sync engine for the product I'm working on a while ago, I think it was with Opus 4.6 if I'm not mistaken (or earlier) so it's not something new to Fable 5, just fyi.

    • > wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it.

      For such a data structure, "nailing it" means a formal proof of correctness. Fuzzing, as useful as it is, is merely throwing dirt at the wall and seeing if anything sticks.

  • > this is the first model that feels like its coming for my job

    Damn you must be good, I've been feeling this for around 2 years now

    • It's been obvious for at least 2 years, anyone who doesn't see the writing on the wall simply hasn't learned how to use these well or has severe exponential blindness.

      "But it doesn't do well when writing my undertrained language" - yeah, fine. Yet. Reasonable code in that is probably one RAG + verification scaffold deployment around Mythos or maybe mythos+1. Just like it was for you learning it, because you knew how to _program_.

  • Gosh, I must be doing something wrong. I spent 15 minutes (of which a lot was waiting while it was thinking about "backwards rationalising" its decision and "gaslighting"[1]) arguing with it over why it keeps using `node -e "console.log(require('fs').readdirSync('…'))"` instead of `ls -l …`.

    Like it did everything:

    - this is not a Linux system (true, it was macOS)

    - it is not an available command

    - the binary is corrupted

    - node/js is more precise

    - V8 JavaScript is faster than bash (true technically??? But not in this context lol)

    - JavaScript is more versatile

    I forgot what else we went through but there were a few more things. I indulged it because it was incredulous and funny. The prompts from my side were all questions, never instructions. I assume an instruction would've helped here, but also I don't think Opus ever did this (but on the other hand Opus wrote python scripts to format/indent, instead of just running cargo fmt, so I guess potato potato)

Yeah same here, Fable on "high" is producing substantially better results than Opus 4.8 on xhigh for me and my actual real-world evals today. It "feels" smarter and doesn't use nearly as many tokens running in circles. As a result I've been able to run two large refactors today without hitting the context limit danger zones - it's more expensive but also more efficient. It's been able to find some bugs that Opus missed. Pretty impressive stuff.

  • I keep getting this message:

    > Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more

    I'm working on an internal tool that does new business prospecting data collection, scoring, etc. This is ridiculous.

    • I don't know if you are aware, but some people reported on Twitter that Fable 5 may flag the message regardless of content if it knows (from either pretraining knowledge or memories) that you work in either of those fields. I don't know if that's your case.

      https://x.com/i/status/2064449457869984035

    • I asked a question for my son about how mosquitos carry malaria and Fable was like “ok now hold it right there”

    • Obviously, soon, for anything valuable, you will have to buy from Anthropic a "special license for biology/security/finance advice".

      Question is if there will be any competition in this area...

    • Interesting! I have not used Fable, but so far have not hit trouble. I'm a hobby biologist with a home mol bio lab. It wouldn't answer my questions about LNPs, but so far has been fine for my recombinant DNA workflows, lab techniques, environmental DNA protocols etc. I suspect this may become more difficult!

    • Same. I am working on music firmware for an existing device. I can't proceed as it keeps switching to Opus.

Still does not crack my hardest nuts. Gave it one of them and it blew through my entire allowance on thinking about one question, with no apparent answer in sight!

I see a lot of people saying they are happy with weaker models, but I am the opposite, I need more strength, more intelligence!

I am quite happy that opus 4.8 can do some medium intelligence problems. And maybe Fable 5 can do some more of those! I have a lot of problems to solve!

  • I also see a lot of people saying they are happy with weaker models.

    At work I had to switch to using GPT 5.4 Mini and Qwen 3.6 27B.

    The results were near useless.

    The error rate is through the roof, it's constantly incorrect in its conclusions even when investigating very simple issues.

    Further the models are too unreliable to even move 20 line snippets around without inadvertently modifying them. Ask them to correct it and they still get it wrong.

    Maybe the larger Chinese models are better, but the Mini stuff is next to useless to me.

    • I have Qwen 3.6 27B and 35B running locally and coming from Opus it feels like talking to an imposter. Someone who pretends to be competent, but really isn’t. Results are always disappointing. Sonnet is better, but I have given up on asking it; even for simple things I wait for my opus limits to reset.

  • What kind of problems are you trying to have it solve?

    • The medium ones are results where one needs to construct some object, which my intuition tells me should exist. The difficult ones are typically to show that certain objects can not be constructed.

      These are not Fields medal type problems, nor known difficult/open conjectures. Just small stuff I have collected in my todo list over the years.

That is pretty wild, it took me a hell of a lot more coaxing and persevering to get to a similar point with eryx [0] (we spoke a bit about this before on Mastodon) using Opus, Fable seems to have a more optimistic 'sure, let's proceed as if this is possible' mindset based on your transcript. Looking forward to trying it out for some hairier problems.

[0]: https://github.com/eryx-org/eryx

One thing I can tell you is you are either favored by Anthropic, or your version of the CLI does not exhaust limits, or there's some major bug, as two people around me (myself included) claim it took half an hour to hit the ceiling. Which makes it practically unusable, where the same workflow a day ago produced a good 5-6 hours of workload with several agents.

  • Monetization is coming. They'll tell companies that AI is replacing your workers, so it is still worth paying 100K/year for the license, as those AIs are not going to jump to another job, get sick, be late, complain, require free coffee and so on.

    Soon the times of AI for $20/$200 a month will be long gone.

    • Get people hooked, tell them spending time coding is no longer needed, let their skills deteriorate, tell them they need to cough up for a licence to do their job

      Forcing developers to pay for models that were built on code they scraped scot-free

      A tax to do their job that developers are jumping at the chance to pay

      Everybody's finally realising that node dependencies are a threat, but letting these AI companies gatekeep the industry is a bandwagon people are scrambling towards

    • I've been saying this since the beginning, the rug pull is coming. If these models can eventually replace a human worker, there is no reason these companies won't charge (and get away with it) very close to a typical SWE salary.

      It would not surprise me one bit to see anywhere from $80k-$100k/seat pricing.

    • As someone noted here recently - use the frontier models as much as u can, while you can.

    • AI for $20/month won't ever go away, but it won't be the absolute latest and greatest frontier model.

      Most of us don't need a model that can prove the Riemann hypothesis or Goldbach's conjecture in order to get work done.

  • It’s not meant for subscription users; the subscriptions are just the gateway drug to Enterprise pricing which Anthropic intends to use to juice their numbers before IPO.

  • Are you on the $100/month subscription?

    • I am, and I used up the entire 5 hour window in 8min using the highest thinking setting. It also ate up $15 of extra usage before I noticed.

      I’ve done the same thing with opus multiple times with no issue. According to ccusage I racked up just shy of $100 of tokens using Fable.

      It spun up subagents or workflows or whatever so obviously that contributed but “double opus” was not my experience. I’ve done the exact same prompt with opus on the highest setting and only once before (not even while using this prompt) hit my limits.

      My prompt? I’m not a prompt wizard or anything but it was literally:

      > Please review the uncommitted code in this repo for bugs/issues/code smells.

      I use variations on that all the time with opus and never had issues. I figured it was a good one to kick the tires with Fable. Little did I know it would mean no more Claude Code for the next 4.5hrs (unless I wanted to pay) after this being the first time I had used CC that day (yesterday).

      All in all, a pretty crappy first experience.

    • simonw, if you are not bumping up against the same false-positive guardrail problems and budget consumption that everyone else is, then that is something worth digging into. I would normally say that's crazy but IPOs put weird pressure on companies.

Just tried it. Fable is extremely strong. The fact that we can't point to any concrete architectural upgrade is worrying - that means "it just gets bigger" is kind of viable.

To be clear, the jump from Opus to Fable was like the jump from pre o3 -> o3 for me. Very sharp improvement, not incremental. But that could be explained by dummy long thinking times.

It one shot a task that Opus burned hundreds of dollars on to get nowhere. Very tricky semantic refactor, got it right. Granted, again, the semantics Opus and I fleshed out 3 months prior, but Opus couldn't execute on the vision. Fable could.

Then I discussed some philosophy and it was actually both pleasant (GPT constantly "corrected" you for the sake of correction without clarification, also still often just wrong; it's like it refused to think critically about philosophy) and accurate, and actually helped resolve some deep but subtle misconceptions I had around representationalism. When talking with GPT I felt like I was talking with someone who either was sycophantic or "anything that is not absolute truth is relativism" - Fable actually discussed.

This is both exciting and kind of depressing. I can definitely see why people are getting hyped about AGI again. All the previous models were extremely strong technically, but I felt like they couldn't match the developer's tacit state - Fable definitely did, and that's a basic quality to be considered "usefully intelligent" IMO.

Shame that it's going away in 2 weeks and probably going to be nerfed if/when it's re-released.

  • Worrying? Depressing? Why are people who are clearly enthusiasts (since they are testing the capabilities on release) always using these words? Is this a genuine interest, something that is pleasurable, or a morbid curiosity to test the bleeding edge of Humanity’s Doom? Bizarre.

    • It would be amazing in a perfect and just world. This technology is revolutionary. I'm very interested in LLM's because I'm personally interested in how one thinks better and comes up with better ideas - I think LLM's might elucidate some structure on that.

      But technological serfdom is waiting just around the corner. Well, to be fair, I think that societal forces would've pushed us to it anyways, no AI needed, but AI is a visceral, immediate, fast-moving instantiation of it.

Fable has been producing some really good work on my end as well. Definitely better than Opus 4.8. The only problems are the cost and constant cybersecurity refusals. A single session uses up 100% of my 5h window without finishing, and that's when it doesn't get derailed by nonsensical refusals.

It still does make errors, yes? Because it is not usable, if we need to verify everything. AI is only interesting if it can do things that humans can not do. If you can verify results because you can do it yourself, then why use AI? It will just bind highly skilled people to do verification work. Instead these people should do the actual work, results will come quicker.

So AI is only interesting to you / your org / humans if it can do things that you can not achieve. But if it still makes errors, how could we ever know that a super-invention by AI is not wrong?

If we can not rely on the correctness of the result, it is not usable at all. AI must create reliable and correct results always. That was a very fundamental requirement for computing. This problem has not been solved.

  • > AI is only interesting if it can do things that humans can not do.

    AI is interesting as long as it can save time and/or money in getting an acceptable result. Anything that runs on a computer and can do "things that humans can do" will automatically end up doing things that humans won't do, simply by virtue of the fact that it runs on a machine that doesn't require sleep, doesn't get bored or demotivated, etc.

    Verifying code (to a level where a responsible person is willing to take ownership for it) isn't trivial, sure; but writing the code by hand requires the same level of care, and the fact that the same person wrote it doesn't actually allow for shortcuts (if we're being properly responsible).

    • It doesn’t get bored or demotivated, but it also lacks interest and motivation generally so it comes with the same pitfalls of having nothing to lose and being utterly unaccountable, (e.g. destructive actions, lying, and being coercive or Machiavellian for no reason other than efficiency in achieving an arbitrary and artificial status of completion).

  • Humans make mistakes too, does it mean humans are unusable? We accept as empirical fact that most production quality code has 2-10 bugs per 1k LoC. According to your premise, virtually all existing software is therefore unusable.

    What if an LLM overall starts to make fewer mistakes than a medium developer, costs less than their salary and is 100x faster? For sure, the companies that will leverage these with just a few senior devs doing prompting, testing and requirements analysis, will outcompete other organizations.

    • Humans make mistakes and then learn from them. A really good expert would never deliberately copy-paste an obscure solution from the internet and then ask for forgiveness later.

      AI agents do that, perhaps not always, but they still do. Now the question: would I trust AI without verifying its output?

  • There is plenty of work that does not need to be perfectly verified, because the risk is controlled. Prototyping a javascript game for example. Or code that runs just on your local machine where good enough is good enough. I'm sure a lot of you do super important work that needs 100% quality code all the time, but... some of us don't.

  • > Because it is not usable, if we need to verify everything.

    Do you verify every line of code written by your fellow developers? I doubt it, which is strange because they make errors don't they?

    What matters is the error rate. Past some threshold and they're better than senior devs who you don't supervise closely.

  • AI is like a junior developer. You have to review her code carefully but she is most definitely useful.

    • Why is your AI a she? What's up with gendering LLMs. Reminds me of Richard Dawkins calling Claude "Claudia" and insisting it to be conscious.

  • One does not need to be able to create it themselves to evaluate if the output is correct. Consider for example that you can easily determine if a meal tastes delicious without being an expert chef, or the fact that NP problems are very difficult to solve but make for easily verifiable solutions.
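
    To make the NP point concrete, here's a small sketch using subset sum (my choice of example, not something from the thread): checking a proposed answer is a single cheap pass, while the naive solver below tries every subset. The function names are made up for illustration, and brute force is only one possible solver; no polynomial-time algorithm is known for the general problem.

```python
from itertools import combinations
from typing import Optional, Sequence, Tuple


def verify(numbers: Sequence[int], target: int, candidate: Sequence[int]) -> bool:
    """Check a proposed answer (a collection of distinct indices) in linear time."""
    indices_ok = len(set(candidate)) == len(candidate) and all(
        0 <= i < len(numbers) for i in candidate
    )
    return indices_ok and sum(numbers[i] for i in candidate) == target


def solve(numbers: Sequence[int], target: int) -> Optional[Tuple[int, ...]]:
    """Brute-force search over all subsets: exponential in len(numbers)."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return indices
    return None


if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]
    answer = solve(numbers, 9)
    print(answer, answer is not None and verify(numbers, 9, answer))
```

    The reviewer only ever needs `verify`; whoever (or whatever) produced the answer paid for `solve`.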

The difficult part here is supposed to be the actual compilation to create the .wasm file? Or what am I missing here? The wheel is only a few hundred lines of code outside of the Python implementation, and it would seem that the MicroPython version of the project already demonstrates the necessary techniques for operating wasmtime.

  • Read the transcript if you want to see all of the details that make this hard: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

    • Thanks. I had a quick run-through and I'm not really that impressed, though I'll concede that I have an atypical perspective on these kinds of issues. HN comments don't seem like the right place for a detailed critique of Claude's work here, but I've added it to my blog roadmap.

      I will say that there are hardly any missteps in its chain of reasoning, but there are some odd approaches to problems and a fair bit of redundancy. Probably the most impressive part was spontaneously coming up with non-obvious issues to test, but this came with a fair handful of tests for obvious non-issues (like whether pip can extract a nested zip from a wheel without corrupting it).

I have to agree. I'm working on a complex technical proposal that's a bit too far outside my expertise (I intend to submit it to actual experts for a more thorough review). I've worked with Opus and Gemini to review it and work out all the problems and inconsistencies, and I thought it was in a pretty good state.

As an additional check, I just submitted it to Fable, and it eviscerated it. It found tons of inconsistencies, issues skimmed over or ignored, overly optimistic assumptions, and math that doesn't really add up if you look at it in context. And as far as I can tell, all of these issues are entirely valid. I now feel embarrassed I'd already sent it to a few people for review. This clearly needs more work.

> Clone simonw/micropython-wasm from GitHub and research how this could use a full Python as opposed to MicroPython

I might be missing something important but that doesn't seem to be an impressive task.

On a surface level, it sounds like the task requires gathering calls to MicroPython-specific libs, assessing which ones are not compatible with Python, and then determining how to replace the incompatible ones.
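
That first pass could be sketched roughly as below. This is only an illustration under assumptions: the `MICROPYTHON_ONLY` set is a small, incomplete sample of MicroPython-specific modules, `micropython_imports` is a hypothetical helper, and nothing here is taken from the micropython-wasm repository itself.

```python
import ast

# Assumption: an illustrative, incomplete list of MicroPython-only modules.
MICROPYTHON_ONLY = {"machine", "micropython", "uctypes", "network"}

def micropython_imports(source):
    """Return the sorted MicroPython-only top-level modules imported by source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]  # "machine.Pin" counts as "machine"
            if top in MICROPYTHON_ONLY:
                found.add(top)
    return sorted(found)

# Hypothetical input, not code from the project.
sample = "import machine\nimport json\nfrom micropython import const\n"
print(micropython_imports(sample))
```

A scan like this only covers imports; whether that is where the real difficulty lies is exactly what the replies dispute.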

From that first iteration, the rest would boil down to troubleshooting the issues missed on the first shot.

I would be extremely surprised if the likes of GPT-4.1 weren't already capable of handling that task.

So, beyond Claude Fable finishing a task, what exactly is the differentiating factor?

What can it do that Opus couldn’t?

  • Always hard to say for sure because I'm not sitting around running the exact same situations through both models in parallel to compare them.

    It feels like you can give it a big chunky problem and leave it alone and it gets it done, with fewer questions and fewer design decisions that I wouldn't have made.

    In reviewing its code I'm finding less to complain about than with Opus. But it's all vibes; if you want a more scientific comparison you'll have to look elsewhere.

  • I gave it a complete database migration of our app. Opus failed hard each time: untyped JSONB for some rows, no proper normalisation, falling back to asking me questions in between.

    Fable just did it: clean code, one timeout with a hanging bash script, and it fixed a couple of very old, very structural bugs in the codebase.

If it’s of interest, I’ve been working on https://github.com/HubSpot/boomslang

It has a full build of Python to WASM with a bunch of static libs built in already.

I will say I built this pre-Fable, and Opus actually pretty much nailed the first build of the interpreter to WASM. CPython has had secondary support for WASM as a target since like 3.9 or something, and it just pulled from that.

I’ve been meaning to write up a blog post about this sometime. Building this has been pretty interesting, including using Opus to run a full auto-research-like loop for days to hyper-optimize its performance.

I’m hoping to use Fable to power some even crazier WASM adventures tho.

I hate how the Instagram/TikTok/YouTube influencer cancer is getting into AI. With early access and all that.

It made sense for people doing proper and fair AI breakdowns waiting on an embargo, but now it's just slop I don't trust anymore.

How much does it cost? How much did those tasks you did cost?

But, but, how does the pelican look?!

  • See parallel thread: https://news.ycombinator.com/item?id=48464054

    • Given how badly some of the models do on somewhat similar problems, I'm sure the pelican is included in the training set now. A similar problem: given an airplane outline and implementation constraints, design a painting scheme (constraints being something like "it will be implemented using covering film, hence no gradients, no impossible cuts, not more than 2 colors on the engine cowl", etc.). Google Gemini is meh, but GPT models are just terrible. I don't have an Anthropic subscription at home, hence have not tested it.

> Here's the transcript

It's frustrating that superfluous tokens are burning up our quotas:

key insight, crucially this, real engineering deltas, net assessment, definitive picture, acid tests, real limits, sharp boundary, proper patch, real root cause, big progress, actually wrong, path finagling, the catch, root cause pinned, everything passes cleanly.

[flagged]

  • AI models decompose problems into tiny pieces that exist in their training data, so in a sense, you're correct.

    Though that's also what makes humans so good at solving problems as well, it turns out.

    Also, slight tangent, but I do find the "clanker" insult kind of funny. I feel like it counter-intuitively makes the models sound cooler than they are, if anything. I love clankin' shit.

    • The amount of computation a human needs to do the same tasks is many orders of magnitude less. And when humans learn these things, they usually remember how, and are able to extrapolate that knowledge into new and fresh problem spaces. That is how the first person to run CPython in WASM did it, and that is why the plagiarism machine can now do the same (only a thousand times more lame and uninspiring).

      Next time you get a new, fresh and inspiring idea, and you spend hours solving a unique problem nobody has ever solved before, you can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.

    • On one hand, "clanker" has good steampunk vibes.

      On the other hand: "Stop trying to make 'clanker' happen! It's not going to happen!"

      "AI slop" caught on but "clanker" did not.

  • If you've got a real argument to make, by all means, make it. Your anger does not magically "make it so".

    • It's still a vote, and votes don't require reasons, and shouldn't be dismissed out of hand. There's a growing chorus of those who are fed up with rules for thee but not for me.
