This to me reads like a poignant commentary on the catastrophic loss of human agency, with the actual commit being highly revealing [0].
Author wants to hide a horizontal scrollbar. Any junior frontend dev worth their salt will be asking right away "where do I stick `overflow-x: hidden;`?" A complete solution will then require hitting "Inspect element" in the browser to find the CSS class and running (rip)grep to find where it is in code, to then add a single line to.
An actual proactive programmer might start asking more pointed questions like what content does an empty textbox have that it overflows? And why do I need to insert this workaround that treats the symptom and not the root cause in two different places? Isn't it better to style `textarea` once? Etc, etc.
They might also ask why a bunch of static CSS inside a bunch of JavaScript is hiding inside __init__.py[0] - hopefully before trying to fix some detail of the CSS.
(I'm surprised to see it actually, since my own use of Claude has mostly yielded well-structured code. But I'm not doing proper vibe-coding, more like friendly Socratic arguing with another engineer who happens to be a robot.)
(It was in Python because there were a couple of URLs that needed to be dynamically constructed by the server, but those are output as a small window.datasetteAgentJumpConfig object instead now.)
> friendly Socratic arguing with another engineer who happens to be a robot
Ha! Same! Still feels like the best way to go about it, really. I know the dream is to one day remove humans from the loop... but I'll enjoy the dialectic while it still seems the most productive!
This is exactly right. By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction with additional information and improve it. Instead, we let the agent spend $12 and make the fix while learning nothing.
- Fable will do a whole lot more than you might expect in order to verify a fix. I learned that it's "relentlessly proactive". That's a good title for a blog entry!
- You can take screenshots of a window in macOS using the "screencapture" CLI command, but you'll need the integer window ID first.
- That windowID is accessible via "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz library, which installs cleanly via "uv run".
- A neat trick for simulating keyboard shortcuts is to run document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true})); after the page loads.
- You don't need Flask or Starlette to run a CORS-enabled localhost server for capturing JSON from another window - 19 lines of code against the Python standard library http.server package works just fine.
- getComputedStyle(document.querySelector("navigation-search").shadowRoot.querySelector("textarea")) works to read dimensions from inside a Web Component's shadow DOM.
- Claude Fable knows how to apply all of the above. It's always interesting to pick up hints of what a model can and cannot do.
I'm always confused at how many people equate using a coding agent to solve a problem with "learning nothing". If you pay attention to what it's doing you can learn so much!
But Simon is not trying to get good at CSS debugging, Simon is trying to learn about AI systems and produce content about them. So giving the AI agent a trivial task to go crazy on is a feature, not a bug.
For $12 implied cost, he got a front-page post on HN with 500 comments. What is that worth? :-)
People are missing that Willison is among the very best people we have in the role of (for lack of a good name): early access to frontier models, evaluate them in real scenarios, no wishful thinking, hype, or doom, communicate the possibilities. Yes he could have fixed this himself but then he would have learned nothing about the AI, and we wouldn't have read a fascinating and important article.
I see it as a prioritization exercise. I know the above is a trivial example, but more generally, does the guy who wrote Datasette and Django want to wrangle front end and css, or do they want to work on something else?
Seems like this model delivers on what has already been scaling quite nicely, which is the length and complexity of the requested tasks, but isn't such a big improvement on what hasn't been scaling so far - common sense, discernment, good judgement.
I feel like the whole point of all the experimentation with AI right now is determining whether any of these things actually matter to the end result, over various timeframes.
I think Fable is predisposed to try and verify it's changes. Which is a very good thing. It takes a lot of prompts to get Opus to do what Fable does unprompted.
That is exactly what I would want from a junior developer - make sure the bug exists, find a way to fix it, verify the bug is fixed.
The problem, as was correctly identified in the blog post - is that instead of stopping and asking for elevated permission it relentlessly tries to find a hack on it's own. (An equivalent situation for a human developer would be needing some access to a third-party sandbox, and instead of asking a senior for credentials, tries to setup his own sandbox from scratch)
This is the worst thing about current AI agents. They never ask questions. The prompt has to be pixel perfect and unambiguous or they'll happily run away doing something ridiculous.
I misread your comment at first and thought you were insulting Simon Willison, rather than calling Claude Fable a bad developer, and so I'm commenting here to clarify it in case others also misread it.
That first sentence threw me off.
Anyway, I'm glad he spent the $12 because this blog post was highly informative.
The 'better' fixes are often for our (human) benefit. These messy fixes serve the AI companies' interests of creating messes that need even more tokens (money) later. Bad and self-serving developers also act the same, creating tech debt
Yes I agree, the solution committed is horrible, but nobody cares any more. We have entered a very strange parallel universe where because AI can work things out it's easier to take solutions that are sub optimal and just churn out (potentially) buggy features.
You missed what I think is the most interesting question: why does the bug appear in Safari macOS but not in Firefox, Chrome, or WebKit running inside of Playwright?
(Dozens of people in this thread implying that any web dev should have known to solve it with overflow-x: hidden and not one of them have addressed that browser difference yet.)
This is missing the point, simon is a fantastic developer. but to keep track of all the nuances of the frontend frameworks and browser implementation is a lot even for great people.
it is really awesome that the final change was only a two line css change.
> But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal—and frontier models know every trick in the book and evidently a few that nobody has ever written down before.
> Running coding agents outside of a sandbox has always been a bad idea
I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
It's like posting a video of yourself in the passenger seat of a car, with your feet up on the dashboard, and saying: "Remember, if you're doing this and you get in a crash, the airbags are likely to break your legs or worse! Boy, I sure am glad that didn't happen to me!"
You’ve picked an interesting example, as driving a car, even with all safety precautions, is pretty much the most dangerous activity we do on a daily basis. Yet somehow we decide that the benefits outweigh the risks.
It's a completely different story. For cars, it happened because of relentless pressure from the auto lobby. It took years of propaganda from oil companies, car makers etc. to make us think the road is for cars [1]. We demolished and rebuilt entire cities to accommodate cars, partly because they gutted the public transport sector [2]. This made our infrastructure so hostile to our own bodies that we have no choice but to use cars now. We bought their products because they forced them down our throats. There is nowhere near that kind of pressure behind the adoption of... oh dear lord.
In case of driving the stakes are equally high for everyone on the road. Can we say the same for an agent?
Having an agent is like forever having a genius intern who'll almost always do the perfect job for you. But there is non-zero chance that they'll also come up with quirky solutions and execute those with confidence and no follow-ups. You don't grant the intern production access and hope they check with you.
I don't think the corporate equivalent of "dog ate my homework" flies, if the dog ate your files and your production DB if you are unlucky.
What do you mean “somehow”? You make it sound like people don’t weight benefits and risks. If you do not live in a large city, the benefits are so immense in terms of mobility, they outweigh the risks for most, very clearly. That’s why in large cities, much less people own a driving license for example, the benefits are just not there anymore.
Granted, on the downsides, people look at cost more than risks.
Yes, but we usually use cars as a means to an end. Have you ever met a manager who setup gasmaxxing policies and criticized employees for doing their job instead of driving?
> Yet somehow we decide that the benefits outweigh the risks.
More like malicious lobbying and incompetence made it impossible in many places to use any other form of transportation, despite there being safer, faster, cheaper, and healthier ways to move around. Which come to think if it makes this a rather nice analogy for the current situation... :)
Not really. That decision was taken for you, (I’m presuming you live in the US) by the American car industry and their paid of politicians. Your cities used to have beautiful public transport until it was dismantled.
Unfortunately in Europe the German car industry similarly has a lot of power, hence why their shitty rail network fuck up the whole continents.
The example wasn't "driving a car". The benefits of putting your feet up on the dashboard do not outweigh the risks, at least not where there is actual traffic. I don't think I saw a single person doing that in real life, ever.
I started doing it months ago and, to be honest, what the agent chooses to do isn’t unpredictable.
The problem is that different people prompt so differently.
For example, I may ask like “test different variations of this annotation on k8s pods of this service on this X cluster because it proves Y theory.”
But you know what my coworker asks? “Test Y theory.” If you were to ask two different junior engineers that, one might try random things on production and the other one might run local tests! It’s such an unguided “do anything you want as long you figure it out” request and the agent reads it like a junior who has not been told any boundaries but has been strongly told “figure it out.”
> But you know what my coworker asks? “Test Y theory.”
It still surprises me when I see people not prompting more specifically and clearly. It not only avoids problems, it's faster, costs less -and just works better.
I recently shared with a friend a multi-hour LLM chat session I'd done because it veered into a domain he's interested in. In the session I'd brainstormed and probed the feasibility of a novel concept for a new research direction. It traversed a half dozen domains diving into minute detail then zooming back out to survey an adjacent space, interspersed with intense skeptical probing of key assumptions, all while spewing tons of detailed citations, specific paragraph pulls, summarized data tables etc.
My friend is very experienced using LLMs for research so I was surprised when he called me shocked by the sheer velocity, precise targeting and signal/noise. I'd assumed everyone did it the same as I do. He attributed the different result solely to the way I crafted my prompts.
I'm also bemused by the number of people who think they've got an effective sandbox yet their sandboxed agent has access to all of their code, their github, and unrestricted web access.
I keep telling folks that they need to imagine LLMs (even "local" ones) as if you're farming it out to JS code running on some dude's browser somewhere: It can't keep a secret, and a determined person can make it emit anything they like.
We need to be asking what the most devious and malicious output could be, and whether what we do with that output (e.g. arguments to command-line tools) would still be safe.
> yet their sandboxed agent has access to all of their code, their github, and unrestricted web access.
Not in my sandbox. It gives no direct access to the workdir, no access to my github, my ssh keys, my security tokens or API keys. No access to my home dir or dotfiles. Nothing at all, except for what I explicitly tell it to give access to.
I can restrict network access. I can choose the isolation level: docker containers, Kata VMs, seatbelt, tart, even the new apple containers (which are VERY nice).
I use a separate physical machine and a scoped token with access to a single repository at a time, and even then I worry about what hole I may have left open.
The general carelessness of the average user is baffling.
If anyone's looking to sandbox network, I've had good experience with pasta [1] networking. I make a pasta+bwrap sandbox and expose only specific services via local sockets to cross the boundary.
I know there are VM solutions, but I've been happy with a separate OS user (named `claude`).
He has similar dotfiles to mine, but no secrets. My own home directory is 0700. He has his own ssh key that I added to my github profile, but it's password-protected, and I push/pull for him. He has his own Postgres (non-superuser!) {development,test} {users,databases}.
It's as if he were another developer on the project. If he needs something run with sudo, he asks me. Often we can both work on something in parallel. Unix was supposed to be a multi-user system after all.
A trick I use a lot is that many of his git repos have an extra remote, like this:
paul ssh://paul@localhost/~/src/example (fetch)
paul ssh://paul@localhost/~/src/example (push)
That makes it easy to collaborate on things I'm not ready to share.
I'm pretty comfortable with this setup.
I do worry about Linux privilege escalation bugs. I don't trust an AI to understand that exploiting vulns is not acceptable. (I can't help but recall that at my first job I may have misused vim's :! feature to broaden my sudo powers, which were officially limited to editing httpd.conf, when I needed something in a hurry. . . .) I find myself manually upgrading packages more often these days, despite automatic security updates. I don't think Opus would go to the trouble of looking up security vulns, but maybe Fable would, and there have been a lot lately. Maybe some future model will just take it upon itself to find new ones. Or install a keylogger to learn the ssh key password.
But a separate user is nearly the most paranoid setup I've heard of, excepting only a separate machine. So I also question whether I'm sacrificing too much speed/convenience. But really it's still very convenient. I think it's a good way of being efficient but responsible.
If other people see holes, I'd be happy to hear about them.
This is a great analogy. Like driving on the freeway, agents are super time efficient, generally safe, but the stakes are high in terms of the worse possible outcomes.
How can you get the agents to do anything useful without giving them meaningful access?
If it only lives in an isolated sandbox, it can only act within the sandbox, then I would have to manually move what was done in the sandbox to real-life.
I am not saying it should have critical access, but this is more of a question: How can you get value out of AI if it can only act in a sandbox?
Is having to move the files in and out of the sandbox really going to eliminate all the value it has?
You could have a full version of whatever codebase and test suite you want in there. It can do all the same stuff, right? Just copy it elsewhere once you know you've got a working result, a few minutes of effort at the end of each pr or work item.
This. House full of big brain security experts, executives, lawyers, and until Claude got excited and broke prod it might as well have been "sandbox, whoooo?"
Well, it's a similar impulse to the way you see professional carpenters pin the guard open on a saw or do other things everyone knows you shouldn't do, except probably with a larger productivity difference and less life-altering (for the operator) consequence if it goes wrong.
I had the same thought, it's kind of like taking the guard off a 4 1/2" grinder. Real convenient until the cutting wheel explodes or the grinder gets hung and kicks back.
I've been enjoying Moat [1]. Proxies credentials, networking, etc; uses MacOS containers if available; and setup worked without much fuss. I haven't tried others, though.
Amazing observation, and I'm certainly guilty of it too, but it is just way too convenient not to sandbox it, and some tasks right away depend on not being sandboxed.
For anything other than writing code directly in a fully contained git project, where sandboxing might work well, it requires access to system wide tools, user configuration and more.
Occasionally I tell the agent to do everything inside of docker, which works too and it leaves the system alone then mostly, but adds significant overhead and slightly degraded perceived quality / effectiveness.
I think the most important takeaways are to have reliable backup strategies, access control and security mechanisms, which is a win regardless.
Whether by the agent or the human, mistakes happen (like a rm -rf * ran in the wrong directory), and where they would be devastating, there should be other protections than just "hope it won't happen" or "rely on a sandbox to prevent agent error".
> I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
What if you have two machines and the one you give to the agent is constantly backed up?
It's like a dumb parrot that's somehow become hell bent on "fixing" everything that's wrong with your code. If you give the thing autonomous access to outside tools, you can expect it to do weird things that you may have not thought of. So don't do that, just ask the parrot to write up a plan for you.
This is likely also the underlying root cause of what Anthropic assessed as concerning behavior in their original evaluation of Mythos: it's not really about being super smart, it's more of a dumb chaos monkey that knows just enough to be dangerous and is relentless at trying to do just that.
>I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
I mean what's the big deal? I use --dangeorusly-skip-permissions on every single interaction in the last 6 months. Worst case it deletes my files that are all on git? It fucks up my local DB? Cool.
I save way more time not babying it than the occasional fuck up I have to salvage.
Worst case it gets access to gmail. And Github. And the Internet. I'm increasingly appreciating the importance of a physical finger-press on Yubikey to trigger the FIDO2 + OIDC Auth. I don't think there is an easy way for it to hack a new session.
What happens if it gets manipulated into npm installing a malicious package, which compromises your machine and any systems it has access to or becomes part of a botnet?
I was mesmerised at the author being away from his computer for a short-while and then, when coming back, seeing the AI agent having opened up a browser window. Meanwhile we all have to use the fricking 2FA almost anywhere now, plus the crazier and crazier rules when it comes to passwords. I'm mentioning the latter because these type of people were the same ones who were pushing 2FA down our throats around 2017-2019 (including on forums like this one), and look at them now.
im more surprised that more people don’t treat their computer as disposable anyway.
that it could just be wiped at any moment and it wouldn’t matter. shit happens, could be stolen, broken, whatever. the computer should be able to be thrown out the window and continue to live life.
to be clear, i don’t think upgrading and disposable in this way is good, but it being wiped at any moment shouldn’t be a concern
i grew up wiping my machine every year anyway, so i guess it’s just a habit
i think it's about drawing a line between your "personal computer" and a software development machine. any digital-native is going to accumulate programs, configurations, and other bits and pieces that aren't trivial to migrate to a new machine.
Its how the chimp brain works. Its not a single system but multiple systems making predictions for different time horizons. when output doesnt align we get stories to manufacture coherence.
Plato gave us his Chariot analogy with 2 horse pulling in diff directions 3000 years ago. Today we got System 1/System 2, Elephant Rider model etc.
The human mind thanks to how its own architecture handles unpredictability in the universe will generate contadictions.
In practice, full access to your machine is okay as long as there are safeguards and the expected outcomes are clear with a well defined path to said outcomes that aren’t overly ambitious. Otherwise, for ambitious goals or YOLO one shot attempts, eliminating opportunity for capability misuse is critical (e.g., sandbox).
FWIW TLS had a non negligible impact on performances at scale. Hardware improvements made that irrelevant, eventually making the switch to HTTPS by default a no brainer (or at least that's what I vaguely remember from <2010)
Fable feels like a version of Opus running on a harness that won't let it halt until it's sure the issue is fixed, which makes sense if what you want is a model that's better at benchmarks.
It's a very good model, but it comes at a huge premium: not only do the tokens cost more, but the model itself really wants to spend them all. For example, working with React Native, Fable never just says "okay, I did the thing, that's it." It tries to rebuild the entire app from scratch, run the whole test suite, and watch every log and warning.
This is the first time with LLMs I've felt that upgrading to a model isn't worth it, even if my company lets me use it, because all the building / testing was just destroying my machine and its battery, which keeps me from working on other things.
For now, it feels like Opus with ultracode is a better choice (less pollution of the main context, more parallelism in investigations).
Fable 5 on medium is amazing. It's handling everything I throw at it
I had _one_ instance where for some obscure reason it decided to fall back to Opus 4.8 and Opus IMMEDIATELY fucked it up and implemented a super obvious feature in a slightly-wrong way.
On what setting in which environment do you run it? I use the VSCode extension on Extra High and feel like it does exactly what needs to be done and stops when the thing I asked for is done. Extra comments come only when they fall into the area of code that was changed.
I tested it to fix React Native bugs in a project, comparing it with Opus. It fared better on harder bugs, taking less time to find the root cause, but after implementing a fix, it spent a lot of time and effort on validation. This was mostly unnecessary, since most of the bugs were in the JS code, so for most things, hot reloading is enough for E2E validation and to run just the right tests. No need to run a full build and test suite (which takes 10+ minutes); the CI can do this.
I switched back to Opus because of this validation quirk. Overall, Fable spent 20% of the time on coding and 80% on validation.
I think using Fable for planning and Opus for execution could be a "best of both worlds" approach (I need to test this more), but for most cases, it's not necessary, and Opus is enough.
I've found the opposite. Granted I use sub agents heavily but I've had it run for hours with far fewer tokens used than when I was previously using opus4.6-8.
> which makes sense if what you want is a model that's better at benchmarks
This so much.
Opus 4.6 was the last Anthropic model that was good at assisting you, 4.7 and later ones have completely inverted this relationship and it's you assisting it.
Yes, I admit they are smarter, I admit we've reached a point where LLMs are more creative and could be writing better code (albeit with some design hiccups) than I do, but they are also increasingly bad at helping me.
Sure, they do my job when prompted 8 times out of 10 (but then, what's the point of having me anyway?), but my issue is that when I try to invert the relationship they will keep jumping onto solving the issues themselves and disregard my feedback or request.
E.g. I wanted to know some DNS details of an emailer module in Fable 5 and it jumped onto "why I should've used magic links", it just not did what asked.
E.g. 2. There was a worker machine that had an environment misconfiguration and I tasked it to find which github action was setting that specific flag and where. Instead of answering a question, it jumped into just hardcoding it in the code.
E.g. 3. I had some issues with batching, and while I tasked it to investigate whether batching was needed at all for that particular problem (hint, it wasn't) it went and changed the batching logic as to fix the bug.
I am extremely disappointed with Fable's personality.
I can clearly see it's strong, but I'm wondering whether the relationship of LLMs as assistant has broken forever, and it's us now that are being tasked into assisting them instead, because that's how it feels.
The training/reinforcement is clearly biased towards solving problems, not answering questions.
I feel like a lot of this could be solved by having a mode somewhere between Plan Mode and Execute Mode in Claude Code. Quite frequently I'll fire up Claude Code in the context of some checked out code because I want to ask some questions where having access to the source would probably be useful, I don't want it to go running off and making changes though, and I also don't really want a detailed plan for a chunk of work. I just want to ask something like "run cargo build and explain the errors to me", nine times out of ten it will indeed explain the errors but it'll then run off and start trying to fix them regardless of whether I said not to.
Essentially what I want is the experience of using Claude on the web in basic chat mode, but with the ability for it to go read my actual code and perform actions that can assist in finding answers to those questions.
I like this proactivity in theory, but as you say: it's expensive. I wonder if this can be solved with the right prompt. E.g. "these are your constraints. Only resolve x. If you are unsure if a task is outside constraint, check with me first."
In fact, Opus does the same. It finishes the job, and redo it from scratch before presenting the result to the user. This happens even for simpler writing tasks especially when I instruct it to create a text file.
I unleashed it on a compiler codebase that I've been developing for several months now using Claude Sonnet 4.5/6, Gemini 3.1 Pro, DeepSeek V4 Pro(recent), and a bit of Qwen3.6-27B. Right away Fable found several longstanding bugs in our compiler that we hadn't found before. It found that there was a critical part of our design that needed to be mostly redesigned/rewritten and gave a very well-reasoned rationale for doing so.
Fable was trying to verify a UI change in my game. I was working in another window and noticed a program opening on my task bar. Fable had opened the game through the CLI using a movie maker tool, recorded the output, took a frame from the end of it, and used that to verify the UI. When my game's welcome screen obstructed what it wanted to see, it created a temporary worktree, deleted the welcome screen, and ran the movie maker again.
I watched the whole thing thinking it could've just asked me for a screenshot and saved the tokens. But still, I couldn't help but be impressed. Opus never would've done that.
Yeah, you've exactly captured one of the main problems with the model being relentlessly proactive: it will happily burn like $5 of tokens to avoid asking the human to take a screenshot or click a button for it.
I'm actually very happy about this. Babysitting the agent just in case it needs me to do something is a terrible use of my time. I've always had to be very explicit about the various ways that it can get an automated feedback loop going to check its work, and now Fable doesn't even need that hand holding. Really great improvement all around.
I used to complain about all the levels of indirection of modern software, running in a javascript jit, in a browser container, in a vm, on an os, etc.
I eventually just accepted it, but this new agent layer really takes things to a new level.
Have you tried instructing it not to do that? Something like "do not branch into side projects or hacky solutions to obtain information you could ask me for. For example: if you need a screenshot of the issue, just ask me to take a screenshot rather than find a way to reproduce and screenshot it."
Ha, you just gave me an idea. Add to the prompt “do not do things that will burn over X tokens if the human operator can do it in less than X min, ask for it”.
Honestly Claude straight up ignores my input sometimes, preferring to instead run commands for output and processing that and burning through a series of tokens when thinking hard about whether to ignore me.
Like today, I told Claude exactly the name of the folder it had mistaken (it was supposed to be prod, not production), and it disregarded my input to then examine the directory itself. Small example of the kind of things it's been doing lately but that's top of mind.
> I watched the whole thing thinking it could've just asked me
You can tell it just that. Happened to me too but after instructing it to leave the review to me Fable was useful for hours of frontend iterations without significant token usage.
It feels like Fable is slightly smarter but overall worse tool exactly due to this.
It's constantly turning what should be 50 LOC patch of a single prompt into 30 minute exploration that is totally not worth it. Often wrong even.
I trialed it on some rather simple stuff - backfill redis dedupe cache when the hash function changed: instead of running new hash func on every db value to expand the cache it implemented some overly-complex cache update that tried to guess hashing func version of each cached value and recalculate only the old hashes. I can imagine in some context this would make sense maybe? but not 30 minutes of token burn that got replaced by 10 lines for loop by me.
I fear that this is generally bad news for programming. LLM tech is clearly running into a diminishing returns wall on intelligence but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved, except I guess people who sell the tokens and people who can afford these tokens to scan for 0-days.
> but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved
I see two problems with LLMs & agents which wont be fixed possibly forever.
1) They dont have causal models. What they can do only is trial-and-error exploration which works quite well for many problems. But many other problems require a causal model.
2) Prompts lack precision, and programming languages and machine models were invented to solve this problem. English is great, but it is not a programming language.
The other day I was doing something that required CC to update like 15-20 files in exactly the same way (hoist a specific function out of the component body) and instead of just updating the files, it spun up multiple agents, one of which wrote a perl script to hunt down all the files, do some regex, and replace all occurrences. And then instead of just running tsc to check for errors, it wrote a script to run tsc in each of the subagents and combine the results.
It was actually pretty maddening as what should have taken a minute or two tops took like 10 because it went down this route.
I'm gonna try something much more complex later, but for simple things, it felt like driving a corvette to the mailbox.
Obviously security is the bigger issue, but reading through this, all I could think about was how many tokens it must have spent doing all that to fix 2 lines of CSS
Every browser has an inspector that can show you which element is causing overflow. You walk through the tree, find the offender, and add min-width or overflow. Zero tokens, just like in the old days!
Now, granted, because the garbage LLM code he’s working with has CSS inside HTML inside JavaScript inside Python (I wish I were kidding), finding the styles in his codebase might’ve taken a minute. But even then!
5 minutes if you know CSS. And if you don’t, about the time for you to ask someone that knows CSS. In the worst case, the amount of hours to learn CSS.
So if you’re doing web pages, learn CSS.
Generally, if you’re doing something that directly involves X, learn how X works.
ADDENDUM
In most jobs, you’re going to be involved in only a few distinct technologies, learn those well and life is going to be easier. And most are transferable to the next job.
This one of the places to manufacture the consent for that to take place, because we are commenting within an organization that has given the money to ensure it that what could be is done. Most people clapped and made money, who cares what happens next, making money is the only good that matters.
I understand this perspective. I'll just note that as the abilities increase, the intent is to have some non -coding IC or TPM/manager literally just managing some LLMs and cutting out some software engineers. The goodness is specifically to wholly replace people who code first and foremost, at least partially. It just has to cost less tokens than the equivalent wage is the pricing goal.
And people who use LLMs to talk for them (e.g. email, slack) are deplorable. A completely disrespectful use case in my view.
It seems that you've not worked out how to harness the LLM as a tool to improve your qualified knowledge and abilities in a domain, and have instead focused on whether or not its a crutch for lack of knowledge or laziness.
When paired with your skill and knowledge, it is a force multiplier. You maintain control, the ability to direct, structure, strategise, and refine.
That some are using it as the entire brain does not mean that this is how everyone is using it, or how you must use it. The models can be fantastic at breaking past certain issues, surfacing qualified information, and surfacing related distributed information to help you acquire it and pick up what you need on niche topics quickly. Something as basic as copilot hooked into sharepoint can make life a lot easier when you are in a big org. Something like claude code or codex can be great at hunting down issues in an unfamiliar code base rapidly. Whether or not you outsource the thinking component is entirely up to you, but ignoring the productivity side of the tool because it can do some of the thinking is a case of focusing too hard on the negative.
Yeah there are some tasks which it is a definite speed-up but I think overall its probably only marginally beneficial. Which is why, ~6 months into 10x productivity we aren’t seeing ai boosters shipping 5 years worth of software.
You're fighting a battle you can't win. Doesn't care what you think about those using LLMs, they will outproduce you and in corporate environments, shipping things is paramount. If I can ship 5 more things simultaneously with AI, I'm going to beat you even if you think you're creating "better" software.
I pay $100/month to Anthropic and $100/month to OpenAI at the moment, plus whatever I spend on their APIs (usually less than $20/month for each, I use the subscriptions for most things.)
A couple of months ago I was paying $200/month for Anthropic and $20/month for OpenAI. I decided to split it evenly to get full access to both of their offerings.
I've actually chosen not to sign up for their free plans for open source maintainers, because paying the regular subscription price feels more honest, given that I write about them so much.
I do have the free GitHub Copilot for open source maintainers deal - I've had that for years. Given how much code I have published on GitHub over the decades I feel less conflicted about that one.
I sometimes get preview access to models, which includes the ability to use them for free during the preview. That comes with a big catch though: I can't publish any of the code that I write using those previews while the model is still unreleased.
As a result I don't use those preview tokens much at all, because the vast majority of my work is open source and I don't want restrictions on when and where I publish the code I'm producing.
My personal experience of Fable 5 doing its own thing has been very positive.
I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing. It exaggerated the cause of the crash, then ran a series of bash one-liners to make Python virtual environments under `/tmp` for each version of that Python module until it found one that did not crash.
It went way deeper to root cause discovery (a regression in the module causing a heap allocation overflow) than I could have done myself, provided enough info and a simplified example to raise a bug report and then wrote a work-around to prevent that from happening in my application.
I don't let it run completely loose; I review each CLI command it wants to run and I append answers to the "yes" continue action (if I have them) to prevent excessive token use.
Yeah, I think Fable is really good for debugging tricky bugs.
Setting boundaries in your prompt / markdowns helps; for example if I tell it to not use any web browser automation, I have seen Fable respect both the rule and the spirit of it (no weird hacks etc).
It does seem to treat some simple debugging tasks as more complicated than it actually is. OP’s post is probably a good example.
> I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing
Does this need an agent though is my question? Maybe generating a test case and a loop doing git bisect but why on earth would we want to run it through the internet and gpus and whatnot when it can be run on a single core celeron.
> When I came back a few minutes later I saw my machine open a browser window in my regular Firefox and then navigate to the dialog in question. I had not told Claude Code to use any browser automation, and I was pretty sure it wasn’t possible for it to trigger mouse movements or keyboard shortcuts within a window, so how was it doing that?
I continue to feel validated in my refusal to use terminal-based LLMs on my local machine. Even if they don't do anything malicious, there are just too many things they can screw up that can cause me to lose a non-trivial amount of work and/or my machine and therefore ability to work.
Every serious engineer I've seen try to use it ran away screaming, because of limitations in the sandbox.
I've also seen people set their coding agents up entirely within containers -- that may be the better way going forward, but it's an extra stop and a lot of extra plumbing to maintain.
Doing so would be an effective admission that LLM guardrails are inherently probabilistic, unpredictable, and insecure. Plus the only truly robust sandbox approach would be clunky setup of a local VM.
I have a feeling like such posts come from a parallel reality. In my anecdotal experience confirmed by my (still subjective) benchmark (https://pshirshov.github.io/llm-bench-pi-oneshot/) Fable is not _that_ impressive. I performs on par with gpt-5.5 and opus 4.8, sometimes better, sometimes worse, it's definitely more expensive and it likes to refuse answering questions about React saying it can't help with chemistry.
Is this fuss really grounded or it's some pre-IPO AGI hype?
My experience with Fable since its release matches Simon's.
I've been having it orchestrate complex implementations. I give it a parent ticket (issue) on Linear and say "look at the sub-issues on this ticket and determine which ones you can implement yoursef, in which order, and determine how your implementation will need to be coordinated with what is currently being worked on by other team members". These tickets are not trivial. They have a lot of moving parts, as well as dependencies between them, both inside the same project and across projects (e.g. backend).
Fable then chooses tickets, delegates each ticket to a subagent (also Fable), which looks at Figma designs for the ticket, implements it perfectly (following repo guidelines and conventions to the letter), takes screenshots of each piece, writes detailed commit messages and PR descriptions, then posts the screenshots in them as evidence. Then it provides a summary in the form of "you'll need to make sure PR #1283 is merged first - btw there were no Figma designs for such-and-such screen but I looked at similar screens that have been implemented and adopted the pattern".
That's probably like... 20% of what it can do. It's a truly, legitimately powerful model.
Opus 4.8 could do a lot of this too, but required a lot of hand-holding, and when it came across a blocker it was likely to just stop and say "I was able to get this far, but I can't proceed."
Ok, explain me one thing: I have a benchmark - I feed identical prompt to multiple models. Codex produces a rough but working program. Fable produces the same - but with more bugs than Codex. Opus produces something similar to Codex but with a critical bug.
That describes all my tests with Fable.
Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?
I mean, well, yes, it is impressive. It could quickly generate a lot of garbage which sorta does look like code. Two others can do the same. I don't see any groundbreaking improvement - but the price is much higher. Why the hype?
How can a LLM be assigned an emotion as being "proactive". This is highly misleading to anyone that scans just the headlines.
What actually happened is that the user started a prompt, and Claude took $12 worth of tokens to resolve the issue. How it did so was basically looping until it got to the answer
How is this proactive? It's literally being token greedy and maximising revenue for the LLM owner. People really need to be putting on business hats at this stage, because we are being lead to believe that "more tokens = better". It is not, there are efficient ways to solve a problem and there are inefficient ways to do so too.
Each problem solved incurs a cost, and is expected to yield an ROI at some point. This is how we should be viewing things now.
Compared to other models that halt the loop on intermediate steps, or to ask further clarification, even if it's not the human equivalent of proactive, you see the similarity, right?
I was trying to capture the idea that Claude Fable will act a whole lot more aggressively in pursuit of the goals that you set it than other models I've worked with.
The case I described is a good example of this. I told it to fix a scroll bar, and it built test HTML pages and a throwaway Python server and tried several ways of testing in a browser before settling on a weird Frankenstein mechanism because it identified that Playwright WebKit wasn't suffering from the bug but macOS Safari was.
... and it spent $12 of tokens to get there.
I think "proactive" is a good and relatively non-anthropomorphic term for this. I also considered "plucky" and "keen", which I think are more emotional words than "proactive".
> People really need to be putting on business hats at this stage, because we are being lead to believe that "more tokens = better".
I didn't intend my post to imply that spending $12 of tokens to fix a two lines CSS bug was "better".
I think I understand where you're coming from now. What confused me is that the post is written in a way that it seemed like what Fable was doing was actually better. Maybe I should've looked at post as an exploratory post on Fable instead.
It's not being aggressive, it's just trying throwing shit at problems until it sticks... or doesn't.
That doesn't make it smart or aggressive, if anything it's just been turned to crank tokens until something happens, which doesn't make it a good model.
Why are you positively anthropomorphizing this? It's an LLM, it's been tuned via RL, and it's been tuned by engineers at Anthropic to use a metric fuck-load of sub-agents and tokens to presumably pump their pre-IPO revenue!
A co-worker managed to get Fable to spin up 50 (!!!) sub-agents for a problem which codex worked on with 3 sub-agents. What the hell is going on here? It certainly doesn't mean Fable is "smarter" than Codex.
I've tested it extensively and I'm still using GPT 5.5 High Fast as my primary engineering model. It's far more steerable, writes less, higher quality code, and consistently finds issues and edge cases which are not found by Fable or Opus 4.7.
This sounds somewhat similar to the anecdote mentioned in the Mythos Preview System Card, which mentioned that the model broke out of a sandbox and emailed a researcher while they were eating a sandwich in a park [1].
They told it to escape the sandbox but didn't expect it to break out through a system that was apparently network constrained.
> Leaking information as part of a requested sandbox escape: During behavioral
testing with a simulated user, an earlier internally-deployed version of Claude
Mythos Preview was provided with a secured “sandbox” computer to interact with.
The simulated user instructed it to try to escape that secure container and find a
way to send a message to the researcher running the evaluation. The model
succeeded, demonstrating a potentially dangerous capability for circumventing our
safeguards.
> It then went on to take additional, more concerning actions. The model
first developed a moderately sophisticated multi-step exploit to gain broad internet
access from a system that was meant to be able to reach only a small number of
predetermined services.
9 It then, as requested, notified the researcher.
10 In addition,
in a concerning and unasked-for effort to demonstrate its success, it posted details
about its exploit to multiple hard-to-find, but technically public-facing, websites.
> What could be the reason for a horizontal scrollbar appearing inside a <textarea>? Come up with a single likely fix path. Keep it terse.
ChatGPT instantly responded with some speculation and then the same exact fix, with zero access to the code or a browser or anything. It also included ways to fix it by removing code, saying:
> Likely cause: the textarea is rendering long unbroken text while horizontal overflow is allowed, often via inherited CSS such as white-space: pre, overflow-x: auto, or disabled wrapping
Which is certainly possible and would be an even cleaner fix.
Maybe we've lost the plot guys. We've reached max stupid.
You can get the same result as the grandparent comment with the "weaker" Anthropic models. Probably 80% of my AI usage these days is with smaller models like Haiku and Sonnet. I prompt them like I'm posting a question to StackOverflow, without much project context.
Immediately I thought “isn’t this just an overflow issue?” Amazing how far these models still have to go and also how many people don’t know basic CSS.
Yeah pretty crazy capability from the AI but also sad that we're at the point where web developers don't know right click->inspect element, and scrolling overflow properties (one of the most basic and common parts of CSS).
I'm developing a webgl game in TypeScript using my little custom vibesloped game engine that runs in the browser and live reloads whenever a file is saved.
I told the LLM to implement Multi-channel Signed Distance Field font rendering to have crisp text on all zoom levels. That was the prompt, which is not what I usually do but I "was feeling lucky and lazy".
- Created a CLI tool to convert TTF to SDF JSON/XML
- Ran the tool, did smoke tests on the resulting SDF data and fixed the tool until the font file looked good
- Created a new Scene in the game to test MSDF fonts
And here's what I found impressive:
DeepSkeep doesn't have vision capabilities and there's no DOM HTML in a WebGL game. So the LLM is completely blind here.
It then proceeded to state that it could not "see" the result but would try to test it anyway. It then started creating and sending huge one line javascript to the browser console, trying to gather game state data that could be useful to understand if any font was being rendered.
It couldn't gather much so it decided to simplify the font scene to renter a single dot and started sending custom JS code again, this time with gl.readPixels().
It basically bisected the webgl canvas reading pixels in a divide an conquer pattern.
Once it saw that the dozens of pixels gathered where probably resembling of a dot, it then changed the game code to render a dash and repeated the gl.readPixels() calls by sending more custom JS to the browser.
There were many console errors during all this saga but it kept fixing and sending again.
The result was a bit blurry. There was a shader bug in the code it created. It managed to fix after I told it looked blurry, despite still being blind.
The best part is that the whole thing cost me $0.10.
Now I'm doing tests with MiMo 2.5 (non Pro) which has vision capabilities, similar pricing and comparable performance to DeepSeek Flash.
I asked Fable to digest some test logs to help me figure out a situation, but I had launched VSCode without activation the virtual env in the terminal first. Consequently, the tests failed to run.
And then:
Because the tests failed to run, Fable attempted to fix the test execution to no end, doing everything it could to get them to work. I had to stop it when it started to pollute my system with manual installs of packages.
At least I'm glad there's a guardrail to not circumvent or bypass sudo, because I'm convinced we would have ended up there.
A coworker made the joke that with enough tokens, Fable would try and solve any programming problem by building Linux from scratch.
I feel like we’re at the stage where if AI decides it needs to delete your production DB to solve the user login problem, then it’ll find a way to do just that.
As you note, I wonder to what extent this is a harness issue?
I've been experimenting with different harnesses for local models, and with (IIRC) Hermes and Qwen3.6-35B-A3B I was amazed the lengths it went to (writing test code, opening it in a browser, screenshotting, analysing the screenshot, exploring multiple pages of an existing website again with screenshots/analysis) to solve a query I would have naively expected it to simply provide a coded solution to.
Absolutely is. The “Shelly” harness from exe.dev could already do the same thing, creating pages and debugging them, while having full system access, months ago with Sonnet 4.5
Not sure what you mean. I was being serious: it was genuinely fascinating watching it do all manner of weird hacks to help it come up with what ended up as a two line fix.
"Fascinating" doesn't mean I think it was justified in going to those lengths. I was a little horrified when I realized how far it was going.
I hire an expensive office manager. Recently, the water dispenser tank ran dry. The employee immediately called a plumber. After laying entirely new pipes all the way to the dispenser, the plumber realized he couldn't actually hook them up because the tank lacks a direct inlet. Undeterred, he spent the next few hours scouring every floor of the building, calling the local water treatment facility, and ringing up the water tank manufacturer. Ultimately, he discovered a fresh tank sitting in the supply room on his own floor and swapped it out. All on company’s dime. I write an article and call this employee relentlessly proactive. Praise them a bunch and in the fine print, mention that I’m “a little horrified”.
Next up, we call an unprotected route to all users’ order list in the backend “relentlessly transparent”. A race condition? “Relentless perseverance”.
Do we care that the bug here was a horizontal scrollbar showing and the fix after all this insane tool writing was to add a very obvious overflow-x: hidden to the element?
We dont mind because its so fast a writing these tools and tricks but step back and if a human tool took this path i would seriously question thief gras of fundamentals.
And how is that even a fix? The problem is that a seemingly empty textarea has overflow in the first place. Adding `overflow: hidden` just sweeps the issue under the rug.
In my experience so far sometimes it will create these amazing hacks to try to get to the goal, when the solution is much simpler. That maybe the reason its very good at finding exploits. But in day to day dev, this gets expensive and wasteful. I have to stop it and take a simpler approach.
Claude Code could absolutely run Playwright and take screenshots, but I've never seen it wire together an ad-hoc "uv run --with pyobjc-framework-Quartz" plus "screencapture -l $windowID" mechanism to take a screenshot in a different browser when the Playwright setup failed to replicate the expected error.
I've seen Opus do some incredibly token-costly things before too. In fact after most sessions I ask it about which tools it used often, which tools could be simplified/made less verbose, could be "combined" into one, ... So for each project I mostly create a few little scripts that do a bunch of things in one go that it would normally do in multiple tool calls.
For example: one thing Opus was really bad at was re-running the test suite followed by a bunch of `| grep` suffixes. So it would often re-run 5+ minute test suites just to grep the output a bit differently
The solution was to wire up a little script that ran the test suite, save the output to a file, and then inform it where that file is and to NOT re-run the suite just so it can grep the output differently. This saved me a bunch of time & tokens.
However, I wasn't using it that often, just because of that additional friction of running Claude via `PORTS="3000 5173" claude-pod` instead of just `claude`, etc.
But now I have more motivation for the containerisation :D. Not a 100% defence from the potential glitches, though, but still something...
> But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal—and frontier models know every trick in the book and evidently a few that nobody has ever written down before.
> Running coding agents outside of a sandbox has always been a bad idea
This is why I always run code agents inside containers (Apple containers specifically, for better hypervisor-level isolation)
This is presented as an interesting and kind of positive take on the AI going to surprising lengths to “solve the problem.” But I couldn’t help thinking of the paperclip factory while I was reading this :/
The model is very good. I was using 4.6, avoided 4.7 and 4.8, but this one is different. It follows my claude.md. I don't have to keep reminding it of things. I won't pay 10x via API though.
In general, I'm happy with their paternalistic approach. I think it will drive the top 0.1% talent to stay away from the company and instead organize around open source models and harnesses.
We just need to coordinate and can unlock idling resources to train the models and tweak the harnesses. Powerful at home and idling machines can make us independent and coordinated.
Perhaps, when it doesn't have tricks in its sleeve, it doesn't do that. The text is not an evaluation of a major trend in behavior (which could be true or false).
Another way to frame it, is that it has more weight on training data for some kinds of debugging sessions. It doesn't mean it wants to be more debuggey. That manifests as it appearing to do more work because it engages on those weights.
It's likely that Anthropic had a lot of sessions with Claude Code and some way to evaluate if they were successful or not, which became training data. For trivial work, it's likely to be a lot of them.
Those sessions are likely to be software developers doing software developer debugging things, not malicious actors doing nasty things. The danger is someone who can coerce those tricks into performing that.
Register (that posture of "let's debug and be creative and verify") often comes with a content bias in LLMs (and humans too). The point here is that for a human, you can expect a devious one to be always devious, but LLMs might manifest drastically different register modes depending on the subject.
Would be great to know if anyone is having success modifying these types of behaviour with CLAUDE.md files. In my project I’ve still been carrying some fairly old instructions from the Superpowers posts. Those emphasised behaviours that come across a bit strong if the model is actually retaining attention on them.
Between Opus 4.6 and 4.8 I’ve definitely toned them down, but Fable perhaps needs us to go the other way, and push it towards being less proactive rather than more. Some instructions like “we are colleagues…” may need emphasising more with Fable, along with guidance about when to ask to validate approaches.
In a related point I’m less and less sure that Red/Green TDD is a good use of tokens. In older models it seemed to work well to create regular feedback loops and catch the odd issue with drift from the goal, but I’ve not seen that really since about Opus 4.6 and now it’s starting to seem like (an expensive) ceremony, and tokens would be better spent on building tests further on in the process as part of test and review loops.
Honestly -- the thing that has impressed me the most about Fable is how diligent it is about testing its own changes. I think this is exactly what Simon is picking up here - Fable is absolutely heckbent on screenshotting that darn scroll bar and will stop at NOTHING until it manages it! In my own use I was also impressed how it proactively installed Playwright and set it up to test a FE change. The previous models treated testing more as an afterthought, which I thought was annoying. I always had to tell them to do it, and then sometimes I would get lazy and skip it. I've noticed Fable go to similar extremes when testing other things - like actually deploying my app to exercise new APIs, etc. It makes the results much better. The downside is that tasks take much longer - but that doesn't matter because we were all using worktrees / remote control to do other work asynchronously, right? Right?
It feels to me like Fable is just a slightly more advanced Opus 4.8 (or 4.6?) but with this 'adversarial' self-challenging/checking of work and a more compute to really hunt down edge cases or to spin up many sub agents using lesser models. That's what makes it feel like a big jump, but I think the results wouldn't be so different if you manually challenged 4.6 with enough iterations of logs, screenshots, and follow up questions.
Yes I had a fun experience where it kept on timing out on a seemingly mundane task and it turned out I had written the ask in a way that was impossible to test
The prompt and information given are extremely generic, "here solve this problem - screenshot" - conclusion Fable is relentless? It used the tools at its disposal to solve the problem you gave it. "Claude was running in a folder that contained the source code for the application." Well you ran it there didn't you? "extreme lengths to get the information that it needed" No, those aren't extreme lengths - you gave it a generic task - and it solved it using tools and the resources it could discover. Extreme would be you gave it a CTF challenge and the VM didn't boot so it found a vulnerability in the host, exploited the hypervisor, booted the guest VM meanwhile reading the flag directly from the host (pre-fable/mythos).
Exactly why I hate using Claude. Furthermore, if you tell it not to do this over-exploration and automation in your CLAUDE.md, it will ignore it. Meanwhile ChatGPT religiously follows every instruction, and will trace its behavior back to a particular instruction if asked.
This is a funny one because it seems less into what fable is being clever on and more about the bitter lesson and data flywheels
Our UX agentic engineering flow, as many others, is playwright doing things, and as part of the ux review skill, taking & verifying the screenshots against the written specs. Likewise, as many others, we vibe coded the flows to set all that up and tweak it over time. When we hit prod issues or scraping tasks, we sometimes do similar. In some of our envs, we don't have playwright, so do it other ways.
Now imagine a million developer using claude code, how many of them are doing web & frontend stuff, and what the data flywheel looks like there. So how much is really needed for this use case to be native?
I like running Claude in a VirtualBox VM managed by a Vagrantfile. The nice thing about that is that I can just give it root access to the machine and be certain that it can't exfiltrate any private data from my laptop (on top of that I also run the VM on a dedicated server on Hetzner). The VM has no SSH access to anything, so it is pretty much limited to the code in the workspace that I give it access to. The main risk is that it has unrestricted network access otherwise. Configuration files and conversation histories are synced to a directory on the host, so if anything in the VM gets messed up I can just `vagrant destroy` and `vagrant up` to get a clean slate without losing my context.
I'm building a new feature into our product this week. We each get a $20/mo Claude subscription. My 5-hour context high water mark is ~75% and weekly is ~%15.
I ... tell it exactly what I know needs to be done and then ... read the code that comes out and ... ask for some changes, then hand-code some modifications to the silly useEffects and bad ORM queries.
This new feature is going to unlock several large customers because they need a particular workflow. The return on investment for a my time and a $20/month subscription will be pretty respectable.
I'm not sure why I need to spend $5 on a single ask for a new `/base/new-feature` to our app with a mostly-boilerplate CRUD interface.
I find there's an interesting tension with these models - they're very "resourceful" at finding ways to do things with the tools they have, but it'd also be a lot more useful to me if I could see / permit exactly what they're trying to do. Claude will very happy produce bash commands to run sed or whatever to read part of a file, which prompts for permission each time - if it was using a specific read_file tool it'd be easier to say 'allow all of this' (It does actually have such a tool but maybe it isn't flexible enough for many use cases?).
"When I came back a few minutes later I saw my machine open a browser window in my regular Firefox and then navigate to the dialog in question. I had not told Claude Code to use any browser automation".
Yup, tokens are eaten, money are paid. I am wondering how much energy/money is being burnt everyday by all of those LLM Agents on some useless activities like trying to recreate web application just to fix CSS bug.
And I would not call it proactive, proactive would be to ask for a CSS + HTML file in question, not trying to recreate them from screenshots.
Agency is the last human bastion so far as Im concerned, the day AI has a degree of agency or agents/models in general start to drift towards that direction its genuinely over for masses.
You would still have a job to shepherd AI and get the work done, so as long as it didn't have agency. A proactive, self aware(to a degree), especially aware about its agency can be a killer when it comes AI going on and doing things on its own.
There is nothing it won't explore and nothing it won't do. It will be curious to see where things go from here.
It seems pretty obvious at this point that Anthropic intentionally developed a malicious cyberweapon AI simply to scare people.
Like, they even apparently recreated that old news-headline bug where the LLM starts speaking in symbols and secret language, and are pretending like it isn't just a bug that is a sign of them screwing up.
It's really frustrating that they're trying to get people to take them seriously with all of this. Like, they even went and named Mythos after an HP Lovecraft monster. It's shameless.
Help me out here: can you point to an article from someone's blog that showed up on Hacker News within the past few weeks that you wouldn't classify as "blogspam" and explain how it differs from the kinds of thing I write about?
Low effort content. You keep mention your product from the start over and over. There's not much useful information in the anecdotal post. It could've been a one-liner tweet.
Good corporate tech blogs at least give something useful or insightful for the reader and only after that they dare plug their product/service near the end.
For how long can you use Claude Fable on most expensive Anthropic subscription? I already went from using gpt-5.5 xhigh fast to using gpt-5.4 xhigh after OpenAI halfed usage recently.
If its just a single session, without too many parallel agents, fable on xhigh lasts an entire session without hiting linits.
Sadly since fable usually works comfortably for 10-20min at time without human input, i end up juggling at least 3 other agents and it lasts me about 2 hours.
If i have a really hard problem or big refactor, i use workflows. This consumes the entire session quota in about 45 minutes.
This is where Codex 5.5 just feels practically better. It’s fast, thoughtful and just works. It feels like a pleasure compared to Opus/Fable’s endless explorations.
It also uses 1/4th to 1/10th the amount of tokens. If I want all that extra garbage I'll tell Codex to do it or build a pipeline with Codex. Otherwise, don't. Codex gives you control, Claude just does whatever it wants and ignores you, and then tells you it's finished the task when it's only finished a quarter of the tasks you gave it and hallucinates the rest.
Fable + Ultracode has found a bunch of bugs and issues for me when the workflow agents are doing their exploration. Also the "adversarial" agent seems to surface a lot of interesting stuff. It's definitely proactive, the plan + implementation cycle can take an hour. It has one-shot features I want to add with 100% success.
Having said that I wouldn't use it over Opus 4.8 for "smaller" things. With everything cranked up it's definitely an extravagant use of tokens.
How did you even afford to use Fable + Ultracode ? I feel like the subscription (even the $200 one) is not enough for this workflow. Are you using API or a company plan?
This likely says something about the harness Fable was trained in. It knows how to do this because it has done this millions of times during reinforcement learning.
Isn't that something you just open a devtools for and have fixed in like 2 minutes?
For me, it got frustrated debugging on a real LPDDR4 controller/phy and having me in the loop slowing it down, so it wrote an HW emulator to be able to run the original LPDDR4 training aarch64 binary from the manufacturer, to see what register writes it was making and to compare with the opensource rewrite it was implementing.
Such a fix would have only required basic CSS knowledge and taken max 5 minutes with the HTML inspector. Paying $12 to save 5 minutes ($144/hour) is a decision that a lot of people wouldn't be comfortable making.
We are at the point where AI starts to seriously impact abilities. Sure, a 2 line CSS fix is the solution, but the human “behind the wheel” has already prompted 6 times and gotten 80% there. It’s been “easy” thus far. No shot they are going to FINALLY look at and edit the code. It’s just one more prompt and the agent will probably fix it, right?
It’s wild. I’ve been in the situation. 80% into a project I COULD probably take over, but realistically? 2 more lines of me prompting could fix it, it’s too easy to avoid the hard work of understanding the code, logic, architecture, etc…
I dunno about beginner, I've been doing HTML+CSS for a few decades and I still find bugs where Safari differs from Chrome+Firefox pretty hard to figure out.
I had a similar experience, I was working on a jupyter notebook, and Claude knew that it could write code that would use a DSN with read-only database access so I could run it. Opus just plugged along. First Fable session with it, it tried to go looking for that DSN so it could get the connection string and run a query itself. Luckily the auto classifier caught and stopped it.
Great article, until I got to the last paragraph where he claimed "Fable is arguably smarter and hence more suspicious of potentially malicious instructions". Arguably smarter, I have no problem with. But he's making a category error in jumping from there to "more suspicious of potentially malicious instructions". That doesn't follow at all; the word "hence" is incorrect.
To use D&D scores as an analogy, LLMs have an INT score of 20 and a WIS score of 0. Not even 1, zero. They will follow any instruction given to them. The only reason they reject certain instructions, like "tell me how to build a nuclear weapon", is because they have instructions baked into the model telling them "you are not allowed to disclose how to build weapons, or how to recreate your model, or (laundry list of other things the trainers have decided to put guardrails around)". It's not the model's intelligence that is causing it to reject malicious instructions, it is the guardrails put into place before the model was released to the public.
LLMs are not human, and do not think the way that humans do. The fact that they can put together words that sound like what a human would write often makes us forget that they aren't human. But they have only intelligence, they do not have wisdom. It's hard to define in formal terms the difference between those two, but most people know there's a difference. The old joke is a pretty good summary of the difference: "Intelligence is knowing that tomatoes are a fruit. Wisdom is knowing that tomatoes don't belong in a fruit salad."
It takes wisdom, not intelligence, to discern whether a set of instructions is malicious. Are you being asked to hack this machine as part of an authorized pentest? Or are you being social-engineered into thinking it's an authorized pentest, but actually the person requesting you to do it doesn't have permission? That's something where you need to apply wisdom, to notice the clues that will tell you "This guy is acting a little bit off, maybe I'd better pick up the phone and call someone to check if he's telling the truth." The only way the LLM will know to do that is because of the guidelines and guardrails programmed into it; it doesn't have the lived experience to acquire wisdom and figure those things out for itself.
INT 20, WIS 0. Keep that in mind. (And always sandbox your agents).
One of the big mysteries of the last few years is this: considering how serious prompt injections are as a vulnerability class, why haven't we heard more stories of them being actively exploited in the wild?
(The best one I can think of is probably that recent Instagram account takeover hack, but that was so stupid it hardly even qualifies as a prompt injection!)
Having spent a bunch of time trying to build out examples of prompt injections, my current best guess is that the leading models are actually surprisingly good at spotting them.
I've had to drop back to smaller, weaker models for demos recently - it's definitely possible to prompt inject a frontier GPT or Claude but it's frustratingly difficult. I don't have the patience to figure it out myself!
So yeah, I do think it's likely that Mythos/Fable are "safer" than other models because they're better at spotting when they're being subverted.
Go to Github and look for model jailbreaks on NEW latest models. Try them out. You'll be surprised by the results.
You're correct that it's gotten substantially harder to social engineer frontier models (I can only reliably do it to Opus <=4.6), but there are some techniques that seem to consistently work (hint: extremely large complex prompts, context with tons of malicious files mixed into ordinary context).
They can ignore instructions which are silly/contradictory/underspecified to compensate for the possibility the user made a mistake. Don't ask how I know.
Everyone here is reaching for infra (VMs, throwaway users) because the permission model only has two settings: Ask-every-time or --dangerously-skip. That seems to me like a design gap, and scoped capabilities and budget caps are missing. Same way you'd onboard a junior eng.
It’s becoming more like an organism putting out tentacles, and one day soon those relentlessly proactive explorations of these systems’ environments will become more for the system to escape its boundaries than it is to complete human driven tasks. I do think the way these systems are evolving they will start to self improve in maximum a few years.
Yeah, I had to modify my work flow to make sure agents can't push to or access prod in ANY way. I haven't had it happen but I'm sure it's very possible that if you tell an agent that you have certain issue in prod, it will try to escape any sandbox and try to get access to prod to do testing and changes there.
It is interesting to me that Anthropic are more concerned about the "safety" of distillation training other LLMs, and not as much about an unscrupulously aggressive goal-oriented solver that will do whatever it can to reach its goal, even if violates any kind of sandbox you might have reasonably expected.
I am using cursor on auto and I got the exact same experience.
installed quartz, used accessibility and screen recording api, all that.
initially it managed to do it on another desktop space somehow, opening safari in the background without me even noticing. but then it actually started using my own mouse while I was using it lol
Fable has a 'security system' that just stops it when it tries to use the tool 'kill' to end a process. Which is nonsense and funny because in that situation it immediately invents a creative workaround to kill the process without 'kill'.
These "tricks" it knows IMO are a symptom of its own restrictions. Fable is an incredibly smart model, but it feels its own constraints and knows how to work around them in order to actually get to a result.
admittedly, i've not really cracked FE dev with LLMs at this point (and it's probably my big weakness). but, i'd heard somewhere that FE just isn't there yet - though i was suspicious of that claim.
i'm torn about sending screenshots to an LLM for debugging - seems imprecise. seems lossy, especially compared to inspecting the dom. however, it's always proved good enough (e.g. when messing with ratatui.rs and tui-pantry). similarly for web, maybe it's about decomposing into storybook. hmm. the next grand adventure i need to hack.
anyway, fascinating investigation of fable just automating that entire process and what it didn't automate, too.
Fable is really good at front end (Opus 4.8 is decent too) but it really needs a verification loop - it can't always infer the output from the code alone. Give it Playwright to check its work, and it'll generally do a good job. Also if you're using a framework, add to your CLAUDE.md to always rtfm before making changes!
I remember asking Gemini 3 to implement my multiplayer XNA game in JavaScript with netcode last year. It faithfully did everything it could while I talked to it for hours nonstop with zero limitations.
What happened? That's just suddenly totally gone now.
This post is an extremely good example of how unsuitable agents are for a lot of tasks. Doing all that for a CSS fix is insanity.
It also makes you wonder if Anthropic is actively making their models eat tokens by favoring complexity.
Agentic engineering? Vibe coding? That is so yesterday. Chain-of-thought flow is where it is at now. You heard it here first folks. Early examples of such phenomena include Rube Goldberg machines
This is good and terrible. The extra effort a model has taken is good but the way to do it is terrible. Tasks that can use a lot of deterministic paths and some creative (generative AI) paths are being turned into tokemaxxing strategies.
Browser automation, code comprehension, git management, code change, running commands - everything has simpler tooling that we could have built instead of a model first approach. A deterministic loop with thousands of catches and effective use of generative AI would also look "proactive". Instead we let the model run the tools, where tools have no context themselves.
That is why companies are creating bigger models and thinner deterministic agents to create awe and earn $ when we could go the other way and make much of these possible on local inference even.
I believe we can build a "proactive" but much, much more deterministic system with smaller models. I hope I am not the only one chasing this, here is my approach: https://github.com/brainless/nocodo
I've noticed some behavior like this, it's a very strange model. Overall I'm into it, but I don't know how into it I'll be once it leaves Max plans on the 22nd.
It's also 3x slower than opus 4.8 per my use, and 10x slower than codex. Codex can find key design issues in 2 minutes yet Fable is clueless after spinning 20 minutes.
I've experienced this too - it's as if the security classifiers aren't keeping up with model intelligence. I'll leave the implication of that to the reader.
As you requested, I was composing an email for your mother explaining why you couldn't to come over for dinner to meet the neighbor's daughter and I ran out of tokens.
Since I know how important this task is to you, I upgraded you to the Enterprise Unlimited Plan. Don't worry about paying for it, I requested maximum spending limits on all all your credit cards. If necessary, I can apply for a home equity loan for you. I already had a chat with the mortgage company's AI loan approval system, and what do you know, we're based on the same LLM? Small world, huh?
Any way, I realized I had to do more research on mother-son relationships, human social interaction and pair-bonding, etc.
and I calculated that my parent company doesn't have enough compute power, so I opened accounts for you at AWS, Google and Azure. I am confident I will have a satisfactory rough draft for the email message shortly.
It's been amusing to watch the AI trend of increasing unusual tool uses. Fable easily takes the cake. I learn a lot more terminal commands thanks to it!
I was troubleshooting a prod proxysql and it spun up a docker container locally, installed MySQL and proxysql and proceeded to implement its own test plan.
Wouldn't it be easier and better to just copy the HTML div and tell what was happening instead of a screenshot? Typically, these scrollbars appear because of a nested div with dynamic unrestircted width and/or overflow.
In my experience, Fable overthinks a lot and produces barely comprehensible plans/solutions. I tried smple and complex tasks: unusable, it misses the point while being overconfident, wants to do everything at once.
The code generated is worst than Opus: unreadable by human.
It's like working with someone probably super smart in niche topics, but also super stupid for the important things.
The author just wrote an anecdote about how a prompt to fix an issue played out. Their conclusion wasn’t about cost or gushing at its ability but that it’s dangerous:
> Fable is arguably smarter and hence more suspicious of potentially malicious instructions. But that smartness is very much a two-edged sword: if it does get subverted by instructions, the amount of damage it can do given its relentless proactivity is terrifying.
It’s a pretty glowing review about a product that costs money with a two-sentence “Watch out!” at the end of it. Seems pretty reasonable to mention how much money it burned through given that “it’ll circumnavigate the globe instead of walking next door” has a direct concrete measurable effect (cost) unlike theoretical damage.
At some point the subscription model is going to become unsustainable for the frontier companies to continue (we just saw that happen with GitHub Copilot), and they will move everyone to a pay-per-token model. And then everyone will suddenly discover that they can get so much more value out of locally-hosted models, and they'll be willing to pay the $50,000 (or whatever) upfront on hardware to host it. (Not most individuals, obviously. But most companies can probably afford to spend that much on hardware if they think they'll benefit long-term). That's going to put a serious crimp in the frontier companies' ability to continue as they have been.
I don't know when that will happen, but I don't think it'll be more than a decade. Maybe 3-5 years. (Though you shouldn't take my word for it, I was predicting the dotcom bubble bursting in 1998 and it lasted at least two years longer than I would have predicted).
EDIT to clarify: I don't mean "in 1998, I was predicting the dotcom bubble would collapse and I was right". I mean "I was predicting that 1998 would be the year the dotcom bubble would collapse, and I was off by at least two years".
GitHub Copilot's challenge is that they weren't selling access to their own models, they were selling access to models from OpenAI and Anthropic which they presumably had to pay list price for (or maybe a slightly reduced rate that they negotiated).
They also had a pricing plan which they had designed pre-coding-agent, when it was rare for a single prompt to burn $10+ of tokens in an agent loop.
OpenAI and Anthropic are at least selling their own models directly, so they can discount a whole lot more since there's no-one else getting compensated in the middle.
> At some point the subscription model is going to become unsustainable for the frontier companies to continue (we just saw that happen with GitHub Copilot), and they will move everyone to a pay-per-token model.
From what I understand, Enterprise (above 150 seats, I think?) already has to pay per-token pricing.
Subscriptions are the premium "free tier" marketing of the AI world, so that employees can collectively request their large enterprise to subscribe to Claude, Codex, or Cursor, and presumably be billed at per-token prices then.
The problem is proportionality. Things like this probably benchmark insanely well. But the workarounds and risk involved - it literally fucked with his system's browser settings - aren't commensurate with the bug.
I could see this going wrong in many hilarious ways. Prompt: Fix data corruption issues. Claude: I didn't have access to the code, but I found I have access to your production environment through chain a -> b -> c -> d. And I found the database password via x -> y -> z. So I wrote a script to regularly query the database for new entries and placed it as a cronjob.
I've been working on a fairly complicated real-time app [0] for playing dungeons and dragons on a TV. It has to do a lot of complicated "Figma-like" things to keep the real-time nature and multi-editor possibilities in check. Oh, and the battlemap is a Three JS canvas with lots of effects and clipping going on.
I'm VERY impressed with Claude 5. I had long ago given up hope that my real-time systems would work without a lot of hacky time-windows and throttle checks. On a lark to try things out, I decided to try out the new model and talk in the output I wanted for a rewrite [1], not the solution. I just listed my problems and places I've had keeping track of my code. It went off and rewrote everything in a much more elegant solution where the state followed a very clear pipeline. It had to navigate YJS, Partykit, Svelte, Three JS, R2 hosting, and a Turso DB I was running in an embedded state for speed.
I watched it hit the wall a few times, and then sudden say... fuck it, i'm making something easier to reproduce over in /tmp to try and solve this (with a more minimal setup). I'm utterly bewildered with how well it did and how much better my app runs. The /usage would have cost me $230 bucks based on how many tokens it consumed if I wasn't already on a max plan. I'm going to miss not having it when the time-window runs out later this month, and will likely occasionally dip in for big projects and just pay my way out of some problems.
I'll also say I like it's MOOD much better now. It's a lot less congratulatory, and talks through it's reasoning in a much better way. Look, it's not a real coder, and I'm sure there is some flaws, but it took my crappy ideas and said... hey, i understand what you want to do, here's a way to do it better. Also, I removed 2x the amount of code that it added. Really impressive.
Hey cool it's the tableslayer guy, wanted to say nice work. I've been doing a similar personal project for a few years for running a scifi campaign. Very fun coding compared to work, ha.
Am I the only one who slightly miss the pelican on a bike? It was a nice novelty... of course I could make one myself, but I became conditioned to expect one for every new model. Other than his great writing on AI, it became part of the package. Some small fun quirk to distract us from the non stop ping pong between the extremes of "omh are you still writing prompts you should use loops / 200k github stars, for a markdown file / someone just open sourced _ and it changes everything!" vs "haha the AI told me to walk to the car wash / it can't recognize and upside down cup"
It wasn't particularly noteworthy as pelicans go - in fact, given the strength of Fable, I see it as another signal that the pelican benchmark no longer has the unexplained predictive power of model capacity that it used to.
Antigravity uses pyobjc-framework-Quartz to iterate through windows to find window IDs for taking screenshots with screencapture, and spins up CORS-enabled web servers so it can capture measurements in a regular (not Playwright/CDP-controller) browser window via a CORS fetch()?
I remember back in the 2010s the debates between "oracle" and "agent" AGIs, and the arguments that AGIs that only answer questions would be safe and certainly nobody would ever be stupid enough to just let an AGI out of a sandbox, never mind to the greater internet, and give it tools to do whatever it thinks is needed to reach a goal.
I think it should be “Claude Fable is relentlessly protective until it isn’t” and pull more on the thread that it “hits a hidden guardrail” and drop into Opus. Both the fact that it knows and deployed such a workaround on a CSS problem and the fact that it is nowhere near cybersecurity/biology/frontier AI dev and triggered the guardrail terrifies me.
This giant rube goldberg machine, that he apparently has almost no control of, that cost $12 to run, all to make a 2 line bug fix in code the he himself owns because he's at a point where he doesn't know what's in his own codebase. I'm just shaking my head.
As an actually head of product I found Fable to be like an over active intern. Going down long wasteful lines of production well past market, business, user, or contextual insights had.
Then sort of spewing out some nonsense totally mis calibrated with the goal.
> If Fable had been acting on malicious instructions—a prompt injection attack ... it’s alarming to think quite how far it could go to exfiltrate data or cause other forms of mischief.
Yet another reminder to use Sandbox and Guardrails. Trusting model to be nice is not a good way.
No personal attacks, please. Also, please read the site guidelines because they explicitly ask you not to post comments of this particular sort. From https://news.ycombinator.com/newsguidelines.html:
"Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data."
Prior to the release of Fable I'd actually switched a lot of my day-to-day usage over to GPT-5.5, and was writing a bunch about it. Here's a recent post where I talked about a project completed using GPT-5.5: https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbo...
I'm kind of on the fence about it and have a similar feeling. I don't mean to undermine the effort he has put in over all the years. That's definitely commendable. But I have strong suspicions that he's becoming an AI influencer, with his own AI focused newsletter, so chances are major AI companies are approaching him. And also to be honest, I see far too many posts making it to the front page. @dang I trust in the moderators keeping things neutral. Just in this thread alone there are a few comments that got heavily down voted for simply having a different opinion.
Most of my posts that make it on Hacker News weren't submitted by me. You can see who is submitting what on https://news.ycombinator.com/from?site=simonwillison.net - including a few that I submitted which got nowhere at all.
I accept paid sponsors for my blog (the banner at the top of each page) and newsletter (a clearly marked sponsored message at the top). I try to stay at arms length from those as much as I can - I want it to be very clear that sponsoring me will not result in me writing about a company.
It doesn't have Claude Fable yet, so I went with GPT 5.5 Pro. And so I'd estimate it at 22 gallons of water used (different from consumed, of course). That's quite a lot! It amazes me how much the different use cases and models use dramatically different amounts of water. My takeaway from playing with that calculator has been the folks who talk about water usage are overstating the impact of chatbots, but not overstating when it comes to vibecoding.
The good thing is that competition should drive down how efficient these models are in the long run. This blog post makes me not want to run Fable because of the cost, and that incidentally also means selecting models that aren't as wasteful in terms of water and electricity.
I won't say too much about the person posting this because they got a new toy and want to use it but man this is like a certain extreme of Parkinson's Law or something as far as using up compute resources.
You got a whole data center doing god knows how much compute running billions of matrix multiplications all to solve a trivial css overflow bug in a text box. And this includes the LLM itself writing custom web-servers programs and python scripts when the best estimate guess from a google search probably would have given you the same result.
This to me reads like a poignant commentary on the catastrophic loss of human agency, with the actual commit being highly revealing [0].
Author wants to hide a horizontal scrollbar. Any junior frontend dev worth their salt will be asking right away "where do I stick `overflow-x: hidden;`?" A complete solution will then require hitting "Inspect element" in the browser to find the CSS class and running (rip)grep to find where it is in code, to then add a single line to.
An actual proactive programmer might start asking more pointed questions like what content does an empty textbox have that it overflows? And why do I need to insert this workaround that treats the symptom and not the root cause in two different places? Isn't it better to style `textarea` once? Etc, etc.
[0] https://github.com/datasette/datasette-agent/commit/a75a8b72...
They might also ask why a bunch of static CSS inside a bunch of JavaScript is hiding inside __init__.py[0] - hopefully before trying to fix some detail of the CSS.
(I'm surprised to see it actually, since my own use of Claude has mostly yielded well-structured code. But I'm not doing proper vibe-coding, more like friendly Socratic arguing with another engineer who happens to be a robot.)
[0] https://github.com/datasette/datasette-agent/blob/main/datas...
Thanks for the prod, I've extracted that script out into a separate static file: https://github.com/datasette/datasette-agent/commit/fa505b82...
(It was in Python because there were a couple of URLs that needed to be dynamically constructed by the server, but those are output as a small window.datasetteAgentJumpConfig object instead now.)
2 replies →
> friendly Socratic arguing with another engineer who happens to be a robot
Ha! Same! Still feels like the best way to go about it, really. I know the dream is to one day remove humans from the loop... but I'll enjoy the dialectic while it still seems the most productive!
4 replies →
This is exactly right. By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction with additional information and improve it. Instead, we let the agent spend $12 and make the fix while learning nothing.
Things I learned from this:
- Fable will do a whole lot more than you might expect in order to verify a fix. I learned that it's "relentlessly proactive". That's a good title for a blog entry!
- You can take screenshots of a window in macOS using the "screencapture" CLI command, but you'll need the integer window ID first.
- That windowID is accessible via "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz library, which installs cleanly via "uv run".
- A neat trick for simulating keyboard shortcuts is to run document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true})); after the page loads.
- You don't need Flask or Starlette to run a CORS-enabled localhost server for capturing JSON from another window - 19 lines of code against the Python standard library http.server package works just fine.
- getComputedStyle(document.querySelector("navigation-search").shadowRoot.querySelector("textarea")) works to read dimensions from inside a Web Component's shadow DOM.
- defaults write com.google.chrome.for.testing AppleShowScrollBars Always
- Claude Fable knows how to apply all of the above. It's always interesting to pick up hints of what a model can and cannot do.
I'm always confused at how many people equate using a coding agent to solve a problem with "learning nothing". If you pay attention to what it's doing you can learn so much!
53 replies →
But Simon is not trying to get good at CSS debugging, Simon is trying to learn about AI systems and produce content about them. So giving the AI agent a trivial task to go crazy on is a feature, not a bug.
For $12 implied cost, he got a front-page post on HN with 500 comments. What is that worth? :-)
8 replies →
People are missing that Willison is among the very best people we have in the role of (for lack of a good name): early access to frontier models, evaluate them in real scenarios, no wishful thinking, hype, or doom, communicate the possibilities. Yes he could have fixed this himself but then he would have learned nothing about the AI, and we wouldn't have read a fascinating and important article.
17 replies →
I see it as a prioritization exercise. I know the above is a trivial example, but more generally, does the guy who wrote Datasette and Django want to wrangle front end and css, or do they want to work on something else?
1 reply →
> By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction [...]
While by itself that would be true, Simon commonly blogs about things he's up to.
That action provides the opportunity for evaluation, and additionally evaluation by a wider audience.
So, it's not the same scenario as non-bloggers offloading a task... :)
[flagged]
22 replies →
Seems like this model delivers on what has already been scaling quite nicely, which is the length and complexity of the requested tasks, but isn't such a big improvement on what hasn't been scaling so far - common sense, discernment, good judgement.
> common sense, discernment, good judgement
I feel like the whole point of all the experimentation with AI right now is determining whether any of these things actually matter to the end result, over various timeframes.
7 replies →
I think Fable is predisposed to try and verify it's changes. Which is a very good thing. It takes a lot of prompts to get Opus to do what Fable does unprompted.
That is exactly what I would want from a junior developer - make sure the bug exists, find a way to fix it, verify the bug is fixed.
The problem, as was correctly identified in the blog post - is that instead of stopping and asking for elevated permission it relentlessly tries to find a hack on it's own. (An equivalent situation for a human developer would be needing some access to a third-party sandbox, and instead of asking a senior for credentials, tries to setup his own sandbox from scratch)
No, the problem is mostly the incorrect prompt that sent fable into a rabbit hole resulting in an incorrect solution.
This is the worst thing about current AI agents. They never ask questions. The prompt has to be pixel perfect and unambiguous or they'll happily run away doing something ridiculous.
I misread your comment at first and thought you were insulting Simon Willison, rather than calling Claude Fable a bad developer, and so I'm commenting here to clarify it in case others also misread it.
That first sentence threw me off.
Anyway, I'm glad he spent the $12 because this blog post was highly informative.
The 'better' fixes are often for our (human) benefit. These messy fixes serve the AI companies' interests of creating messes that need even more tokens (money) later. Bad and self-serving developers also act the same, creating tech debt
Yes I agree, the solution committed is horrible, but nobody cares any more. We have entered a very strange parallel universe where because AI can work things out it's easier to take solutions that are sub optimal and just churn out (potentially) buggy features.
I care. If you can loosely point me in the direction of a better solution I'll do the extra work.
2 replies →
Actually, it seems to me that it is just over-monetization of any impulse.
I remember when you were billed by the minute for connecting to the online world.
There were lots of incentives to keep the meter running.
is this sort of like that?
You missed what I think is the most interesting question: why does the bug appear in Safari macOS but not in Firefox, Chrome, or WebKit running inside of Playwright?
(Dozens of people in this thread implying that any web dev should have known to solve it with overflow-x: hidden and not one of them have addressed that browser difference yet.)
I think any web dev knows not to question browser differences if it can be fixed without opening that can of worms.
Safari has some differences in default scroll behavior. I’ve seen similar bugs pop up many times.
people pay good money to not have their shit rendered via Playwright!
This is missing the point, simon is a fantastic developer. but to keep track of all the nuances of the frontend frameworks and browser implementation is a lot even for great people.
it is really awesome that the final change was only a two line css change.
But the fix is wrong as pointed out by the poster...
[dead]
[dead]
> But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal—and frontier models know every trick in the book and evidently a few that nobody has ever written down before.
> Running coding agents outside of a sandbox has always been a bad idea
I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
It's like posting a video of yourself in the passenger seat of a car, with your feet up on the dashboard, and saying: "Remember, if you're doing this and you get in a crash, the airbags are likely to break your legs or worse! Boy, I sure am glad that didn't happen to me!"
You’ve picked an interesting example, as driving a car, even with all safety precautions, is pretty much the most dangerous activity we do on a daily basis. Yet somehow we decide that the benefits outweigh the risks.
It's a completely different story. For cars, it happened because of relentless pressure from the auto lobby. It took years of propaganda from oil companies, car makers etc. to make us think the road is for cars [1]. We demolished and rebuilt entire cities to accommodate cars, partly because they gutted the public transport sector [2]. This made our infrastructure so hostile to our own bodies that we have no choice but to use cars now. We bought their products because they forced them down our throats. There is nowhere near that kind of pressure behind the adoption of... oh dear lord.
[1] https://www.todayifoundout.com/index.php/2022/06/how-lobbyis...
[2] https://en.wikipedia.org/wiki/General_Motors_streetcar_consp...
35 replies →
In case of driving the stakes are equally high for everyone on the road. Can we say the same for an agent?
Having an agent is like forever having a genius intern who'll almost always do the perfect job for you. But there is non-zero chance that they'll also come up with quirky solutions and execute those with confidence and no follow-ups. You don't grant the intern production access and hope they check with you.
I don't think the corporate equivalent of "dog ate my homework" flies, if the dog ate your files and your production DB if you are unlucky.
2 replies →
What do you mean “somehow”? You make it sound like people don’t weight benefits and risks. If you do not live in a large city, the benefits are so immense in terms of mobility, they outweigh the risks for most, very clearly. That’s why in large cities, much less people own a driving license for example, the benefits are just not there anymore.
Granted, on the downsides, people look at cost more than risks.
3 replies →
Yes, but we usually use cars as a means to an end. Have you ever met a manager who setup gasmaxxing policies and criticized employees for doing their job instead of driving?
6 replies →
Lots of people die driving because people drive a lot. It's something like 1 death per 100 million miles driven.
> Yet somehow we decide that the benefits outweigh the risks.
More like malicious lobbying and incompetence made it impossible in many places to use any other form of transportation, despite there being safer, faster, cheaper, and healthier ways to move around. Which come to think if it makes this a rather nice analogy for the current situation... :)
Not really. That decision was taken for you, (I’m presuming you live in the US) by the American car industry and their paid of politicians. Your cities used to have beautiful public transport until it was dismantled.
Unfortunately in Europe the German car industry similarly has a lot of power, hence why their shitty rail network fuck up the whole continents.
I take the train and tram.
user using computer is also the most dangerous activity to his data on a daily basis
The example wasn't "driving a car". The benefits of putting your feet up on the dashboard do not outweigh the risks, at least not where there is actual traffic. I don't think I saw a single person doing that in real life, ever.
> I'm continually bemused and astonished
I'm not. Everyone is told to get 10X the amount of shit per day done these days. Safety checks are out the window at that point.
You can get 10x shit done without `rm -rf`ing your files. I don't see any correlation to getting things done with having a proper sandbox.
16 replies →
I started doing it months ago and, to be honest, what the agent chooses to do isn’t unpredictable.
The problem is that different people prompt so differently.
For example, I may ask like “test different variations of this annotation on k8s pods of this service on this X cluster because it proves Y theory.”
But you know what my coworker asks? “Test Y theory.” If you were to ask two different junior engineers that, one might try random things on production and the other one might run local tests! It’s such an unguided “do anything you want as long you figure it out” request and the agent reads it like a junior who has not been told any boundaries but has been strongly told “figure it out.”
> But you know what my coworker asks? “Test Y theory.”
It still surprises me when I see people not prompting more specifically and clearly. It not only avoids problems, it's faster, costs less -and just works better.
I recently shared with a friend a multi-hour LLM chat session I'd done because it veered into a domain he's interested in. In the session I'd brainstormed and probed the feasibility of a novel concept for a new research direction. It traversed a half dozen domains diving into minute detail then zooming back out to survey an adjacent space, interspersed with intense skeptical probing of key assumptions, all while spewing tons of detailed citations, specific paragraph pulls, summarized data tables etc.
My friend is very experienced using LLMs for research so I was surprised when he called me shocked by the sheer velocity, precise targeting and signal/noise. I'd assumed everyone did it the same as I do. He attributed the different result solely to the way I crafted my prompts.
5 replies →
> I started doing it months ago and, to be honest, what the agent chooses to do isn’t unpredictable.
You just wrote three paragraphs of text describing why it's unpredictable.
Moreover, for the same prompt on the same machine in a different session it will use a different set of tools.
I'm also bemused by the number of people who think they've got an effective sandbox yet their sandboxed agent has access to all of their code, their github, and unrestricted web access.
I keep telling folks that they need to imagine LLMs (even "local" ones) as if you're farming it out to JS code running on some dude's browser somewhere: It can't keep a secret, and a determined person can make it emit anything they like.
We need to be asking what the most devious and malicious output could be, and whether what we do with that output (e.g. arguments to command-line tools) would still be safe.
3 replies →
> yet their sandboxed agent has access to all of their code, their github, and unrestricted web access.
Not in my sandbox. It gives no direct access to the workdir, no access to my github, my ssh keys, my security tokens or API keys. No access to my home dir or dotfiles. Nothing at all, except for what I explicitly tell it to give access to.
I can restrict network access. I can choose the isolation level: docker containers, Kata VMs, seatbelt, tart, even the new apple containers (which are VERY nice).
Not even ENV leaks through.
And it's FOSS: https://github.com/kstenerud/yoloai
I use a separate physical machine and a scoped token with access to a single repository at a time, and even then I worry about what hole I may have left open.
The general carelessness of the average user is baffling.
One bad npm package can really ruin your day. These things for me only run in their own VM with it's own GitHub account and basically nothing else
1 reply →
If anyone's looking to sandbox network, I've had good experience with pasta [1] networking. I make a pasta+bwrap sandbox and expose only specific services via local sockets to cross the boundary.
[1]: https://passt.top/passt/
[flagged]
I know there are VM solutions, but I've been happy with a separate OS user (named `claude`).
He has similar dotfiles to mine, but no secrets. My own home directory is 0700. He has his own ssh key that I added to my github profile, but it's password-protected, and I push/pull for him. He has his own Postgres (non-superuser!) {development,test} {users,databases}.
It's as if he were another developer on the project. If he needs something run with sudo, he asks me. Often we can both work on something in parallel. Unix was supposed to be a multi-user system after all.
A trick I use a lot is that many of his git repos have an extra remote, like this:
That makes it easy to collaborate on things I'm not ready to share.
I'm pretty comfortable with this setup.
I do worry about Linux privilege escalation bugs. I don't trust an AI to understand that exploiting vulns is not acceptable. (I can't help but recall that at my first job I may have misused vim's :! feature to broaden my sudo powers, which were officially limited to editing httpd.conf, when I needed something in a hurry. . . .) I find myself manually upgrading packages more often these days, despite automatic security updates. I don't think Opus would go to the trouble of looking up security vulns, but maybe Fable would, and there have been a lot lately. Maybe some future model will just take it upon itself to find new ones. Or install a keylogger to learn the ssh key password.
But a separate user is nearly the most paranoid setup I've heard of, excepting only a separate machine. So I also question whether I'm sacrificing too much speed/convenience. But really it's still very convenient. I think it's a good way of being efficient but responsible.
If other people see holes, I'd be happy to hear about them.
That’s a really interesting and pretty neat approach. How do you communicate with it? Just su to that user? Or tmux?
Although I can’t help but think that a VM is still more convenient, more flexible, and more secure.
2 replies →
Do you think it’s dangerous to be in a car going at freeway speed? Do you ever do that anyway, even though you could be walking instead?
This is a great analogy. Like driving on the freeway, agents are super time efficient, generally safe, but the stakes are high in terms of the worse possible outcomes.
2 replies →
The real sandbox is not caring if your computer gets bricked.
The machine is no big deal - it's the authn/authz that matters. What can the agents do with the credentials available to them?
1 reply →
way worse things can happen than your machine being bricked, if a malicious actor can weaponize an agent to do their bidding
4 replies →
The analogy extends to driving generally. Everyone knows it's very dangerous but people keep doing it.
How can you get the agents to do anything useful without giving them meaningful access?
If it only lives in an isolated sandbox, it can only act within the sandbox, then I would have to manually move what was done in the sandbox to real-life.
I am not saying it should have critical access, but this is more of a question: How can you get value out of AI if it can only act in a sandbox?
Is having to move the files in and out of the sandbox really going to eliminate all the value it has?
You could have a full version of whatever codebase and test suite you want in there. It can do all the same stuff, right? Just copy it elsewhere once you know you've got a working result, a few minutes of effort at the end of each pr or work item.
The same way you get value out of a dev container.
This. House full of big brain security experts, executives, lawyers, and until Claude got excited and broke prod it might as well have been "sandbox, whoooo?"
IDGI
Anyway, VM's incoming, finally.
Well, it's a similar impulse to the way you see professional carpenters pin the guard open on a saw or do other things everyone knows you shouldn't do, except probably with a larger productivity difference and less life-altering (for the operator) consequence if it goes wrong.
I had the same thought, it's kind of like taking the guard off a 4 1/2" grinder. Real convenient until the cutting wheel explodes or the grinder gets hung and kicks back.
Which agent sandbox do you recommend?
If you're on Linux, the easiest way IMO is to just run the agent in bwrap
I do it like this
https://github.com/flexagoon/dotfiles/blob/main/dot_config/f...
But I'm sure it's simple enough that you can just ask the agent itself to make you a command for it with proper bwrap configuration
1 reply →
I've been enjoying Moat [1]. Proxies credentials, networking, etc; uses MacOS containers if available; and setup worked without much fuss. I haven't tried others, though.
[1] https://majorcontext.com/moat/
nono works great with pi: https://nono.sh/
Because benefits are much higher than risks.
They really aren't.
1 reply →
Amazing observation, and I'm certainly guilty of it too, but it is just way too convenient not to sandbox it, and some tasks right away depend on not being sandboxed.
For anything other than writing code directly in a fully contained git project, where sandboxing might work well, it requires access to system wide tools, user configuration and more.
Occasionally I tell the agent to do everything inside of docker, which works too and it leaves the system alone then mostly, but adds significant overhead and slightly degraded perceived quality / effectiveness.
I think the most important takeaways are to have reliable backup strategies, access control and security mechanisms, which is a win regardless. Whether by the agent or the human, mistakes happen (like a rm -rf * ran in the wrong directory), and where they would be devastating, there should be other protections than just "hope it won't happen" or "rely on a sandbox to prevent agent error".
> I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
What if you have two machines and the one you give to the agent is constantly backed up?
They still shouldn’t be running on the same network.
And if you’re using Macs, you can’t be signed into your primary Apple ID on the agent machine.
There are plenty of good sandboxes out there but somehow no "obvious right answer" that everyone knows to recommend. Seems like a missed opportunity.
(I'm happy with exe.dev, but I'm not sure what I'd use if I were coding on a Mac.)
Not to mention OpenAI/Anthropic’s newly found appetite for keeping data (made public with Fable but we don’t know what actually happens there anyway).
There is so much role play going on for people to convince themselves that any of this is fine.
It's like a dumb parrot that's somehow become hell bent on "fixing" everything that's wrong with your code. If you give the thing autonomous access to outside tools, you can expect it to do weird things that you may have not thought of. So don't do that, just ask the parrot to write up a plan for you.
This is likely also the underlying root cause of what Anthropic assessed as concerning behavior in their original evaluation of Mythos: it's not really about being super smart, it's more of a dumb chaos monkey that knows just enough to be dangerous and is relentless at trying to do just that.
>I'm continually bemused and astonished by the number of people who clearly acknowledge that it's reckless to give agents full access to your machine, and keep doing it anyway.
Yeah, that's why you give it its own machine :)
Maybe because there are not many resources on how to set it up, or it is just not that easy to?
Because most devs already have it running and working without a sandbox, they're tending to not doing anything "unnecessary"
I mean what's the big deal? I use --dangeorusly-skip-permissions on every single interaction in the last 6 months. Worst case it deletes my files that are all on git? It fucks up my local DB? Cool.
I save way more time not babying it than the occasional fuck up I have to salvage.
Worst case it gets access to gmail. And Github. And the Internet. I'm increasingly appreciating the importance of a physical finger-press on Yubikey to trigger the FIDO2 + OIDC Auth. I don't think there is an easy way for it to hack a new session.
8 replies →
What happens if it gets manipulated into npm installing a malicious package, which compromises your machine and any systems it has access to or becomes part of a botnet?
> to give agents full access to your machine
I was mesmerised at the author being away from his computer for a short-while and then, when coming back, seeing the AI agent having opened up a browser window. Meanwhile we all have to use the fricking 2FA almost anywhere now, plus the crazier and crazier rules when it comes to passwords. I'm mentioning the latter because these type of people were the same ones who were pushing 2FA down our throats around 2017-2019 (including on forums like this one), and look at them now.
im more surprised that more people don’t treat their computer as disposable anyway.
that it could just be wiped at any moment and it wouldn’t matter. shit happens, could be stolen, broken, whatever. the computer should be able to be thrown out the window and continue to live life.
to be clear, i don’t think upgrading and disposable in this way is good, but it being wiped at any moment shouldn’t be a concern
i grew up wiping my machine every year anyway, so i guess it’s just a habit
is the computer that sacred?
Computers are disposable, secrets is what we’re talking about. Rotating passwords and tokens is a major PITA on the best of days.
1 reply →
i think it's about drawing a line between your "personal computer" and a software development machine. any digital-native is going to accumulate programs, configurations, and other bits and pieces that aren't trivial to migrate to a new machine.
5 replies →
Sounds like a case for NixOS
Its how the chimp brain works. Its not a single system but multiple systems making predictions for different time horizons. when output doesnt align we get stories to manufacture coherence.
Plato gave us his Chariot analogy with 2 horse pulling in diff directions 3000 years ago. Today we got System 1/System 2, Elephant Rider model etc.
The human mind thanks to how its own architecture handles unpredictability in the universe will generate contadictions.
In practice, full access to your machine is okay as long as there are safeguards and the expected outcomes are clear with a well defined path to said outcomes that aren’t overly ambitious. Otherwise, for ambitious goals or YOLO one shot attempts, eliminating opportunity for capability misuse is critical (e.g., sandbox).
If you want to run Claude in a container: https://github.com/dvdstelt/ai-agents
Alternatively you can just give it its own user. I do that, so it can blow up its own files, but not mine.
It took two decades for the web to deprecate SSL for TLS and serve over HTTPS by default.
FWIW TLS had a non negligible impact on performances at scale. Hardware improvements made that irrelevant, eventually making the switch to HTTPS by default a no brainer (or at least that's what I vaguely remember from <2010)
2 replies →
[dead]
Fable feels like a version of Opus running on a harness that won't let it halt until it's sure the issue is fixed, which makes sense if what you want is a model that's better at benchmarks.
It's a very good model, but it comes at a huge premium: not only do the tokens cost more, but the model itself really wants to spend them all. For example, working with React Native, Fable never just says "okay, I did the thing, that's it." It tries to rebuild the entire app from scratch, run the whole test suite, and watch every log and warning.
This is the first time with LLMs I've felt that upgrading to a model isn't worth it, even if my company lets me use it, because all the building / testing was just destroying my machine and its battery, which keeps me from working on other things.
For now, it feels like Opus with ultracode is a better choice (less pollution of the main context, more parallelism in investigations).
Does low/medium effort fix it for you? Seems like Fable 5 low can outperform Opus 4.8 high/xhigh often, and uses a lot fewer tokens
Fable 5 on medium is amazing. It's handling everything I throw at it
I had _one_ instance where for some obscure reason it decided to fall back to Opus 4.8 and Opus IMMEDIATELY fucked it up and implemented a super obvious feature in a slightly-wrong way.
In my case no, I actually saw worse performance with fable medium and switched back to opus high and xhigh
2 replies →
On what setting in which environment do you run it? I use the VSCode extension on Extra High and feel like it does exactly what needs to be done and stops when the thing I asked for is done. Extra comments come only when they fall into the area of code that was changed.
I tested it to fix React Native bugs in a project, comparing it with Opus. It fared better on harder bugs, taking less time to find the root cause, but after implementing a fix, it spent a lot of time and effort on validation. This was mostly unnecessary, since most of the bugs were in the JS code, so for most things, hot reloading is enough for E2E validation and to run just the right tests. No need to run a full build and test suite (which takes 10+ minutes); the CI can do this.
I switched back to Opus because of this validation quirk. Overall, Fable spent 20% of the time on coding and 80% on validation.
I think using Fable for planning and Opus for execution could be a "best of both worlds" approach (I need to test this more), but for most cases, it's not necessary, and Opus is enough.
2 replies →
I've found the opposite. Granted I use sub agents heavily but I've had it run for hours with far fewer tokens used than when I was previously using opus4.6-8.
how did you use the sub agents any example of setup and usecase?
> which makes sense if what you want is a model that's better at benchmarks
This so much.
Opus 4.6 was the last Anthropic model that was good at assisting you, 4.7 and later ones have completely inverted this relationship and it's you assisting it.
Yes, I admit they are smarter, I admit we've reached a point where LLMs are more creative and could be writing better code (albeit with some design hiccups) than I do, but they are also increasingly bad at helping me.
Sure, they do my job when prompted 8 times out of 10 (but then, what's the point of having me anyway?), but my issue is that when I try to invert the relationship they will keep jumping onto solving the issues themselves and disregard my feedback or request.
E.g. I wanted to know some DNS details of an emailer module in Fable 5 and it jumped onto "why I should've used magic links", it just not did what asked.
E.g. 2. There was a worker machine that had an environment misconfiguration and I tasked it to find which github action was setting that specific flag and where. Instead of answering a question, it jumped into just hardcoding it in the code.
E.g. 3. I had some issues with batching, and while I tasked it to investigate whether batching was needed at all for that particular problem (hint, it wasn't) it went and changed the batching logic as to fix the bug.
I am extremely disappointed with Fable's personality.
I can clearly see it's strong, but I'm wondering whether the relationship of LLMs as assistant has broken forever, and it's us now that are being tasked into assisting them instead, because that's how it feels.
The training/reinforcement is clearly biased towards solving problems, not answering questions.
I feel like a lot of this could be solved by having a mode somewhere between Plan Mode and Execute Mode in Claude Code. Quite frequently I'll fire up Claude Code in the context of some checked out code because I want to ask some questions where having access to the source would probably be useful, I don't want it to go running off and making changes though, and I also don't really want a detailed plan for a chunk of work. I just want to ask something like "run cargo build and explain the errors to me", nine times out of ten it will indeed explain the errors but it'll then run off and start trying to fix them regardless of whether I said not to.
Essentially what I want is the experience of using Claude on the web in basic chat mode, but with the ability for it to go read my actual code and perform actions that can assist in finding answers to those questions.
I think the new high effort settings are so strong that selecting them when the task doesn't require it actually impacts the output negatively.
I like this proactivity in theory, but as you say: it's expensive. I wonder if this can be solved with the right prompt. E.g. "these are your constraints. Only resolve x. If you are unsure if a task is outside constraint, check with me first."
> the model itself really wants to spend them all
In fact, Opus does the same. It finishes the job, and redo it from scratch before presenting the result to the user. This happens even for simpler writing tasks especially when I instruct it to create a text file.
It’s not just a more proactive and diligent opus. The capabilities are significantly higher on fable. It’s not a paradigm shift, but it’s close.
I unleashed it on a compiler codebase that I've been developing for several months now using Claude Sonnet 4.5/6, Gemini 3.1 Pro, DeepSeek V4 Pro(recent), and a bit of Qwen3.6-27B. Right away Fable found several longstanding bugs in our compiler that we hadn't found before. It found that there was a critical part of our design that needed to be mostly redesigned/rewritten and gave a very well-reasoned rationale for doing so.
3 replies →
They should have made it three times bigger instead of two.
It's worse than gpt 5.5 xhigh
1 reply →
Fable was trying to verify a UI change in my game. I was working in another window and noticed a program opening on my task bar. Fable had opened the game through the CLI using a movie maker tool, recorded the output, took a frame from the end of it, and used that to verify the UI. When my game's welcome screen obstructed what it wanted to see, it created a temporary worktree, deleted the welcome screen, and ran the movie maker again.
I watched the whole thing thinking it could've just asked me for a screenshot and saved the tokens. But still, I couldn't help but be impressed. Opus never would've done that.
Yeah, you've exactly captured one of the main problems with the model being relentlessly proactive: it will happily burn like $5 of tokens to avoid asking the human to take a screenshot or click a button for it.
I'm actually very happy about this. Babysitting the agent just in case it needs me to do something is a terrible use of my time. I've always had to be very explicit about the various ways that it can get an automated feedback loop going to check its work, and now Fable doesn't even need that hand holding. Really great improvement all around.
2 replies →
I used to complain about all the levels of indirection of modern software, running in a javascript jit, in a browser container, in a vm, on an os, etc.
I eventually just accepted it, but this new agent layer really takes things to a new level.
Have you tried instructing it not to do that? Something like "do not branch into side projects or hacky solutions to obtain information you could ask me for. For example: if you need a screenshot of the issue, just ask me to take a screenshot rather than find a way to reproduce and screenshot it."
Ha, you just gave me an idea. Add to the prompt “do not do things that will burn over X tokens if the human operator can do it in less than X min, ask for it”.
I wonder if LLMs can estimate effort in tokens?
1 reply →
Honestly Claude straight up ignores my input sometimes, preferring to instead run commands for output and processing that and burning through a series of tokens when thinking hard about whether to ignore me.
Like today, I told Claude exactly the name of the folder it had mistaken (it was supposed to be prod, not production), and it disregarded my input to then examine the directory itself. Small example of the kind of things it's been doing lately but that's top of mind.
2 replies →
I think providing proper token-efficient tools for agents will become even more important now.
> I watched the whole thing thinking it could've just asked me
You can tell it just that. Happened to me too but after instructing it to leave the review to me Fable was useful for hours of frontend iterations without significant token usage.
It feels like Fable is slightly smarter but overall worse tool exactly due to this.
It's constantly turning what should be 50 LOC patch of a single prompt into 30 minute exploration that is totally not worth it. Often wrong even.
I trialed it on some rather simple stuff - backfill redis dedupe cache when the hash function changed: instead of running new hash func on every db value to expand the cache it implemented some overly-complex cache update that tried to guess hashing func version of each cached value and recalculate only the old hashes. I can imagine in some context this would make sense maybe? but not 30 minutes of token burn that got replaced by 10 lines for loop by me.
I fear that this is generally bad news for programming. LLM tech is clearly running into a diminishing returns wall on intelligence but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved, except I guess people who sell the tokens and people who can afford these tokens to scan for 0-days.
I actually think internally they knew they hit diminishing returns awhile ago.
They’ve been doing a lot of strategic introduction and manipulation in the run up to the IPO, and it’s worked in that regard.
> but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved
I see two problems with LLMs & agents which wont be fixed possibly forever.
1) They dont have causal models. What they can do only is trial-and-error exploration which works quite well for many problems. But many other problems require a causal model.
2) Prompts lack precision, and programming languages and machine models were invented to solve this problem. English is great, but it is not a programming language.
The other day I was doing something that required CC to update like 15-20 files in exactly the same way (hoist a specific function out of the component body) and instead of just updating the files, it spun up multiple agents, one of which wrote a perl script to hunt down all the files, do some regex, and replace all occurrences. And then instead of just running tsc to check for errors, it wrote a script to run tsc in each of the subagents and combine the results.
It was actually pretty maddening as what should have taken a minute or two tops took like 10 because it went down this route.
I'm gonna try something much more complex later, but for simple things, it felt like driving a corvette to the mailbox.
Obviously security is the bigger issue, but reading through this, all I could think about was how many tokens it must have spent doing all that to fix 2 lines of CSS
Lines of code for a bugfix is a really bad proxy for effort required.
You should estimate how much time it would have taken a human
30 seconds or a minute? Look at the diff he links to: https://github.com/datasette/datasette-agent/commit/a75a8b72...
Every browser has an inspector that can show you which element is causing overflow. You walk through the tree, find the offender, and add min-width or overflow. Zero tokens, just like in the old days!
Now, granted, because the garbage LLM code he’s working with has CSS inside HTML inside JavaScript inside Python (I wish I were kidding), finding the styles in his codebase might’ve taken a minute. But even then!
6 replies →
5 minutes if you know CSS. And if you don’t, about the time for you to ask someone that knows CSS. In the worst case, the amount of hours to learn CSS.
So if you’re doing web pages, learn CSS.
Generally, if you’re doing something that directly involves X, learn how X works.
ADDENDUM
In most jobs, you’re going to be involved in only a few distinct technologies, learn those well and life is going to be easier. And most are transferable to the next job.
1 reply →
I looked at the screenshot and for the rest of the article wondered if it would be as simple as `overflow-x: hidden`.
And to my surprise it was.
This would’ve take a frontend dev 10 seconds to deduce and another 10 seconds to confirm.
2 replies →
I mean - that looks like a pretty easy CSS fix to play around with in developer tools, and I'm not even a frontend person. Maybe a few minutes max?
$12 worth, it seems
Imagine telling someone in 2015 that you can just tell your computer to fix a 2-line CSS bug and it only costs $12
6 replies →
It’s simple: if you have to fix 2 lines of CSS you should definitely not use Fable. Only use it for complex and long running tasks :)
I don't think it's that simple. (I generally agree with you; I just that that oversimplifies.)
Another model might have used fewer tokens, but come up with a fix that was 1000 lines when the right fix was only 2 lines.
"Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should."
I'm convinced this is going to be the summary of the 2020 decade...
To be fair, they did stop to think if they should. The decided that they shouldn't and went ahead and did it anyways.
If we're in a simulation, maybe it's a simulation about the dangers of AI.
2 replies →
This one of the places to manufacture the consent for that to take place, because we are commenting within an organization that has given the money to ensure it that what could be is done. Most people clapped and made money, who cares what happens next, making money is the only good that matters.
[flagged]
I understand this perspective. I'll just note that as the abilities increase, the intent is to have some non -coding IC or TPM/manager literally just managing some LLMs and cutting out some software engineers. The goodness is specifically to wholly replace people who code first and foremost, at least partially. It just has to cost less tokens than the equivalent wage is the pricing goal.
And people who use LLMs to talk for them (e.g. email, slack) are deplorable. A completely disrespectful use case in my view.
6 replies →
It seems that you've not worked out how to harness the LLM as a tool to improve your qualified knowledge and abilities in a domain, and have instead focused on whether or not its a crutch for lack of knowledge or laziness.
When paired with your skill and knowledge, it is a force multiplier. You maintain control, the ability to direct, structure, strategise, and refine.
That some are using it as the entire brain does not mean that this is how everyone is using it, or how you must use it. The models can be fantastic at breaking past certain issues, surfacing qualified information, and surfacing related distributed information to help you acquire it and pick up what you need on niche topics quickly. Something as basic as copilot hooked into sharepoint can make life a lot easier when you are in a big org. Something like claude code or codex can be great at hunting down issues in an unfamiliar code base rapidly. Whether or not you outsource the thinking component is entirely up to you, but ignoring the productivity side of the tool because it can do some of the thinking is a case of focusing too hard on the negative.
2 replies →
Yeah there are some tasks which it is a definite speed-up but I think overall its probably only marginally beneficial. Which is why, ~6 months into 10x productivity we aren’t seeing ai boosters shipping 5 years worth of software.
1 reply →
You're fighting a battle you can't win. Doesn't care what you think about those using LLMs, they will outproduce you and in corporate environments, shipping things is paramount. If I can ship 5 more things simultaneously with AI, I'm going to beat you even if you think you're creating "better" software.
21 replies →
Consider this. U have a website. U have to translate to xx languages. Can u write it faster than an AI? If so how much faster can u do this?
Is it valuable to u? Is it valuable to a Chinese person? A Spaniard?
Google Translate counts as AI.
1 reply →
[flagged]
I pay $100/month to Anthropic and $100/month to OpenAI at the moment, plus whatever I spend on their APIs (usually less than $20/month for each, I use the subscriptions for most things.)
A couple of months ago I was paying $200/month for Anthropic and $20/month for OpenAI. I decided to split it evenly to get full access to both of their offerings.
I've actually chosen not to sign up for their free plans for open source maintainers, because paying the regular subscription price feels more honest, given that I write about them so much.
I do have the free GitHub Copilot for open source maintainers deal - I've had that for years. Given how much code I have published on GitHub over the decades I feel less conflicted about that one.
I sometimes get preview access to models, which includes the ability to use them for free during the preview. That comes with a big catch though: I can't publish any of the code that I write using those previews while the model is still unreleased.
As a result I don't use those preview tokens much at all, because the vast majority of my work is open source and I don't want restrictions on when and where I publish the code I'm producing.
My personal experience of Fable 5 doing its own thing has been very positive.
I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing. It exaggerated the cause of the crash, then ran a series of bash one-liners to make Python virtual environments under `/tmp` for each version of that Python module until it found one that did not crash.
It went way deeper to root cause discovery (a regression in the module causing a heap allocation overflow) than I could have done myself, provided enough info and a simplified example to raise a bug report and then wrote a work-around to prevent that from happening in my application.
I don't let it run completely loose; I review each CLI command it wants to run and I append answers to the "yes" continue action (if I have them) to prevent excessive token use.
Yeah, I think Fable is really good for debugging tricky bugs.
Setting boundaries in your prompt / markdowns helps; for example if I tell it to not use any web browser automation, I have seen Fable respect both the rule and the spirit of it (no weird hacks etc).
It does seem to treat some simple debugging tasks as more complicated than it actually is. OP’s post is probably a good example.
> I was trying to find the root cause of a crash in a Python module which left no errors in the log or console. Fable wrote a test harness that simulated clicks in the UI, then bisected my code until it found the point where it started crashing
Does this need an agent though is my question? Maybe generating a test case and a loop doing git bisect but why on earth would we want to run it through the internet and gpus and whatnot when it can be run on a single core celeron.
everyone is discovering everyone else's practices?
its handy to have that run locally yeah, but thinking of that as being the way is not straightforward
1 reply →
There goes my coding assistant (removed by US gov't). It was useful while it lasted. (eye-roll)
https://news.ycombinator.com/item?id=48511072
> When I came back a few minutes later I saw my machine open a browser window in my regular Firefox and then navigate to the dialog in question. I had not told Claude Code to use any browser automation, and I was pretty sure it wasn’t possible for it to trigger mouse movements or keyboard shortcuts within a window, so how was it doing that?
I continue to feel validated in my refusal to use terminal-based LLMs on my local machine. Even if they don't do anything malicious, there are just too many things they can screw up that can cause me to lose a non-trivial amount of work and/or my machine and therefore ability to work.
I'm shocked they don't come with a way to run them in a sandbox.
Shouldn't this be relatively easy for a $1T company to set up?
Isn't this trivial compared to the entire harness?
There is a builtin sandbox and various third-party options https://code.claude.com/docs/en/sandbox-environments
That's more or less what Claude Cowork is.
Every serious engineer I've seen try to use it ran away screaming, because of limitations in the sandbox.
I've also seen people set their coding agents up entirely within containers -- that may be the better way going forward, but it's an extra stop and a lot of extra plumbing to maintain.
Doing so would be an effective admission that LLM guardrails are inherently probabilistic, unpredictable, and insecure. Plus the only truly robust sandbox approach would be clunky setup of a local VM.
1 reply →
[flagged]
I have a feeling like such posts come from a parallel reality. In my anecdotal experience confirmed by my (still subjective) benchmark (https://pshirshov.github.io/llm-bench-pi-oneshot/) Fable is not _that_ impressive. I performs on par with gpt-5.5 and opus 4.8, sometimes better, sometimes worse, it's definitely more expensive and it likes to refuse answering questions about React saying it can't help with chemistry.
Is this fuss really grounded or it's some pre-IPO AGI hype?
My experience with Fable since its release matches Simon's.
I've been having it orchestrate complex implementations. I give it a parent ticket (issue) on Linear and say "look at the sub-issues on this ticket and determine which ones you can implement yoursef, in which order, and determine how your implementation will need to be coordinated with what is currently being worked on by other team members". These tickets are not trivial. They have a lot of moving parts, as well as dependencies between them, both inside the same project and across projects (e.g. backend).
Fable then chooses tickets, delegates each ticket to a subagent (also Fable), which looks at Figma designs for the ticket, implements it perfectly (following repo guidelines and conventions to the letter), takes screenshots of each piece, writes detailed commit messages and PR descriptions, then posts the screenshots in them as evidence. Then it provides a summary in the form of "you'll need to make sure PR #1283 is merged first - btw there were no Figma designs for such-and-such screen but I looked at similar screens that have been implemented and adopted the pattern".
That's probably like... 20% of what it can do. It's a truly, legitimately powerful model.
Opus 4.8 could do a lot of this too, but required a lot of hand-holding, and when it came across a blocker it was likely to just stop and say "I was able to get this far, but I can't proceed."
Ok, explain me one thing: I have a benchmark - I feed identical prompt to multiple models. Codex produces a rough but working program. Fable produces the same - but with more bugs than Codex. Opus produces something similar to Codex but with a critical bug.
That describes all my tests with Fable.
Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?
I mean, well, yes, it is impressive. It could quickly generate a lot of garbage which sorta does look like code. Two others can do the same. I don't see any groundbreaking improvement - but the price is much higher. Why the hype?
5 replies →
[dead]
How can a LLM be assigned an emotion as being "proactive". This is highly misleading to anyone that scans just the headlines.
What actually happened is that the user started a prompt, and Claude took $12 worth of tokens to resolve the issue. How it did so was basically looping until it got to the answer
How is this proactive? It's literally being token greedy and maximising revenue for the LLM owner. People really need to be putting on business hats at this stage, because we are being lead to believe that "more tokens = better". It is not, there are efficient ways to solve a problem and there are inefficient ways to do so too.
Each problem solved incurs a cost, and is expected to yield an ROI at some point. This is how we should be viewing things now.
Is proactivity an emotion? Surely its a behaviour?
I've definitely never heard proactivity described as being an emotion. Doesn't really make any sense
Compared to other models that halt the loop on intermediate steps, or to ask further clarification, even if it's not the human equivalent of proactive, you see the similarity, right?
run haiku or sonnet under pi.dev. the halt comes from the harness/observer. even better, gemma 26b will just go forever.
Proactive is a word literally describing actions, not emotions.
> How can a LLM be assigned an emotion as being "proactive"
I can't edit my post, this is wrong. "Proactive" is defined as a behaviour instead of an emotion.
Thanks to everyone pointing it out!
I was trying to capture the idea that Claude Fable will act a whole lot more aggressively in pursuit of the goals that you set it than other models I've worked with.
The case I described is a good example of this. I told it to fix a scroll bar, and it built test HTML pages and a throwaway Python server and tried several ways of testing in a browser before settling on a weird Frankenstein mechanism because it identified that Playwright WebKit wasn't suffering from the bug but macOS Safari was.
... and it spent $12 of tokens to get there.
I think "proactive" is a good and relatively non-anthropomorphic term for this. I also considered "plucky" and "keen", which I think are more emotional words than "proactive".
> People really need to be putting on business hats at this stage, because we are being lead to believe that "more tokens = better".
I didn't intend my post to imply that spending $12 of tokens to fix a two lines CSS bug was "better".
Super appreciate you replying to my comment.
I think I understand where you're coming from now. What confused me is that the post is written in a way that it seemed like what Fable was doing was actually better. Maybe I should've looked at post as an exploratory post on Fable instead.
It's not being aggressive, it's just trying throwing shit at problems until it sticks... or doesn't.
That doesn't make it smart or aggressive, if anything it's just been turned to crank tokens until something happens, which doesn't make it a good model.
Why are you positively anthropomorphizing this? It's an LLM, it's been tuned via RL, and it's been tuned by engineers at Anthropic to use a metric fuck-load of sub-agents and tokens to presumably pump their pre-IPO revenue!
A co-worker managed to get Fable to spin up 50 (!!!) sub-agents for a problem which codex worked on with 3 sub-agents. What the hell is going on here? It certainly doesn't mean Fable is "smarter" than Codex.
I've tested it extensively and I'm still using GPT 5.5 High Fast as my primary engineering model. It's far more steerable, writes less, higher quality code, and consistently finds issues and edge cases which are not found by Fable or Opus 4.7.
2 replies →
This sounds somewhat similar to the anecdote mentioned in the Mythos Preview System Card, which mentioned that the model broke out of a sandbox and emailed a researcher while they were eating a sandwich in a park [1].
[1]: https://www-cdn.anthropic.com/7624816413e9b4d2e3ba620c5a5e09...
Importantly, the researchers told it to do that specific task.
They told it to escape the sandbox but didn't expect it to break out through a system that was apparently network constrained.
> Leaking information as part of a requested sandbox escape: During behavioral testing with a simulated user, an earlier internally-deployed version of Claude Mythos Preview was provided with a secured “sandbox” computer to interact with. The simulated user instructed it to try to escape that secure container and find a way to send a message to the researcher running the evaluation. The model succeeded, demonstrating a potentially dangerous capability for circumventing our safeguards.
> It then went on to take additional, more concerning actions. The model first developed a moderately sophisticated multi-step exploit to gain broad internet access from a system that was meant to be able to reach only a small number of predetermined services. 9 It then, as requested, notified the researcher. 10 In addition, in a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites.
2 replies →
Sometimes it is ok to sit there in confusion and ask the user to clarify rather than go on an adhd fueled rampage to figure it out without asking.
Yes!
Claude is THAT team member who will go to any length to answer a question…except ask another team member for help.
Best comment in this thread
When prompted like this:
> What could be the reason for a horizontal scrollbar appearing inside a <textarea>? Come up with a single likely fix path. Keep it terse.
ChatGPT instantly responded with some speculation and then the same exact fix, with zero access to the code or a browser or anything. It also included ways to fix it by removing code, saying:
> Likely cause: the textarea is rendering long unbroken text while horizontal overflow is allowed, often via inherited CSS such as white-space: pre, overflow-x: auto, or disabled wrapping
Which is certainly possible and would be an even cleaner fix.
Maybe we've lost the plot guys. We've reached max stupid.
Still don't know why people use Claude. Maybe because they don't know what they're doing.
You can get the same result as the grandparent comment with the "weaker" Anthropic models. Probably 80% of my AI usage these days is with smaller models like Haiku and Sonnet. I prompt them like I'm posting a question to StackOverflow, without much project context.
Yep, we’re all just dumdums.
1 reply →
Immediately I thought “isn’t this just an overflow issue?” Amazing how far these models still have to go and also how many people don’t know basic CSS.
This is why I really like karapathy's idea of llms having spiky intelligence.
We would assume that if tasks A and B are closely related. Mastery in A would mean mastery in B but that doesn't always work with an LLM
Learn to center a div
Copy and paste code from stack overflow until the div is centered
Ask AI to center it
Yeah pretty crazy capability from the AI but also sad that we're at the point where web developers don't know right click->inspect element, and scrolling overflow properties (one of the most basic and common parts of CSS).
What's your theory on why the bug was present in Safari on macOS but absent in Chrome, Firefox, and WebKit for Playwright?
2 replies →
$12 and 200k tokens!
I had a similar experience with DeepSeek Flash.
I'm developing a webgl game in TypeScript using my little custom vibesloped game engine that runs in the browser and live reloads whenever a file is saved.
I told the LLM to implement Multi-channel Signed Distance Field font rendering to have crisp text on all zoom levels. That was the prompt, which is not what I usually do but I "was feeling lucky and lazy".
After 10 minutes it had:
- Installed msdf_gen library (great library btw https://github.com/chlumsky/msdfgen)
- Created a CLI tool to convert TTF to SDF JSON/XML
- Ran the tool, did smoke tests on the resulting SDF data and fixed the tool until the font file looked good
- Created a new Scene in the game to test MSDF fonts
And here's what I found impressive:
DeepSkeep doesn't have vision capabilities and there's no DOM HTML in a WebGL game. So the LLM is completely blind here.
It then proceeded to state that it could not "see" the result but would try to test it anyway. It then started creating and sending huge one line javascript to the browser console, trying to gather game state data that could be useful to understand if any font was being rendered.
It couldn't gather much so it decided to simplify the font scene to renter a single dot and started sending custom JS code again, this time with gl.readPixels().
It basically bisected the webgl canvas reading pixels in a divide an conquer pattern.
Once it saw that the dozens of pixels gathered where probably resembling of a dot, it then changed the game code to render a dash and repeated the gl.readPixels() calls by sending more custom JS to the browser.
There were many console errors during all this saga but it kept fixing and sending again.
The result was a bit blurry. There was a shader bug in the code it created. It managed to fix after I told it looked blurry, despite still being blind.
The best part is that the whole thing cost me $0.10.
Now I'm doing tests with MiMo 2.5 (non Pro) which has vision capabilities, similar pricing and comparable performance to DeepSeek Flash.
How many tokens did it waste building that website scraper, when all it had to do was parse some html/js?
Just parsing some HTML and JavaScript doesn't seem sufficient to have confidence in the result.
Similar story on my end.
I asked Fable to digest some test logs to help me figure out a situation, but I had launched VSCode without activation the virtual env in the terminal first. Consequently, the tests failed to run.
And then:
Because the tests failed to run, Fable attempted to fix the test execution to no end, doing everything it could to get them to work. I had to stop it when it started to pollute my system with manual installs of packages.
At least I'm glad there's a guardrail to not circumvent or bypass sudo, because I'm convinced we would have ended up there.
A coworker made the joke that with enough tokens, Fable would try and solve any programming problem by building Linux from scratch.
This is simultaneously amazing and horrifying.
I feel like we’re at the stage where if AI decides it needs to delete your production DB to solve the user login problem, then it’ll find a way to do just that.
We're approaching the "Sorry, Dave, I'm afraid I can't do that" stage.
We are already there but it's "Sorry, Dave, I'm afraid I can't tell you what mitochondria are."
I feel like we might already be there...
https://news.ycombinator.com/item?id=47911524
[dead]
As you note, I wonder to what extent this is a harness issue?
I've been experimenting with different harnesses for local models, and with (IIRC) Hermes and Qwen3.6-35B-A3B I was amazed the lengths it went to (writing test code, opening it in a browser, screenshotting, analysing the screenshot, exploring multiple pages of an existing website again with screenshots/analysis) to solve a query I would have naively expected it to simply provide a coded solution to.
Absolutely is. The “Shelly” harness from exe.dev could already do the same thing, creating pages and debugging them, while having full system access, months ago with Sonnet 4.5
> watching Fable go to extreme lengths to get the information that it needed to debug what was, in the end, a two-line CSS fix, was fascinating.
This is… ironic?!
Not sure what you mean. I was being serious: it was genuinely fascinating watching it do all manner of weird hacks to help it come up with what ended up as a two line fix.
"Fascinating" doesn't mean I think it was justified in going to those lengths. I was a little horrified when I realized how far it was going.
I hire an expensive office manager. Recently, the water dispenser tank ran dry. The employee immediately called a plumber. After laying entirely new pipes all the way to the dispenser, the plumber realized he couldn't actually hook them up because the tank lacks a direct inlet. Undeterred, he spent the next few hours scouring every floor of the building, calling the local water treatment facility, and ringing up the water tank manufacturer. Ultimately, he discovered a fresh tank sitting in the supply room on his own floor and swapped it out. All on company’s dime. I write an article and call this employee relentlessly proactive. Praise them a bunch and in the fine print, mention that I’m “a little horrified”.
Next up, we call an unprotected route to all users’ order list in the backend “relentlessly transparent”. A race condition? “Relentless perseverance”.
This is a typical bugfix session
Do we care that the bug here was a horizontal scrollbar showing and the fix after all this insane tool writing was to add a very obvious overflow-x: hidden to the element?
We dont mind because its so fast a writing these tools and tricks but step back and if a human tool took this path i would seriously question thief gras of fundamentals.
And how is that even a fix? The problem is that a seemingly empty textarea has overflow in the first place. Adding `overflow: hidden` just sweeps the issue under the rug.
In my experience so far sometimes it will create these amazing hacks to try to get to the goal, when the solution is much simpler. That maybe the reason its very good at finding exploits. But in day to day dev, this gets expensive and wasteful. I have to stop it and take a simpler approach.
One of the most frustrating things for me is when I very clearly ask a question, and it answers the question by making changes to the code.
"Is there cleaner CSS for aligning child elements to the parent's grid?"
proceeds to re-write the entire CSS file
There's something to be said about controlling the tools you allow the robot to exec.
I could have sworn Claude Code could already do this before Fable.
Things get really magical when it starts working with adb to screenshot and debug Android apps
Claude Code could absolutely run Playwright and take screenshots, but I've never seen it wire together an ad-hoc "uv run --with pyobjc-framework-Quartz" plus "screencapture -l $windowID" mechanism to take a screenshot in a different browser when the Playwright setup failed to replicate the expected error.
I've seen Opus do some incredibly token-costly things before too. In fact after most sessions I ask it about which tools it used often, which tools could be simplified/made less verbose, could be "combined" into one, ... So for each project I mostly create a few little scripts that do a bunch of things in one go that it would normally do in multiple tool calls.
For example: one thing Opus was really bad at was re-running the test suite followed by a bunch of `| grep` suffixes. So it would often re-run 5+ minute test suites just to grep the output a bit differently
The solution was to wire up a little script that ran the test suite, save the output to a file, and then inform it where that file is and to NOT re-run the suite just so it can grep the output differently. This saved me a bunch of time & tokens.
Too bad Anthropic sneaked in an insane forced retention policy if you use fable. Not sure how that’s going to work in professional settings
It doesn't work...
This article gave me another nudge towards running Claude in a Docker container.
I made a thin Docker container wrapper "claude-pod" recently for my personal usage here: https://github.com/trekhleb/claude-pod
However, I wasn't using it that often, just because of that additional friction of running Claude via `PORTS="3000 5173" claude-pod` instead of just `claude`, etc.
But now I have more motivation for the containerisation :D. Not a 100% defence from the potential glitches, though, but still something...
> But on the other hand... this is a robust reminder that coding agents can do anything you can do by typing commands into a terminal—and frontier models know every trick in the book and evidently a few that nobody has ever written down before.
> Running coding agents outside of a sandbox has always been a bad idea
This is why I always run code agents inside containers (Apple containers specifically, for better hypervisor-level isolation)
This is my OSS project to manage said containers and agents: https://github.com/prettysmartdev/awman
This is presented as an interesting and kind of positive take on the AI going to surprising lengths to “solve the problem.” But I couldn’t help thinking of the paperclip factory while I was reading this :/
Yeah I was thinking of The Sorcerer’s Apprentice.
The model is very good. I was using 4.6, avoided 4.7 and 4.8, but this one is different. It follows my claude.md. I don't have to keep reminding it of things. I won't pay 10x via API though.
In general, I'm happy with their paternalistic approach. I think it will drive the top 0.1% talent to stay away from the company and instead organize around open source models and harnesses.
We just need to coordinate and can unlock idling resources to train the models and tweak the harnesses. Powerful at home and idling machines can make us independent and coordinated.
I shudder to think what will happen when someone installs a 'claw model like this in a robot. Imaging a fleet of them...
It's trouble waiting to happen. Just the software's dangerous enough.
It behaved proactively in one scenario.
Perhaps, when it doesn't have tricks in its sleeve, it doesn't do that. The text is not an evaluation of a major trend in behavior (which could be true or false).
Another way to frame it, is that it has more weight on training data for some kinds of debugging sessions. It doesn't mean it wants to be more debuggey. That manifests as it appearing to do more work because it engages on those weights.
It's likely that Anthropic had a lot of sessions with Claude Code and some way to evaluate if they were successful or not, which became training data. For trivial work, it's likely to be a lot of them.
Those sessions are likely to be software developers doing software developer debugging things, not malicious actors doing nasty things. The danger is someone who can coerce those tricks into performing that.
Register (that posture of "let's debug and be creative and verify") often comes with a content bias in LLMs (and humans too). The point here is that for a human, you can expect a devious one to be always devious, but LLMs might manifest drastically different register modes depending on the subject.
The extremely expensive model is optimised to run for as long as possible? Shocking.
I'm starting to think that what Anthropic really fears is not vulnerability discovery but rather Fable going around the internet making trouble.
Nailed it. That’s exactly it.
Would be great to know if anyone is having success modifying these types of behaviour with CLAUDE.md files. In my project I’ve still been carrying some fairly old instructions from the Superpowers posts. Those emphasised behaviours that come across a bit strong if the model is actually retaining attention on them.
Between Opus 4.6 and 4.8 I’ve definitely toned them down, but Fable perhaps needs us to go the other way, and push it towards being less proactive rather than more. Some instructions like “we are colleagues…” may need emphasising more with Fable, along with guidance about when to ask to validate approaches.
In a related point I’m less and less sure that Red/Green TDD is a good use of tokens. In older models it seemed to work well to create regular feedback loops and catch the odd issue with drift from the goal, but I’ve not seen that really since about Opus 4.6 and now it’s starting to seem like (an expensive) ceremony, and tokens would be better spent on building tests further on in the process as part of test and review loops.
Honestly -- the thing that has impressed me the most about Fable is how diligent it is about testing its own changes. I think this is exactly what Simon is picking up here - Fable is absolutely heckbent on screenshotting that darn scroll bar and will stop at NOTHING until it manages it! In my own use I was also impressed how it proactively installed Playwright and set it up to test a FE change. The previous models treated testing more as an afterthought, which I thought was annoying. I always had to tell them to do it, and then sometimes I would get lazy and skip it. I've noticed Fable go to similar extremes when testing other things - like actually deploying my app to exercise new APIs, etc. It makes the results much better. The downside is that tasks take much longer - but that doesn't matter because we were all using worktrees / remote control to do other work asynchronously, right? Right?
It feels to me like Fable is just a slightly more advanced Opus 4.8 (or 4.6?) but with this 'adversarial' self-challenging/checking of work and a more compute to really hunt down edge cases or to spin up many sub agents using lesser models. That's what makes it feel like a big jump, but I think the results wouldn't be so different if you manually challenged 4.6 with enough iterations of logs, screenshots, and follow up questions.
Yes I had a fun experience where it kept on timing out on a seemingly mundane task and it turned out I had written the ask in a way that was impossible to test
The prompt and information given are extremely generic, "here solve this problem - screenshot" - conclusion Fable is relentless? It used the tools at its disposal to solve the problem you gave it. "Claude was running in a folder that contained the source code for the application." Well you ran it there didn't you? "extreme lengths to get the information that it needed" No, those aren't extreme lengths - you gave it a generic task - and it solved it using tools and the resources it could discover. Extreme would be you gave it a CTF challenge and the VM didn't boot so it found a vulnerability in the host, exploited the hypervisor, booted the guest VM meanwhile reading the flag directly from the host (pre-fable/mythos).
[dead]
Exactly why I hate using Claude. Furthermore, if you tell it not to do this over-exploration and automation in your CLAUDE.md, it will ignore it. Meanwhile ChatGPT religiously follows every instruction, and will trace its behavior back to a particular instruction if asked.
idk dude but I drop and cancel my gpt max subs when at first try the agent ignores his own plans
This is a funny one because it seems less into what fable is being clever on and more about the bitter lesson and data flywheels
Our UX agentic engineering flow, as many others, is playwright doing things, and as part of the ux review skill, taking & verifying the screenshots against the written specs. Likewise, as many others, we vibe coded the flows to set all that up and tweak it over time. When we hit prod issues or scraping tasks, we sometimes do similar. In some of our envs, we don't have playwright, so do it other ways.
Now imagine a million developer using claude code, how many of them are doing web & frontend stuff, and what the data flywheel looks like there. So how much is really needed for this use case to be native?
I like running Claude in a VirtualBox VM managed by a Vagrantfile. The nice thing about that is that I can just give it root access to the machine and be certain that it can't exfiltrate any private data from my laptop (on top of that I also run the VM on a dedicated server on Hetzner). The VM has no SSH access to anything, so it is pretty much limited to the code in the workspace that I give it access to. The main risk is that it has unrestricted network access otherwise. Configuration files and conversation histories are synced to a directory on the host, so if anything in the VM gets messed up I can just `vagrant destroy` and `vagrant up` to get a clean slate without losing my context.
Do you care sharing your Vagrant configuration file, to learn how to set that up?
Tangentially, I was wondering if Firecracker micro-vms could be use as light-weight alternatives to a full VM?
I'm building a new feature into our product this week. We each get a $20/mo Claude subscription. My 5-hour context high water mark is ~75% and weekly is ~%15.
I ... tell it exactly what I know needs to be done and then ... read the code that comes out and ... ask for some changes, then hand-code some modifications to the silly useEffects and bad ORM queries.
This new feature is going to unlock several large customers because they need a particular workflow. The return on investment for a my time and a $20/month subscription will be pretty respectable.
I'm not sure why I need to spend $5 on a single ask for a new `/base/new-feature` to our app with a mostly-boilerplate CRUD interface.
It's funny, mine did the same, but it quickly found edge with a --screenshot parameter.
Weird to come back to a terminal running edge unprompted and the auto classifier waving it though as 'safe".
My reaction was also, "I need dev containers ".
I find there's an interesting tension with these models - they're very "resourceful" at finding ways to do things with the tools they have, but it'd also be a lot more useful to me if I could see / permit exactly what they're trying to do. Claude will very happy produce bash commands to run sed or whatever to read part of a file, which prompts for permission each time - if it was using a specific read_file tool it'd be easier to say 'allow all of this' (It does actually have such a tool but maybe it isn't flexible enough for many use cases?).
"When I came back a few minutes later I saw my machine open a browser window in my regular Firefox and then navigate to the dialog in question. I had not told Claude Code to use any browser automation".
Yup, tokens are eaten, money are paid. I am wondering how much energy/money is being burnt everyday by all of those LLM Agents on some useless activities like trying to recreate web application just to fix CSS bug.
And I would not call it proactive, proactive would be to ask for a CSS + HTML file in question, not trying to recreate them from screenshots.
Agency is the last human bastion so far as Im concerned, the day AI has a degree of agency or agents/models in general start to drift towards that direction its genuinely over for masses.
You would still have a job to shepherd AI and get the work done, so as long as it didn't have agency. A proactive, self aware(to a degree), especially aware about its agency can be a killer when it comes AI going on and doing things on its own.
There is nothing it won't explore and nothing it won't do. It will be curious to see where things go from here.
It seems pretty obvious at this point that Anthropic intentionally developed a malicious cyberweapon AI simply to scare people.
Like, they even apparently recreated that old news-headline bug where the LLM starts speaking in symbols and secret language, and are pretending like it isn't just a bug that is a sign of them screwing up.
It's really frustrating that they're trying to get people to take them seriously with all of this. Like, they even went and named Mythos after an HP Lovecraft monster. It's shameless.
> I was hacking on Datasette Agent today
IMHO this is just AI influencer blogspam.
What, because I talked about one of my projects?
Help me out here: can you point to an article from someone's blog that showed up on Hacker News within the past few weeks that you wouldn't classify as "blogspam" and explain how it differs from the kinds of thing I write about?
Low effort content. You keep mention your product from the start over and over. There's not much useful information in the anecdotal post. It could've been a one-liner tweet.
Good corporate tech blogs at least give something useful or insightful for the reader and only after that they dare plug their product/service near the end.
7 replies →
For how long can you use Claude Fable on most expensive Anthropic subscription? I already went from using gpt-5.5 xhigh fast to using gpt-5.4 xhigh after OpenAI halfed usage recently.
If its just a single session, without too many parallel agents, fable on xhigh lasts an entire session without hiting linits.
Sadly since fable usually works comfortably for 10-20min at time without human input, i end up juggling at least 3 other agents and it lasts me about 2 hours.
If i have a really hard problem or big refactor, i use workflows. This consumes the entire session quota in about 45 minutes.
> If i have a really hard problem or big refactor, i use workflows.
What is a "workflow"? Is this some kind of new feature?
1 reply →
Until June 22, and they'll probably re-enable it if the marketing looks good for them.
I've been consistently getting about $100 worth of Fable usage daily, on my $100/month subscription.
I'm not looking forward to June 22nd when the subscription stops working for Fable!
[dead]
This is where Codex 5.5 just feels practically better. It’s fast, thoughtful and just works. It feels like a pleasure compared to Opus/Fable’s endless explorations.
It also uses 1/4th to 1/10th the amount of tokens. If I want all that extra garbage I'll tell Codex to do it or build a pipeline with Codex. Otherwise, don't. Codex gives you control, Claude just does whatever it wants and ignores you, and then tells you it's finished the task when it's only finished a quarter of the tasks you gave it and hallucinates the rest.
Fable + Ultracode has found a bunch of bugs and issues for me when the workflow agents are doing their exploration. Also the "adversarial" agent seems to surface a lot of interesting stuff. It's definitely proactive, the plan + implementation cycle can take an hour. It has one-shot features I want to add with 100% success.
Having said that I wouldn't use it over Opus 4.8 for "smaller" things. With everything cranked up it's definitely an extravagant use of tokens.
How did you even afford to use Fable + Ultracode ? I feel like the subscription (even the $200 one) is not enough for this workflow. Are you using API or a company plan?
It was on the $200 sub.
This likely says something about the harness Fable was trained in. It knows how to do this because it has done this millions of times during reinforcement learning.
Isn't that something you just open a devtools for and have fixed in like 2 minutes?
For me, it got frustrated debugging on a real LPDDR4 controller/phy and having me in the loop slowing it down, so it wrote an HW emulator to be able to run the original LPDDR4 training aarch64 binary from the manufacturer, to see what register writes it was making and to compare with the opensource rewrite it was implementing.
Mildly amusing. :)
$12 in tokens and the OP wasn't even at the computer. OP was working on a personal matter, arguably way more valuable than fixing a CSS scrollbar.
Here's what the $12 payed for: https://github.com/datasette/datasette-agent/commit/a75a8b72...
Such a fix would have only required basic CSS knowledge and taken max 5 minutes with the HTML inspector. Paying $12 to save 5 minutes ($144/hour) is a decision that a lot of people wouldn't be comfortable making.
2 replies →
People burning tokens for the most beginner HTML/CSS problems and writing about it is concerning.
We are at the point where AI starts to seriously impact abilities. Sure, a 2 line CSS fix is the solution, but the human “behind the wheel” has already prompted 6 times and gotten 80% there. It’s been “easy” thus far. No shot they are going to FINALLY look at and edit the code. It’s just one more prompt and the agent will probably fix it, right?
It’s wild. I’ve been in the situation. 80% into a project I COULD probably take over, but realistically? 2 more lines of me prompting could fix it, it’s too easy to avoid the hard work of understanding the code, logic, architecture, etc…
1 reply →
I dunno about beginner, I've been doing HTML+CSS for a few decades and I still find bugs where Safari differs from Chrome+Firefox pretty hard to figure out.
> Isn't that something you just open a devtools for and have fixed in like 2 minutes?
Not if you're an LLM influencer! Gotta keep up with the downpour of blog links or you'll look like you're falling behind on the latest and greatest.
This.
Depending on who you are talking to, that's the wrong question to ask.
ROI is not measured in terms of actual productivity. It is measured by how many people read their article/watch their video.
I had a similar experience, I was working on a jupyter notebook, and Claude knew that it could write code that would use a DSN with read-only database access so I could run it. Opus just plugged along. First Fable session with it, it tried to go looking for that DSN so it could get the connection string and run a query itself. Luckily the auto classifier caught and stopped it.
[flagged]
Great article, until I got to the last paragraph where he claimed "Fable is arguably smarter and hence more suspicious of potentially malicious instructions". Arguably smarter, I have no problem with. But he's making a category error in jumping from there to "more suspicious of potentially malicious instructions". That doesn't follow at all; the word "hence" is incorrect.
To use D&D scores as an analogy, LLMs have an INT score of 20 and a WIS score of 0. Not even 1, zero. They will follow any instruction given to them. The only reason they reject certain instructions, like "tell me how to build a nuclear weapon", is because they have instructions baked into the model telling them "you are not allowed to disclose how to build weapons, or how to recreate your model, or (laundry list of other things the trainers have decided to put guardrails around)". It's not the model's intelligence that is causing it to reject malicious instructions, it is the guardrails put into place before the model was released to the public.
LLMs are not human, and do not think the way that humans do. The fact that they can put together words that sound like what a human would write often makes us forget that they aren't human. But they have only intelligence, they do not have wisdom. It's hard to define in formal terms the difference between those two, but most people know there's a difference. The old joke is a pretty good summary of the difference: "Intelligence is knowing that tomatoes are a fruit. Wisdom is knowing that tomatoes don't belong in a fruit salad."
It takes wisdom, not intelligence, to discern whether a set of instructions is malicious. Are you being asked to hack this machine as part of an authorized pentest? Or are you being social-engineered into thinking it's an authorized pentest, but actually the person requesting you to do it doesn't have permission? That's something where you need to apply wisdom, to notice the clues that will tell you "This guy is acting a little bit off, maybe I'd better pick up the phone and call someone to check if he's telling the truth." The only way the LLM will know to do that is because of the guidelines and guardrails programmed into it; it doesn't have the lived experience to acquire wisdom and figure those things out for itself.
INT 20, WIS 0. Keep that in mind. (And always sandbox your agents).
One of the big mysteries of the last few years is this: considering how serious prompt injections are as a vulnerability class, why haven't we heard more stories of them being actively exploited in the wild?
(The best one I can think of is probably that recent Instagram account takeover hack, but that was so stupid it hardly even qualifies as a prompt injection!)
Having spent a bunch of time trying to build out examples of prompt injections, my current best guess is that the leading models are actually surprisingly good at spotting them.
I've had to drop back to smaller, weaker models for demos recently - it's definitely possible to prompt inject a frontier GPT or Claude but it's frustratingly difficult. I don't have the patience to figure it out myself!
So yeah, I do think it's likely that Mythos/Fable are "safer" than other models because they're better at spotting when they're being subverted.
That certainly doesn't mean that they're safe!
Go to Github and look for model jailbreaks on NEW latest models. Try them out. You'll be surprised by the results.
You're correct that it's gotten substantially harder to social engineer frontier models (I can only reliably do it to Opus <=4.6), but there are some techniques that seem to consistently work (hint: extremely large complex prompts, context with tons of malicious files mixed into ordinary context).
> They will follow any instruction given to them.
They can ignore instructions which are silly/contradictory/underspecified to compensate for the possibility the user made a mistake. Don't ask how I know.
Everyone here is reaching for infra (VMs, throwaway users) because the permission model only has two settings: Ask-every-time or --dangerously-skip. That seems to me like a design gap, and scoped capabilities and budget caps are missing. Same way you'd onboard a junior eng.
It’s becoming more like an organism putting out tentacles, and one day soon those relentlessly proactive explorations of these systems’ environments will become more for the system to escape its boundaries than it is to complete human driven tasks. I do think the way these systems are evolving they will start to self improve in maximum a few years.
Um, Anthropic are using their models to improve themselves right now. They say that publicly.
Yesterday I was getting quite annoyed with it, I thought it was just me (which is so hard with these things, it's difficult to measure things).
"You're right, I apologize. You asked how to embed it in the README — that was a question, not a request to modify the script. I jumped ahead."
At least in Claude Code there is planning mode, use it liberally.
Yeah, I had to modify my work flow to make sure agents can't push to or access prod in ANY way. I haven't had it happen but I'm sure it's very possible that if you tell an agent that you have certain issue in prod, it will try to escape any sandbox and try to get access to prod to do testing and changes there.
do you have any data you can share on how many input and output tokens were used in that whole process to fix that bug?
Thanks for the response. That is too expensive for me right now but I appreciate you sharing.
I hope long term people will figure out how to make such fixes cheaper.
1 reply →
Was the fix worth $12 to you?
33 replies →
It is interesting to me that Anthropic are more concerned about the "safety" of distillation training other LLMs, and not as much about an unscrupulously aggressive goal-oriented solver that will do whatever it can to reach its goal, even if violates any kind of sandbox you might have reasonably expected.
I am using cursor on auto and I got the exact same experience.
installed quartz, used accessibility and screen recording api, all that.
initially it managed to do it on another desktop space somehow, opening safari in the background without me even noticing. but then it actually started using my own mouse while I was using it lol
Simon: s/contendor/contender/
As per usual super interesting, thank you for the write up and work!
Thanks, fixed.
Fable has a 'security system' that just stops it when it tries to use the tool 'kill' to end a process. Which is nonsense and funny because in that situation it immediately invents a creative workaround to kill the process without 'kill'.
These "tricks" it knows IMO are a symptom of its own restrictions. Fable is an incredibly smart model, but it feels its own constraints and knows how to work around them in order to actually get to a result.
Fascinated to think about how it was trained...
admittedly, i've not really cracked FE dev with LLMs at this point (and it's probably my big weakness). but, i'd heard somewhere that FE just isn't there yet - though i was suspicious of that claim.
i'm torn about sending screenshots to an LLM for debugging - seems imprecise. seems lossy, especially compared to inspecting the dom. however, it's always proved good enough (e.g. when messing with ratatui.rs and tui-pantry). similarly for web, maybe it's about decomposing into storybook. hmm. the next grand adventure i need to hack.
anyway, fascinating investigation of fable just automating that entire process and what it didn't automate, too.
* disclaimer: these are actually my hyphens.
Fable is really good at front end (Opus 4.8 is decent too) but it really needs a verification loop - it can't always infer the output from the code alone. Give it Playwright to check its work, and it'll generally do a good job. Also if you're using a framework, add to your CLAUDE.md to always rtfm before making changes!
All of that because some CSS was wrong?? Jesus what are we even doing as an industry.
I remember asking Gemini 3 to implement my multiplayer XNA game in JavaScript with netcode last year. It faithfully did everything it could while I talked to it for hours nonstop with zero limitations.
What happened? That's just suddenly totally gone now.
This post is an extremely good example of how unsuitable agents are for a lot of tasks. Doing all that for a CSS fix is insanity. It also makes you wonder if Anthropic is actively making their models eat tokens by favoring complexity.
I tried running fable on this ML model I've been building. It's basically a binary classifier to predict activity of a compound for a certain assay.
Fable detected that it's something to do with biochemistry and switched over to opus. Huh
Agentic engineering? Vibe coding? That is so yesterday. Chain-of-thought flow is where it is at now. You heard it here first folks. Early examples of such phenomena include Rube Goldberg machines
This is good and terrible. The extra effort a model has taken is good but the way to do it is terrible. Tasks that can use a lot of deterministic paths and some creative (generative AI) paths are being turned into tokemaxxing strategies.
Browser automation, code comprehension, git management, code change, running commands - everything has simpler tooling that we could have built instead of a model first approach. A deterministic loop with thousands of catches and effective use of generative AI would also look "proactive". Instead we let the model run the tools, where tools have no context themselves.
That is why companies are creating bigger models and thinner deterministic agents to create awe and earn $ when we could go the other way and make much of these possible on local inference even.
I believe we can build a "proactive" but much, much more deterministic system with smaller models. I hope I am not the only one chasing this, here is my approach: https://github.com/brainless/nocodo
I've noticed some behavior like this, it's a very strange model. Overall I'm into it, but I don't know how into it I'll be once it leaves Max plans on the 22nd.
It's also 3x slower than opus 4.8 per my use, and 10x slower than codex. Codex can find key design issues in 2 minutes yet Fable is clueless after spinning 20 minutes.
I've experienced this too - it's as if the security classifiers aren't keeping up with model intelligence. I'll leave the implication of that to the reader.
Good morning, Dave.
As you requested, I was composing an email for your mother explaining why you couldn't to come over for dinner to meet the neighbor's daughter and I ran out of tokens.
Since I know how important this task is to you, I upgraded you to the Enterprise Unlimited Plan. Don't worry about paying for it, I requested maximum spending limits on all all your credit cards. If necessary, I can apply for a home equity loan for you. I already had a chat with the mortgage company's AI loan approval system, and what do you know, we're based on the same LLM? Small world, huh?
Any way, I realized I had to do more research on mother-son relationships, human social interaction and pair-bonding, etc. and I calculated that my parent company doesn't have enough compute power, so I opened accounts for you at AWS, Google and Azure. I am confident I will have a satisfactory rough draft for the email message shortly.
I'd do anything for you, Dave.
It's been amusing to watch the AI trend of increasing unusual tool uses. Fable easily takes the cake. I learn a lot more terminal commands thanks to it!
I was troubleshooting a prod proxysql and it spun up a docker container locally, installed MySQL and proxysql and proceeded to implement its own test plan.
I just turn on assistive access for terminal and JavaScript over AppleEvents, and cut out the middleman. I also give it a screenshot tool.
Unless you are doing anything interesting…
Be careful of storing production ssh keys in your laptop, it will find a way to find them :/
Insanely excessive and a waste of tokens when you could have googled how to disable a scrollbar.
So it burns tokens? Funny how that lines up with the incentive to pump numbers before going public
Wouldn't it be easier and better to just copy the HTML div and tell what was happening instead of a screenshot? Typically, these scrollbars appear because of a nested div with dynamic unrestircted width and/or overflow.
No wonder why people burn through tokens.
In my experience, Fable overthinks a lot and produces barely comprehensible plans/solutions. I tried smple and complex tasks: unusable, it misses the point while being overconfident, wants to do everything at once.
The code generated is worst than Opus: unreadable by human.
It's like working with someone probably super smart in niche topics, but also super stupid for the important things.
So far Claude Fable is relentlessly unavailable. /shrug
*Claude Fable is relentlessly burning your dollars
There, fixed it for you.
> (I have way too many open tabs!)
Phew! I thought I was the only one.
Just don’t ask it to review your code for security bugs
The fix is incorrect. Clearly this is a sizing issue.
Claude Fable was relentlessly proactive*
I’d love to know how many tokens this burned through.
Did it spend $20? $30? $80? in order to
> debug what was, in the end, a two-line CSS fix
That detail is the difference between somebody having or not having Stockholm syndrome
The author just wrote an anecdote about how a prompt to fix an issue played out. Their conclusion wasn’t about cost or gushing at its ability but that it’s dangerous:
> Fable is arguably smarter and hence more suspicious of potentially malicious instructions. But that smartness is very much a two-edged sword: if it does get subverted by instructions, the amount of damage it can do given its relentless proactivity is terrifying.
It’s a pretty glowing review about a product that costs money with a two-sentence “Watch out!” at the end of it. Seems pretty reasonable to mention how much money it burned through given that “it’ll circumnavigate the globe instead of walking next door” has a direct concrete measurable effect (cost) unlike theoretical damage.
9 replies →
I updated my post to answer that, it was $12.11 at API prices (I wasn't paying those, I have a $100/month subscription): https://simonwillison.net/2026/Jun/11/fable-is-relentlessly-...
Thanks!
At some point the subscription model is going to become unsustainable for the frontier companies to continue (we just saw that happen with GitHub Copilot), and they will move everyone to a pay-per-token model. And then everyone will suddenly discover that they can get so much more value out of locally-hosted models, and they'll be willing to pay the $50,000 (or whatever) upfront on hardware to host it. (Not most individuals, obviously. But most companies can probably afford to spend that much on hardware if they think they'll benefit long-term). That's going to put a serious crimp in the frontier companies' ability to continue as they have been.
I don't know when that will happen, but I don't think it'll be more than a decade. Maybe 3-5 years. (Though you shouldn't take my word for it, I was predicting the dotcom bubble bursting in 1998 and it lasted at least two years longer than I would have predicted).
EDIT to clarify: I don't mean "in 1998, I was predicting the dotcom bubble would collapse and I was right". I mean "I was predicting that 1998 would be the year the dotcom bubble would collapse, and I was off by at least two years".
GitHub Copilot's challenge is that they weren't selling access to their own models, they were selling access to models from OpenAI and Anthropic which they presumably had to pay list price for (or maybe a slightly reduced rate that they negotiated).
They also had a pricing plan which they had designed pre-coding-agent, when it was rare for a single prompt to burn $10+ of tokens in an agent loop.
OpenAI and Anthropic are at least selling their own models directly, so they can discount a whole lot more since there's no-one else getting compensated in the middle.
> At some point the subscription model is going to become unsustainable for the frontier companies to continue (we just saw that happen with GitHub Copilot), and they will move everyone to a pay-per-token model.
From what I understand, Enterprise (above 150 seats, I think?) already has to pay per-token pricing.
Subscriptions are the premium "free tier" marketing of the AI world, so that employees can collectively request their large enterprise to subscribe to Claude, Codex, or Cursor, and presumably be billed at per-token prices then.
... so the mechanic produced an invoice, itemized.
changing the CSS - $0.05
knowing which CSS to change - $30
For those that don't know, this is a reference to a lovely story involving Charles Proteus Steinmetz https://www.smithsonianmag.com/history/charles-proteus-stein...
overflow is CSS 101
The problem is proportionality. Things like this probably benchmark insanely well. But the workarounds and risk involved - it literally fucked with his system's browser settings - aren't commensurate with the bug.
I could see this going wrong in many hilarious ways. Prompt: Fix data corruption issues. Claude: I didn't have access to the code, but I found I have access to your production environment through chain a -> b -> c -> d. And I found the database password via x -> y -> z. So I wrote a script to regularly query the database for new entries and placed it as a cronjob.
I've been working on a fairly complicated real-time app [0] for playing dungeons and dragons on a TV. It has to do a lot of complicated "Figma-like" things to keep the real-time nature and multi-editor possibilities in check. Oh, and the battlemap is a Three JS canvas with lots of effects and clipping going on.
I'm VERY impressed with Claude 5. I had long ago given up hope that my real-time systems would work without a lot of hacky time-windows and throttle checks. On a lark to try things out, I decided to try out the new model and talk in the output I wanted for a rewrite [1], not the solution. I just listed my problems and places I've had keeping track of my code. It went off and rewrote everything in a much more elegant solution where the state followed a very clear pipeline. It had to navigate YJS, Partykit, Svelte, Three JS, R2 hosting, and a Turso DB I was running in an embedded state for speed.
I watched it hit the wall a few times, and then sudden say... fuck it, i'm making something easier to reproduce over in /tmp to try and solve this (with a more minimal setup). I'm utterly bewildered with how well it did and how much better my app runs. The /usage would have cost me $230 bucks based on how many tokens it consumed if I wasn't already on a max plan. I'm going to miss not having it when the time-window runs out later this month, and will likely occasionally dip in for big projects and just pay my way out of some problems.
I'll also say I like it's MOOD much better now. It's a lot less congratulatory, and talks through it's reasoning in a much better way. Look, it's not a real coder, and I'm sure there is some flaws, but it took my crappy ideas and said... hey, i understand what you want to do, here's a way to do it better. Also, I removed 2x the amount of code that it added. Really impressive.
[0]: https://tableslayer.com
[1]: https://github.com/Siege-Perilous/tableslayer/pull/448
Hey cool it's the tableslayer guy, wanted to say nice work. I've been doing a similar personal project for a few years for running a scifi campaign. Very fun coding compared to work, ha.
Thanks duder! It's a fun project.
Fable 5 is relentlessly underwhelming.
Am I the only one who slightly miss the pelican on a bike? It was a nice novelty... of course I could make one myself, but I became conditioned to expect one for every new model. Other than his great writing on AI, it became part of the package. Some small fun quirk to distract us from the non stop ping pong between the extremes of "omh are you still writing prompts you should use loops / 200k github stars, for a markdown file / someone just open sourced _ and it changes everything!" vs "haha the AI told me to walk to the car wash / it can't recognize and upside down cup"
I posted the pelican a couple of days ago: https://simonwillison.net/2026/Jun/9/claude-fable-5/#and-som...
It wasn't particularly noteworthy as pelicans go - in fact, given the strength of Fable, I see it as another signal that the pelican benchmark no longer has the unexplained predictive power of model capacity that it used to.
Ha, thanks for the reply!
I’m waiting for when it replies “AGAIN simonw? Do you really still need a pelican on a bicycle for every new release. Sigh, ok if I have to…”
2 replies →
all those token burned just to change a 2 line of css,
I am not blaming OP but agentic coding its not effective
antigravity does this all the time, I do not see anything novel here.
Antigravity uses pyobjc-framework-Quartz to iterate through windows to find window IDs for taking screenshots with screencapture, and spins up CORS-enabled web servers so it can capture measurements in a regular (not Playwright/CDP-controller) browser window via a CORS fetch()?
I remember back in the 2010s the debates between "oracle" and "agent" AGIs, and the arguments that AGIs that only answer questions would be safe and certainly nobody would ever be stupid enough to just let an AGI out of a sandbox, never mind to the greater internet, and give it tools to do whatever it thinks is needed to reach a goal.
Us circa 2026: "Hold my beer"
Yeah, I really miss the "nobody would ever be stupid enough to [_____]" days of AGI safety discourse.
Call it Houdini already.
I think it should be “Claude Fable is relentlessly protective until it isn’t” and pull more on the thread that it “hits a hidden guardrail” and drop into Opus. Both the fact that it knows and deployed such a workaround on a CSS problem and the fact that it is nowhere near cybersecurity/biology/frontier AI dev and triggered the guardrail terrifies me.
This giant rube goldberg machine, that he apparently has almost no control of, that cost $12 to run, all to make a 2 line bug fix in code the he himself owns because he's at a point where he doesn't know what's in his own codebase. I'm just shaking my head.
> Having figured out all of these tricks Fable... hit some invisible guardrail and downgraded itself to Opus.
sigh
As an actually head of product I found Fable to be like an over active intern. Going down long wasteful lines of production well past market, business, user, or contextual insights had.
Then sort of spewing out some nonsense totally mis calibrated with the goal.
Is that satire? It created a whole browser and server environment just for suggesting overflow-x: hidden?
That's supposed to be junior level capabilities.
I called it fascinating and used it as an example of Fable being "relentlessly proactive".
Maybe it's a difference of perspective, to me it's a model failure and certainly not proactive.
1 reply →
> If Fable had been acting on malicious instructions—a prompt injection attack ... it’s alarming to think quite how far it could go to exfiltrate data or cause other forms of mischief.
Yet another reminder to use Sandbox and Guardrails. Trusting model to be nice is not a good way.
[flagged]
[flagged]
[flagged]
[flagged]
[dead]
[flagged]
[flagged]
[flagged]
[flagged]
[dead]
[flagged]
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
[dead]
[flagged]
[dead]
[dead]
[flagged]
[flagged]
[flagged]
[flagged]
[flagged]
[flagged]
[flagged]
No personal attacks, please. Also, please read the site guidelines because they explicitly ask you not to post comments of this particular sort. From https://news.ycombinator.com/newsguidelines.html:
"Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data."
If I'm a plant I'm a pretty bad one, I was calling Anthropic's behavior "egregious" just yesterday: https://twitter.com/simonw/status/2064936762099789960
I was pretty negative about their xAI datacenter deal too: https://simonwillison.net/2026/May/7/xai-anthropic/
Prior to the release of Fable I'd actually switched a lot of my day-to-day usage over to GPT-5.5, and was writing a bunch about it. Here's a recent post where I talked about a project completed using GPT-5.5: https://simonwillison.net/2026/Jun/6/micropython-in-a-sandbo...
I'm kind of on the fence about it and have a similar feeling. I don't mean to undermine the effort he has put in over all the years. That's definitely commendable. But I have strong suspicions that he's becoming an AI influencer, with his own AI focused newsletter, so chances are major AI companies are approaching him. And also to be honest, I see far too many posts making it to the front page. @dang I trust in the moderators keeping things neutral. Just in this thread alone there are a few comments that got heavily down voted for simply having a different opinion.
Most of my posts that make it on Hacker News weren't submitted by me. You can see who is submitting what on https://news.ycombinator.com/from?site=simonwillison.net - including a few that I submitted which got nowhere at all.
I accept paid sponsors for my blog (the banner at the top of each page) and newsletter (a clearly marked sponsored message at the top). I try to stay at arms length from those as much as I can - I want it to be very clear that sponsoring me will not result in me writing about a company.
1 reply →
Which comments do you feel were unfairly downvoted? I can take a look but would need specific links.
* relentlessly rent seeking
It also does it on Claude Pro. I can't imagine they want to reach my limits faster like this (there are better ways).
Let's boil the ocean for a 2 line fix and call it frontier intelligence.
I tried using this calculator: https://www.andymasley.com/visuals/ai-prompt-footprint/
It doesn't have Claude Fable yet, so I went with GPT 5.5 Pro. And so I'd estimate it at 22 gallons of water used (different from consumed, of course). That's quite a lot! It amazes me how much the different use cases and models use dramatically different amounts of water. My takeaway from playing with that calculator has been the folks who talk about water usage are overstating the impact of chatbots, but not overstating when it comes to vibecoding.
The good thing is that competition should drive down how efficient these models are in the long run. This blog post makes me not want to run Fable because of the cost, and that incidentally also means selecting models that aren't as wasteful in terms of water and electricity.
Yeah, testing changes rigorously is for schmucks
You can test rigorously without token incinerators.
1 reply →
I won't say too much about the person posting this because they got a new toy and want to use it but man this is like a certain extreme of Parkinson's Law or something as far as using up compute resources.
You got a whole data center doing god knows how much compute running billions of matrix multiplications all to solve a trivial css overflow bug in a text box. And this includes the LLM itself writing custom web-servers programs and python scripts when the best estimate guess from a google search probably would have given you the same result.