Comment by piker
9 days ago
This is exactly right. By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction with additional information and improve it. Instead, we let the agent spend $12 and make the fix while learning nothing.
Things I learned from this:
- Fable will do a whole lot more than you might expect in order to verify a fix. I learned that it's "relentlessly proactive". That's a good title for a blog entry!
- You can take screenshots of a window in macOS using the "screencapture" CLI command, but you'll need the integer window ID first.
- That windowID is accessible via "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz library, which installs cleanly via "uv run".
- A neat trick for simulating keyboard shortcuts is to run document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true})); after the page loads.
- You don't need Flask or Starlette to run a CORS-enabled localhost server for capturing JSON from another window - 19 lines of code against the Python standard library http.server package works just fine.
- getComputedStyle(document.querySelector("navigation-search").shadowRoot.querySelector("textarea")) works to read dimensions from inside a Web Component's shadow DOM.
- defaults write com.google.chrome.for.testing AppleShowScrollBars Always
- Claude Fable knows how to apply all of the above. It's always interesting to pick up hints of what a model can and cannot do.
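For anyone curious what the standard-library CORS server from the list above might look like, here is a minimal sketch. It is not the 19-line script from the original session, which isn't shown here; the names CaptureHandler, captured and start_server are made up for illustration, and the wide-open "*" origin is an assumption that only makes sense for a throwaway localhost tool.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # JSON payloads received so far (hypothetical name)


class CaptureHandler(BaseHTTPRequestHandler):
    def _cors(self):
        # Assumption: allow any origin, acceptable only for a local debugging tool.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Answer the browser's CORS preflight request.
        self.send_response(204)
        self._cors()
        self.end_headers()

    def do_POST(self):
        # Read the JSON body posted by the page and remember it.
        length = int(self.headers.get("Content-Length", 0))
        captured.append(json.loads(self.rfile.read(length) or b"null"))
        self.send_response(200)
        self._cors()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

    def log_message(self, *args):
        pass  # keep the console quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), CaptureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A page in another window could then fetch() a POST to http://127.0.0.1:<port>/ with a JSON body, and the payload would show up in the captured list.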
I'm always confused at how many people equate using a coding agent to solve a problem with "learning nothing". If you pay attention to what it's doing you can learn so much!
Sorry that wasn't a criticism of you!
I completely see how it was misread that way. I would edit it now if I could.
I was using you more as an example of a hypothetical programmer using it in this way. If the goal is to create a maintainable product, this isn't a great approach. If the goal is to learn about the model and its behaviors itself, of course this is a fantastic way to experiment. Yes, you might have learned a lot of tricks as a side effect, but avoiding the pain of thinking about, finding and hiding the thing may mask a better abstraction that reduces complexity and allows the project to move forward faster.
Honestly my goal is to learn how to teach an agent to build a maintainable product, so I'm way more interested in the learnings at the agentic level (how to prompt/direct/manage context/restrict tool use, provide reusable shims, etc) than getting into the details of a css bug. That's just not a level of abstraction with sufficient leverage for what I'm trying to do.
I stopped coding a while back because I could have more impact directing a team of developers than writing code personally.
For my use case, the agents are now how I can have that scaled impact.
> If you pay attention to what it's doing you can learn so much!
I think your post is fair but it's worth pointing out that learning via watching is much less effective than learning via doing.
I used to believe that was universally true, but then I learned about the "worked-example effect": https://en.wikipedia.org/wiki/Worked-example_effect
It leads to less cohesive shared vision on how to solve problems. In groups where I am trying to foster a shared technical vision, I try to get people to do “see one, do one, teach one” for procedures that are common enough to come up repeatedly (and as a method for discovery for where automation would be a bigger win). Pure green-fields software dev sometimes is doing such novel things that that doesn’t work well, but much of routine software maintenance is discovery of the steps needed to add a new flow or a new customer type or a new configurable behavior, which benefit from consistency.
The whole saga is kind of nuts, but the thing that fascinates me most is that Fable got this far and then hit some kind of guardrail; I'd be very curious to know what it wasn't able to do that caused it to downgrade to Opus.
It already got extremely... invasive? It didn't do anything that I wouldn't have approved in the same case, but it's interesting that it got as far as launching browsers, inspecting every open window, and storing screenshots to disk, and then it was stopped by something? I wonder what.
It feels like there should be a budget approval step: in that particular case, $12 worth of kWh/tokens was spent without clear approval.
It sounds like you learned lots of things related to the tool, but not so much about the problem that you were using the tool to solve?
Is that fair? Not trying to snark. I see similar results myself.
Learning doesn't happen in a vacuum. Even pre-LLM days where I'd scour stack overflow for the solution to one problem, I'd inadvertently learn other random stuff while looking.
Yes, that's entirely fair.
Most of my career success has been based on my tendency to be relentlessly proactive, and it does not surprise me in the least that frontier models would start to pick up on these strategies (I'm pretty sure each of the individual things you list above is available in the codeoverflow parts of the training corpus, and combining them to achieve a goal seems ... like a fairly obvious result of the type of training these models go through).
About a year ago I remarked to people that despite all my attempts to make data more programmatically accessible, the most effective way for AI to interact with a modern computer is to use the built-in accessibility interfaces driving actual desktops with full applications. I.e., the best API for an AI is the UI (mainly because that's what most humans use).
Opus also makes this kind of technically competent but dumb deviation to fix a simple issue where asking for input would be better. Models have no illative sense.
It's like saying you can learn so much about math from using SymPy to solve equations. Yes, you probably can. If you pay close attention to what is happening and can integrate the techniques being used into your knowledge.
But your learnings here are what, a handful of hacks? For most people it's like being shown the chain rule (which frankly, is more general than any of these learnings) without knowing what a derivative is. It's knowledge that comes context free. And even when it can be understood, I'm not sure I believe it gets integrated especially well when you did none of the work to understand it. If you are extremely diligent and self-aware about what your limitations are, and careful to be sure you have an understanding of this knowledge, sure I guess you can learn a lot.
And ultimately what do you think is more likely? People using the experience of using these tools to progress their knowledge or for them to rely on the answers uncritically? I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.
> I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.
Personally I think the impact of LLMs on children's education is a crisis right now.
Kids are not going to learn to write if an LLM writes their essays for them. And writing is how you learn to think.
It was only pursuing the goal you gave it - Keep Summer Safe.
nobody ever asked how the car with the dead battery was still able to murder hundreds of people with laser beams and stuff
"Oh my God"
Thanks for documenting your personal observations. I do have a few questions. First, could you expand by giving other examples of how you observed this model to be relentlessly proactive? From my personal experience with prior frontier models using both Claude Code and Codex, I found them to already be quite proactive depending on the domain (although Codex a bit less so, which I personally prefer). The main tasks they seemed to struggle with for me are those where the programs the agents wrote naturally have long run times, as they didn't seem to have a good intuition for when/how to change approach to minimise the time spent on the task. Specifically, this shows up if you are trying to scrape sites/services that are heavily guarded against programmatic access, or running automated tasks that call LLMs (such as indexing or document extraction). I'm not surprised that for web dev the proactiveness is the most obvious improvement, as I would expect the most common use case with the most training data to be the biggest priority. I have previously built a workflow similar to the one you described with Fable 5 to auto-test changes to the website, and while it worked somewhat well, it often couldn't identify flaws that are obvious to the human eye, such as overlapping text, inconsistent font choices, or bad layout decisions. I do like it for quick prototyping, but the testing and design decisions were not ones I would hand off at this moment. Did you notice improvements in these areas? Can you share how it does for long-running programs?
If you want I can give you some more specific instructions to test, but I would also be happy to hear from your own use cases.
The visual regression point is interesting. In my experience, the models that do best at "overlapping text/bad layout" catches are the ones being fed actual screenshots rather than DOM snapshots. If Fable is doing screenshot-based diffs natively, that would explain an improvement there, but I haven't verified it.
Are you using Claude Code or a different agent? I'm curious how screenshots are being fed back into the model? Does CC register a tool for this, or is Fable just using a bash tool to perform the screen capture, and then what tool is it using to request the resulting image to be fed back to it?
Claude Code can process images by reading the files. And as I found out the other day, it also knows ffmpeg well enough to process videos even though it has no native video capabilities...
While debugging, it asked me to pass it a video from the past testing, proceeded to generate a "contact sheet" of the video using ffmpeg, interpreted the image to figure out which frames it needed, and extracted the full size frames and extracted the relevant text from it and used it to reproduce the problem with Playwright...
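I don't have the exact command it ran, but a "contact sheet" step like the one described can be sketched roughly as below. The helper name and default values are hypothetical; the fps, scale and tile filters and the -frames:v option are standard ffmpeg features.

```python
def contact_sheet_command(video, out_png, every_seconds=5, cols=4, rows=4, thumb_width=320):
    """Build an ffmpeg command that tiles sampled frames into a single image.

    Hypothetical helper: the sampling interval and grid size are assumptions,
    not values taken from the session described above.
    """
    filters = ",".join([
        f"fps=1/{every_seconds}",   # sample one frame every N seconds
        f"scale={thumb_width}:-1",  # shrink each frame, keeping aspect ratio
        f"tile={cols}x{rows}",      # lay the sampled frames out on a grid
    ])
    # -y overwrites the output file; -frames:v 1 writes a single tiled image.
    return ["ffmpeg", "-y", "-i", video, "-vf", filters, "-frames:v", "1", out_png]
```

The returned list can be passed to subprocess.run(); the resulting PNG is then an ordinary image file that can be read back, and individual full-size frames can be extracted afterwards once the interesting timestamps are known.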
I was using the Claude Code CLI harness. It can "read" any image file on disk, so all it needs is a way to create a file in one of the standard formats supported by the Anthropic API.
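As a rough sketch of "a way to create a file", this is how the screencapture plus Quartz approach mentioned earlier in the thread could be wired together. It is macOS only, needs pyobjc-framework-Quartz for the window lookup, and the function names and the idea of matching on the owning application's name are my own assumptions rather than what the agent actually ran.

```python
import subprocess


def list_onscreen_windows():
    """Return window info dicts for on-screen windows (macOS only)."""
    import Quartz  # imported lazily; provided by pyobjc-framework-Quartz

    return Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )


def find_window_id(windows, owner_name):
    """Return the integer window ID of the first window owned by owner_name."""
    for info in windows:
        if info.get("kCGWindowOwnerName") == owner_name:
            return int(info["kCGWindowNumber"])
    return None  # no matching window found


def screencapture_command(window_id, out_png):
    # -l<id> captures just that window; -x suppresses the shutter sound.
    return ["screencapture", "-x", f"-l{window_id}", out_png]


def capture_window(owner_name, out_png):
    """Screenshot the first window belonging to owner_name into out_png."""
    window_id = find_window_id(list_onscreen_windows(), owner_name)
    if window_id is None:
        raise RuntimeError(f"no on-screen window owned by {owner_name!r}")
    subprocess.run(screencapture_command(window_id, out_png), check=True)
```

Calling capture_window() with the application's name and an output path would leave a PNG on disk that the harness can then read like any other image.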
That's a lot learned about debugging, sure, but it's worthwhile to note that it doesn't tell you much about the abstractions used to build Datasette, as the previous commenters pointed out.
I designed those abstractions myself.
[flagged]
And Fable is still worse than Codex.
I use both and the only thing (as always) that I will use Claude for is UI design.
Opus 4.8 and now Fable are still both worse at actually getting the job done than the Codex model. Claude models write FAR too much code when it's not needed, burn far too many tokens when they are not needed, write unnecessary tests, write plans which are 5 pages longer than needed, etc.
Have you actually compared code quality and plan quality versus Codex? It's demonstrably worse.
I don't know what problems you're working on but Fable is not just better, it is a step change from GPT 5.5 in my experience. It feels at least one major model generation ahead.
In my experience writing about 50 programs with fable, opus, and GPT, fable is a significant step change better than opus which is significantly better than GPT. We must be doing different things.
Curious, which model do you use for Codex? I'm very happy with the solutions '5.5 high' finds. It's like it understands exactly what I mean and it also anticipates all sorts of situations. Before I used '5.5 medium' for some time and it was a bit underwhelming. It may sound funny but it's like it didn't care that much to do a good job.
What are your harnesses? Do you have the same skillsets/tools/etc for both?
But Simon is not trying to get good at CSS debugging, Simon is trying to learn about AI systems and produce content about them. So giving the AI agent a trivial task to go crazy on is a feature, not a bug.
For $12 implied cost, he got a front-page post on HN with 500 comments. What is that worth? :-)
> What is that worth? :-)
This is one of those double-edged sword situations. It is on the front page and it stays there because it will trigger a lot of people, and he has to spend a lot of effort explaining himself. What is that worth?
His explanations would most likely be buried deep so the impression that others get might be worsened. What is that worth?
In my opinion, this is one of those cases where you could find a harder problem and still have the same content... but it might not draw as much feedback or stay on the front page as long.
To most of us that's worth a ton, whereas he's probably had enough front-page posts that there's less value to him, although still likely more than $12 worth.
>enough front-page posts that there's less value to him
On the contrary, I'd say it's probably even more important: without getting on the HN front page regularly (amongst doing other "thought leader" things), an influencer's value to the industry disappears (not criticising him here).
People are missing that Willison is among the very best people we have in the role of (for lack of a good name): early access to frontier models, evaluate them in real scenarios, no wishful thinking, hype, or doom, communicate the possibilities. Yes he could have fixed this himself but then he would have learned nothing about the AI, and we wouldn't have read a fascinating and important article.
>> he would have learned nothing about the AI
There is absolutely zero value in spending time learning about new models, as in a few months a new model will be out and whatever you learned about the current one will be useless.
Also, with models getting better and better, you have to know less and less to achieve the same results.
My experience has been the exact opposite.
As the models get better you need to know more about their capabilities, because otherwise you risk prompting Claude Fable 5 like it's GPT-4o and complaining loudly about how it's all hype and nothing about these models is improving at all (yes, I do see people say that.)
Getting the best results out of these models requires skill, experience, intuition, and domain expertise. There's always room for improving every one of those.
There’s zero value? Surely you don’t believe zero, it’s potentially the most powerful predictive AI in the world ever made? Maybe only incremental steps sure. But also their IPO is coming, you don’t want people evaluating them beforehand?
you know, women make a big deal about you meeting their father/parents, and honestly, I'm too autistic to really fucking have put any importance until now as to why that was remotely important, but if N+1 is coming for your job, it seems it might be worth your while to know the capabilities of N, no?
[dead]
I see it as a prioritization exercise. I know the above is a trivial example, but more generally, does the guy who wrote Datasette and Django want to wrangle front end and css, or do they want to work on something else?
See above https://news.ycombinator.com/item?id=48498573#48502311
> By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction [...]
While by itself that would be true, Simon commonly blogs about things he's up to.
That action provides the opportunity for evaluation, and additionally evaluation by a wider audience.
So, it's not the same scenario as non-bloggers offloading a task... :)
[flagged]
Here's a handy calculator you can use to estimate how much CO2 and water I wasted with my coding agent session: https://www.andymasley.com/visuals/ai-prompt-footprint/
The real point is not "one session"; it's the fact that people now do that routinely, that CI/CD pipelines use these models to check every commit, and that each search engine query now does that too, so it multiplies.
Not sure what point you wanted to make, but this calculator is quite shocking. GPT 5.5 pro, with "a long document" and 10 requests a day gives 25% of daily CO2 emissions!
Ten coding sessions a day with Opus is still 4.7%!
This feels enormous. I will definitely stop rolling my eyes when people complain about AI CO2/water usage...
This very obtusely omits the demand for new data centers and related infrastructure that using AI creates. The going "vegan for a year" option assumes fewer cows being born, but somehow the "don't use AI" option doesn't assume that the data center wasn't built in the first place.
[flagged]
As someone who actually gives a shit about the environment and global warming and has been putting this into practice for more than a decade through daily personal sacrifices: no, I downvote it because if you properly look into it, AI is just completely insignificant compared to cars, air travel, clothing, food, needless junk and so on that it's a joke. It's always brought up by people who never cared, but now pretend to do so because they hate LLMs for other reasons. The irony is that some of those are actually _good_ reasons but they're too cowardly to admit them. There's nothing unmanly about admitting you're afraid of AI taking your job, becoming more intelligent, and ending up in a dystopia.
Go run the numbers and compare them vs. what it takes to produce a single hamburger or hoodie. Anyone who actually cares has already done this and drawn this conclusion.
While one can raise environmental concerns about the AI datacenter buildout, I don't think it is fair to say that it "ruins the planet".
I don't think it is a good contribution to the discussion around Simon's LLM use to fix a CSS bug.
That's an interesting choice as a source. It doesn't mention climate change or human impacts at all and describes El Niño as a naturally occurring event.
> The El Nino is a phenomenon that occurs naturally
It was posted at 5am in New York... not sure that that was a US view, so the fact that the platform is US-owned doesn't seem so relevant, if there's a global audience.
That being said, I do agree it is a legit thought (and moreso, completely on point in the subthread discussing downsides), and that it shouldn't be downvoted.
[flagged]