Comment by bananaquant

9 days ago

This to me reads like a poignant commentary on the catastrophic loss of human agency, with the actual commit being highly revealing [0].

Author wants to hide a horizontal scrollbar. Any junior frontend dev worth their salt will be asking right away "where do I stick `overflow-x: hidden;`?" A complete solution will then require hitting "Inspect element" in the browser to find the CSS class, running (rip)grep to find where it is in the code, and adding a single line there.

An actual proactive programmer might start asking more pointed questions like what content does an empty textbox have that it overflows? And why do I need to insert this workaround that treats the symptom and not the root cause in two different places? Isn't it better to style `textarea` once? Etc, etc.

[0] https://github.com/datasette/datasette-agent/commit/a75a8b72...

They might also ask why a bunch of static CSS inside a bunch of JavaScript is hiding inside __init__.py[0] - hopefully before trying to fix some detail of the CSS.

(I'm surprised to see it actually, since my own use of Claude has mostly yielded well-structured code. But I'm not doing proper vibe-coding, more like friendly Socratic arguing with another engineer who happens to be a robot.)

[0] https://github.com/datasette/datasette-agent/blob/main/datas...

  • > friendly Socratic arguing with another engineer who happens to be a robot

    Ha! Same! Still feels like the best way to go about it, really. I know the dream is to one day remove humans from the loop... but I'll enjoy the dialectic while it still seems the most productive!

    • Same, I like to call it rubber duck coding (now the duck talks back!)

      Edit: Now I want an LLM connected rubber duck with a speaker/microphone that sees your screen

    • For my own projects, I'm very happy with an outcome that is "not faster, but better" as a result of my use of generative AI.

      I still hope this will be a shared goal in at least some tech companies long-term. But the headwinds are strong. "Not better, but faster" is starting to look like a job requirement.

This is exactly right. By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction with additional information and improve it. Instead, we let the agent spend $12 and make the fix while learning nothing.

  • Things I learned from this:

    - Fable will do a whole lot more than you might expect in order to verify a fix. I learned that it's "relentlessly proactive". That's a good title for a blog entry!

    - You can take screenshots of a window in macOS using the "screencapture" CLI command, but you'll need the integer window ID first.

    - That windowID is accessible via "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz library, which installs cleanly via "uv run".

    - A neat trick for simulating keyboard shortcuts is to run document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true})); after the page loads.

    - You don't need Flask or Starlette to run a CORS-enabled localhost server for capturing JSON from another window - 19 lines of code against the Python standard library http.server package works just fine.

    - getComputedStyle(document.querySelector("navigation-search").shadowRoot.querySelector("textarea")) works to read dimensions from inside a Web Component's shadow DOM.

    - defaults write com.google.chrome.for.testing AppleShowScrollBars Always

    - Claude Fable knows how to apply all of the above. It's always interesting to pick up hints of what a model can and cannot do.
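
    For anyone who wants to try the screencapture and Quartz bullets above, here is a minimal sketch, not what Fable actually ran. The helper names (find_window_ids, screencapture_command, capture_window) are made up for illustration. It assumes macOS with pyobjc-framework-Quartz installed, and that the -l and -x flags behave as described in the screencapture man page.

```python
import subprocess


def find_window_ids(owner_name):
    """Return IDs of on-screen windows owned by `owner_name` (macOS only)."""
    # Imported lazily: needs pyobjc-framework-Quartz and only works on macOS.
    import Quartz

    windows = Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )
    return [
        int(w[Quartz.kCGWindowNumber])
        for w in windows
        if w.get(Quartz.kCGWindowOwnerName) == owner_name
    ]


def screencapture_command(window_id, output_path):
    """Build the screencapture invocation for one window (hypothetical helper)."""
    # -l<id> captures the window with that integer ID; -x mutes the shutter sound.
    return ["screencapture", "-x", f"-l{window_id}", output_path]


def capture_window(owner_name, output_path):
    """Screenshot the first on-screen window owned by `owner_name`."""
    ids = find_window_ids(owner_name)
    if not ids:
        raise RuntimeError(f"no on-screen window owned by {owner_name!r}")
    subprocess.run(screencapture_command(ids[0], output_path), check=True)
```

    Something like capture_window("Safari", "safari.png") would then write the screenshot to disk, assuming the terminal has been granted screen recording permission.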

    I'm always confused at how many people equate using a coding agent to solve a problem with "learning nothing". If you pay attention to what it's doing you can learn so much!

    • Sorry that wasn't a criticism of you!

      I completely see how it was misread that way. I would edit it now if I could.

      I was using you more as an example of a hypothetical programmer using it in this way. If the goal is to create a maintainable product, this isn't a great approach. If the goal is to learn about the model and its behaviors itself, of course this is a fantastic way to experiment. Yes, you might have learned a lot of tricks as a side effect, but avoiding the pain of thinking about, finding and hiding the thing may mask a better abstraction that reduces complexity and allows the project to move forward faster.

    • > If you pay attention to what it's doing you can learn so much!

      I think your post is fair but it's worth pointing out that learning via watching is much less effective than learning via doing.

    • The whole saga is kind of nuts, but the thing that fascinates me most is that Fable got this far and then hit some kind of guardrail; I'd be very curious to know what it wasn't able to do that caused it to downgrade to Opus.

      It already got extremely... invasive? It didn't do anything that I wouldn't have approved in the same case, but it's interesting that it got as far as launching browsers, inspecting every open window, and storing screenshots to disk, and then it was stopped by something? I wonder what.

    • It sounds like you learned lots of things related to the tool, but not so much about the problem that you were using the tool to solve?

      Is that fair? Not trying to snark. I see similar results myself.

    • Most of my career success has been based on my tendency to be relentlessly proactive, and it does not surprise me in the least that frontier models would start to pick up on these strategies (I'm pretty sure each of the individual things you list above is available in the codeoverflow parts of the training corpus, and combining them to achieve a goal seems ... like a fairly obvious result of the type of training these models go through).

      About a year ago I remarked to people that despite all my attempts to make data more programmatically accessible, the most effective way for AI to interact with a modern computer is to use the built-in accessibility interfaces driving actual desktops with full applications. I.e., the best API for an AI is the UI (mainly because that's what most humans use).

    • Opus also does these kinds of technically competent but dumb deviations to fix a simple issue where asking for input would be better. Models have no illative sense.

    • It's like saying you can learn so much about math from using SymPy to solve equations. Yes, you probably can. If you pay close attention to what is happening and can integrate the techniques being used into your knowledge.

      But your learnings here are what, a handful of hacks? For most people it's like being shown the chain rule (which frankly, is more general than any of these learnings) without knowing what a derivative is. It's knowledge that comes context free. And even when it can be understood, I'm not sure I believe it gets integrated especially well when you did none of the work to understand it. If you are extremely diligent and self-aware about what your limitations are, and careful to be sure you have an understanding of this knowledge, sure I guess you can learn a lot.

      And ultimately what do you think is more likely? People using the experience of using these tools to progress their knowledge or for them to rely on the answers uncritically? I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.

    • Thanks for documenting your personal observations. I do have a few questions. First, could you expand by giving other examples of how you observed this model to be relentlessly proactive? From my personal experience with prior frontier models using both Claude Code and Codex, I found them to already be quite proactive depending on the domain (although Codex a bit less so, which I personally prefer).

      The main tasks they seemed to struggle with for me are those that naturally have long run times for the programs the agents wrote, as they didn't seem to have a good intuition for when or how to change approach to minimise the time spent on the task. Specifically, this shows up if you are trying to scrape sites/services that are heavily guarded against programmatic access, or running automated tasks that call LLMs (such as indexing or document extraction).

      I'm not surprised that for web dev the proactiveness is the most obvious improvement, as I would expect the most common use case with the most training data to be the biggest priority. I have previously built a workflow similar to the one you described for Fable 5 to auto-test changes to the website, and while it worked somewhat well, it often couldn't identify flaws that are obvious to the human eye, such as overlapping text, inconsistent font choices, or bad layout decisions. I do like it for quick prototyping, but the testing and design decisions were not ones I would hand off at this moment. Did you notice improvements in these areas? Can you share how it does for long-running programs?

      If you want I can give you some more specific instructions to test, but I would also be happy to hear from your own use cases.

    • Are you using Claude Code or a different agent? I'm curious how screenshots are being fed back into the model? Does CC register a tool for this, or is Fable just using a bash tool to perform the screen capture, and then what tool is it using to request the resulting image to be fed back to it?

    • That's a lot learned about debugging, sure, but it's worthwhile to note that it doesn't tell you much about the abstractions used to build Datasette, as the previous commenters pointed out.

    • And Fable is still worse than Codex.

      I use both and the only thing (as always) that I will use Claude for is UI design.

      Opus 4.8 and now Fable are still both worse at actually getting the job done than the Codex model. Claude models write FAR too much code when it's not needed, burn far too many tokens when they are not needed, write unnecessary tests, write plans which are 5 pages longer than needed, etc.

      Have you actually compared code quality and plan quality versus Codex? It's demonstrably worse.

  • But Simon is not trying to get good at CSS debugging, Simon is trying to learn about AI systems and produce content about them. So giving the AI agent a trivial task to go crazy on is a feature, not a bug.

    For $12 implied cost, he got a front-page post on HN with 500 comments. What is that worth? :-)

    • > What is that worth? :-)

      This is one of those double-edged sword situations. It is on the front page and it stays there because it will trigger a lot of people, and he has to spend a lot of effort explaining himself. What is that worth?

      His explanations would most likely be buried deep so the impression that others get might be worsened. What is that worth?

      In my opinion, this is one of those cases where you could find a harder problem and still have the same content... but it might not draw as much feedback or stay on the front page as long.

    • To most of us that's worth a ton, whereas he's probably had enough front-page posts that there's less value to him, although still likely more than $12 worth.

  • People are missing that Willison is among the very best people we have in the role of (for lack of a good name): early access to frontier models, evaluate them in real scenarios, no wishful thinking, hype, or doom, communicate the possibilities. Yes he could have fixed this himself but then he would have learned nothing about the AI, and we wouldn't have read a fascinating and important article.

    • >> he would have learned nothing about the AI

      There is absolutely zero value in spending time to learn about new models, as in a few months a new model will be out and whatever you learned about the current one will be useless.

      Also, with models getting better and better, you have to know less and less to achieve the same results.

  • > By offloading this trivial task to the LLM, Simon has abandoned the opportunity to evaluate the abstraction [...]

    While by itself that would be true, Simon commonly blogs about things he's up to.

    That action provides the opportunity for evaluation, and additionally evaluation by a wider audience.

    So, it's not the same scenario as non-bloggers offloading a task... :)

Seems like this model delivers on what has already been scaling quite nicely, which is the length and complexity of the requested tasks, but isn't such a big improvement on what hasn't been scaling so far - common sense, discernment, good judgement.

  • > common sense, discernment, good judgement

    I feel like the whole point of all the experimentation with AI right now is determining whether any of these things actually matter to the end result, over various timeframes.

    • It's well known that companies with an abundance of raw technical skills but poor judgement tend to fail. On the technical side technical debt accumulates, while on the business side the wrong choices are made. I think it's valid to generalize this to AI.

I think Fable is predisposed to try and verify its changes. Which is a very good thing. It takes a lot of prompts to get Opus to do what Fable does unprompted.

That is exactly what I would want from a junior developer - make sure the bug exists, find a way to fix it, verify the bug is fixed.

The problem, as was correctly identified in the blog post, is that instead of stopping and asking for elevated permission it relentlessly tries to find a hack on its own. (An equivalent situation for a human developer would be needing access to a third-party sandbox and, instead of asking a senior for credentials, trying to set up his own sandbox from scratch.)

  • No, the problem is mostly the incorrect prompt that sent Fable down a rabbit hole, resulting in an incorrect solution.

This is the worst thing about current AI agents. They never ask questions. The prompt has to be pixel perfect and unambiguous or they'll happily run away doing something ridiculous.

I misread your comment at first and thought you were insulting Simon Willison, rather than calling Claude Fable a bad developer, and so I'm commenting here to clarify it in case others also misread it.

That first sentence threw me off.

Anyway, I'm glad he spent the $12 because this blog post was highly informative.

The 'better' fixes are often for our (human) benefit. These messy fixes serve the AI companies' interests of creating messes that need even more tokens (money) later. Bad and self-serving developers act the same way, creating tech debt.

Yes, I agree, the solution committed is horrible, but nobody cares any more. We have entered a very strange parallel universe where, because AI can work things out, it's easier to take solutions that are suboptimal and just churn out (potentially) buggy features.

  • I care. If you can loosely point me in the direction of a better solution I'll do the extra work.

    • Interesting... I downloaded datasette-agent and removed various styles from the textarea (with the intention of providing a PR), including the overflow-x: hidden, and I tried Safari and Chrome with the global Mac setting for always showing scrollbars both on and off. It NEVER shows the scrollbar for me.

      Do you have an extension installed that is doing something weird to your textareas? Maybe I'm doing it wrong but I think for now overflow-x is fine if you are experiencing it and I am not! Let's all get on with our lives... I was probably a bit overzealous about caring all that much about a perfectly fine CSS fix.

Actually, it seems to me that it is just over-monetization of any impulse.

I remember when you were billed by the minute for connecting to the online world.

There were lots of incentives to keep the meter running.

Is this sort of like that?

You missed what I think is the most interesting question: why does the bug appear in Safari macOS but not in Firefox, Chrome, or WebKit running inside of Playwright?

(Dozens of people in this thread are implying that any web dev should have known to solve it with overflow-x: hidden, and not one of them has addressed that browser difference yet.)

  • I think any web dev knows not to question browser differences if it can be fixed without opening that can of worms.

This is missing the point. Simon is a fantastic developer, but keeping track of all the nuances of frontend frameworks and browser implementations is a lot even for great people.

It is really awesome that the final change was only a two-line CSS change.