Comment by infinitebattery

7 days ago

From this commit: https://github.com/cloudflare/workers-oauth-provider/commit/...

===

"Fix Claude's bug manually. Claude had a bug in the previous commit. I prompted it multiple times to fix the bug but it kept doing the wrong thing.

So this change is manually written by a human.

I also extended the README to discuss the OAuth 2.1 spec problem."

===

This is super relatable to my experience trying to use these AI tools. They can get halfway there and then struggle immensely.

> They can get halfway there and then struggle immensely.

Restart the conversation from scratch. As soon as you get something incorrect, begin from the beginning.

It seems to me like any mistake in a messages chain/conversation instantly poisons the output afterwards, even if you try to "correct" it.

So if something was wrong at one point, you need to go back to the initial message, and adjust it to clarify the prompt enough so it doesn't make that same mistake again, and regenerate the conversation from there on.
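To make the "go back and regenerate" idea concrete, here is a minimal sketch, assuming the conversation is held as a plain list of messages. `Message` and `restart_from` are hypothetical names for illustration, not part of any chat client's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


def restart_from(history: list[Message], index: int, new_content: str) -> list[Message]:
    """Return a new history: everything before `index`, plus an edited copy
    of the message at `index`. All later messages are dropped, so the bad
    output is never sent back to the model."""
    if not 0 <= index < len(history):
        raise IndexError("index out of range")
    edited = Message(history[index].role, new_content)
    return history[:index] + [edited]


# Hypothetical conversation that went wrong after the first user message.
history = [
    Message("system", "You are a coding assistant."),
    Message("user", "Write a token refresh handler."),
    Message("assistant", "<buggy code>"),
    Message("user", "That is wrong, fix it."),
    Message("assistant", "<still buggy code>"),
]

# Clarify the original prompt instead of appending another correction.
fresh = restart_from(
    history,
    1,
    "Write a token refresh handler. Reject any token that was already rotated.",
)
```

The original `history` is left untouched, so the failed attempt is still available for comparison while `fresh` is what gets sent for regeneration.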

  • Chatbot UIs really need better support for conversation branching all around. It's very handy to be able to just right-click on any random message in the conversation in LM Studio and say, "branch from here".

    • Maybe it's contrarian, maybe it's not, but I don't think Chat UIs are well suited for software engineering/programming at all, we need something completely different. Being able to branch conversations and such would be useful, but probably not for the way I do software. Besides, I'm rarely beyond 3 messages (1 system, 1 user, 1 assistant) in any usage of the chat UIs. Maybe it's more useful to people with different workflows.

    • AI Studio has this, I usually ask it to plan and I do some rounds of refining until the plan covers all my requirements, then I branch this conversation, a branch for each feature, none of the branches get polluted this way.

  • I thought Claude still has a problem generating the same output for the same input? That you can't just rewind and rerun and get to the same point again.

    • > I thought Claude still has a problem generating the same output for the same input?

      I haven't used Anthropic's models/software in a long time (months, basically forever in AI ecosystem), so don't know exactly how it works now.

      But last time I used Claude, you could edit the first message, and then re-generate the assistant's next message based on your edit. Most of the LLM interfaces have one way or another of doing this, and I can't imagine they got rid of that feature.

      What I'm suggesting isn't to use the exact same input (the first message), but rather change it so you remove the chances of something incorrect happening later after that.

    • > can't just rewind and rerun and get to the same point again

      Why would you want to? The whole point of a retry is that your previous conversation attempt went poorly.

  • It can be done, but for my environment the sum of all prompts that I end up typing to get the right result ends up being longer than the actual code.

    So now I'm using LLMs as crapshoot machines for generating ideas which I then implement manually

  • Can you imagine if Excel worked like this? The formula put out the wrong result, so try again! It's like that scene from The Office where Michael has an accountant "run it again." It's farcical. They have created computers that are bad at math and I will never forgive them.

    Also, each try costs money! You're pulling the lever on a god damned slot machine!

    I will TRY AGAIN with the same prompt when I start getting a refund for my wasted money and time when the model outputs bullshit. Otherwise this is all confirmation and sunk cost bias talking, I'm sure of it.

    • > Can you imagine if Excel worked like this?

      I mean, why would I imagine that? Who would want that? It's like the argument against legal marijuana, where someone replies "But would you like your pilot to be high when flying?!". Right tool for the right job: clearly, when you want 100% certainty, LLMs aren't the tool for that. Just because they're useful for some things doesn't mean we have to replace everything with them.

      > Also, each try costs money!

      I guess you're using some paid API? Try a different way then. I mostly use the web UI from OpenAI, or Codex lately, or run things locally with my own agent using local weights. None of those "cost money per try" any more than writing data to my SSD costs me money.

      It's not the holy grail some people paint it as, and I'm not sure we're across the "productivity threshold" (https://news.ycombinator.com/item?id=44160664) yet, but it's probably worth trying it out before jumping to conclusions. But no one is forcing you either, YMMV and all that.

The comment in lines 163 - 172 makes some claims that are outright false and/or highly A/S dependent, to the point where I question the validity of this post entirely. While it's possible that an A/S can be pseudo-generated based on lots of training data, each implementation makes very specific design choices. For example, Auth0's A/S allows for a notion of "leeway" within the scope of refresh token grant flows to account for network conditions, but other A/S implementations may be far more strict in this regard.

My point being: assuming you have RFCs (which leave A LOT to the imagination) and some OSS implementations to train on, each implementation usually has too many highly specific choices made to safely assume an LLM would be able to cobble something together without an amount of oversight effort approaching simply writing the damned thing yourself.
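As an illustration of how much one such design choice changes behavior, here is a minimal sketch of refresh token rotation with a configurable reuse leeway. It is not Auth0's or any other A/S's actual implementation; `RefreshTokenStore`, `leeway_seconds`, and the in-memory storage are all assumptions made for the example.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class _Record:
    successor: Optional[str] = None     # token issued when this one was rotated
    rotated_at: Optional[float] = None  # timestamp of that rotation


class RefreshTokenStore:
    """Hypothetical in-memory store; a real A/S would persist this state."""

    def __init__(self, leeway_seconds: float = 0.0):
        self.leeway_seconds = leeway_seconds
        self._tokens: dict[str, _Record] = {}

    def issue(self) -> str:
        token = secrets.token_urlsafe(32)
        self._tokens[token] = _Record()
        return token

    def rotate(self, token: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        record = self._tokens.get(token)
        if record is None:
            raise PermissionError("unknown refresh token")
        if record.successor is None:
            # First use: rotate normally.
            record.successor = self.issue()
            record.rotated_at = now
            return record.successor
        if now - record.rotated_at < self.leeway_seconds:
            # Reuse inside the leeway window (e.g. a retried request):
            # hand back the same successor instead of failing.
            return record.successor
        # Reuse outside the window is treated as an error.
        raise PermissionError("refresh token reused outside leeway window")
```

With `leeway_seconds=0` every reuse is rejected; with a nonzero value a client that retries after a dropped response still gets its new token. The sketch also leaves out what to do after a rejected reuse (for example, revoking the whole token family), which is another choice that differs between implementations.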

One way to mitigate the issue is to use tests or specifications and let the AI find a solution to the spec.

A few months ago, solving such a spec riddle could take a while, and most of the time, the solutions that were produced by long run times were worse than the quick solutions. However, recently the models have become significantly better at solving such riddles, making it fun (depending on how well your use case can be put into specs).

In my experience, sonnet 3.7 represented a significant step forward compared to sonnet 3.5 in this discipline, and Gemini 2.5 Pro was even more impressive. Sonnet 4 makes even fewer mistakes, but it is still necessary to guide the AI through sound software engineering practices (obtaining requirements, discovering technical solutions, designing architecture, writing user stories and specifications, and writing code) to achieve good results.

Edit: And there is another trick: Provide good examples to the AI. Recently, I wanted to create an app with the OpenAI Realtime API and at first it failed miserably, but then I added the most important two pages of the documentation and one of the demo projects into my workspace and just like that it worked (even though for my use case the API calls had to be used quite differently).
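A minimal sketch of the "spec first" approach from the top of this comment, assuming the spec can be written as input/output pairs. `parse_scopes`, `SPEC`, and `check_spec` are hypothetical names; the function body stands in for whatever the model produces.

```python
def parse_scopes(raw: str) -> list[str]:
    """Candidate implementation. In the workflow described above, this is
    the part the model writes and rewrites until the spec passes."""
    seen: list[str] = []
    for scope in raw.split():
        if scope not in seen:
            seen.append(scope)
    return seen


# The spec: written by a human, never edited by the model.
SPEC = [
    ("", []),
    ("read", ["read"]),
    ("read  write", ["read", "write"]),        # repeated whitespace
    ("read write read", ["read", "write"]),    # duplicates dropped, order kept
]


def check_spec(candidate) -> list[str]:
    """Run a candidate against the spec and return readable failures,
    which can be pasted back into the prompt for the next attempt."""
    failures = []
    for raw, expected in SPEC:
        actual = candidate(raw)
        if actual != expected:
            failures.append(f"{raw!r}: expected {expected}, got {actual}")
    return failures
```

An empty list from `check_spec(parse_scopes)` means the candidate satisfies the spec; anything else is concrete feedback rather than "that's wrong, try again".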

  • That's one thing where I love Golang. I just tell Aider to `/run go doc github.com/some/package`, and it includes the full signatures in the chat history.

    It's true: often enough AI struggles to use libraries, and doesn't remember the usage correctly. Simply adding the go doc fixed that often.

I am waiting for studies on whether we just have an illusion of productivity or whether these tools actually save man-hours in the long term when creating production-level systems.

This to me is why I think these tools don't have actual understanding, and are instead producing emergent output from pooling an incomprehensibly large set of pattern-recognized data.

  • > these tools don't have actual understanding, and are instead producing emergent output from pooling an incomprehensibly large set of pattern-recognized data

    I mean, bypassing the fact that "actual understanding" doesn't have any consensus about what it is, does it matter if it's "actual understanding" or "kind of understanding", or even "barely understanding", as long as it produces the results you expect?

    • > as long as it produces the results you expect?

      But it's more the case of "until it doesn't produce the results you expect" and then what do you do?

    • No, I was not making a critique on its effectiveness at generating usable results. I was responding to what I've seen in several other articles here arguing towards anthropomorphism.

Same. But I personally find it a lot easier to do those bits at the end than to begin from a blank file/function, so it's a good match for me.

  • Same here. Sometimes you just need time to stew in the problem/solution space.

    LLMs let me be ultraproductive upfront then come in at the end to clean up when I have a full understanding.