
Comment by dostick

16 days ago

It's gotten so bad that Claude will pretend, in 10 out of 10 cases, that the task is done or the on-screen bug is fixed. It will even output the screenshot in chat, where you can see quite clearly that the bug is not fixed.

I consulted Claude chat and it admitted this is a major problem with Claude these days, and suggested I ask for the coordinates of the UI controls on the screenshot, thus forcing it to actually look. So I did that the next time, and it just gave me invented coordinates for the objects on the screenshot.

I consulted Claude chat again: how else can I force it to actually look at the screenshot? It said to delegate to another “QA” agent that does only one thing: look at the screenshot and give a verdict.

I did that. The next time, again “job done”, but on the screenshot it isn't. It turns out the agent did everything as instructed: it spawned a QA agent, and the QA agent inspected the screenshot. But instead of accepting that agent's conclusion, the coder agent gave its own verdict that the job was done.

It will do anything: if you fail to mention some possible situation, it will find a “technicality”, a loophole that allows it to declare the job done no matter what.

And on top of that, if you develop for native macOS, there's no official tooling for visual verification. It's as if 95% of development is web, and LLM providers care only about that.

> I consulted Claude chat and it admitted this is a major problem with Claude these days, and suggested I ask for the coordinates of the UI controls on the screenshot, thus forcing it to actually look

If, three years into LLMs, even HNers still don't understand that the responses LLMs give to this kind of question are completely meaningless, the average person really doesn't stand a chance.

  • The whole “chat with an AI” paradigm is the culprit here. It primes people to think they are actually having a conversation with something that has a mind.

    It’s just a text generator that generates plausible text for this role play. But the chat paradigm is genuinely useful to the human; chat is a natural I/O interface for us.

    • I disagree that it’s “just a text generator” but you are so right about how primed people are to think they’re talking to a person. One of my clients has gone all-in on openclaw: my god, the misunderstanding is profound. When I pointed out a particularly serious risk he’d opened up, he said, “it won’t do that, because I programmed it not to”. No, you tried to persuade it not to with a single instruction buried in a swamp of markdown files that the agent is itself changing!


    • > It’s just a text generator that generates plausible text for this role play.

      Often enough, that text is extremely plausible.

    • I pin just as much responsibility on people not taking the time to understand these tools before using them. RTFM basically.

  • It doesn’t help that a frequent recommendation on HN whenever someone complains about Claude not following a prompt correctly is to “ask Claude itself how to rewrite a prompt to get the result you want”.

    Which, sure, can be helpful, but it's kinda just a coincidence (plus some RLHF, probably) that the question happens to generate output text that can be used as a better prompt. There's no actual introspection or awareness of its internal state or architecture beyond whatever high-level summary Anthropic gives it in its “soul” document et al.

    But given how often I’ve read that advice on here and Reddit, it’s not hard to imagine how someone could form an impression that Claude has some kind of visibility into its own thinking or precise engineering. Instead of just being as much of a black box to itself as it is to us.

  • It’s not meaningless. It’s a signal that the agent has run out of context to work on the problem, which is not something it can resolve on its own. Decomposing problems and managing cognitive (or, in this case, quasi-cognitive) burden is a programmer’s job regardless of the particular tools.

    • I think you are saying what I was about to suggest:

      For this single problem, open a new Claude session for this particular issue, refine until it's fixed, then incorporate the fix into the larger project.

      I think the QA agent might have been the same step here, but it depends on how that QA agent was set up.

  • > completely meaningless

    This is way too strong, isn't it? If the user naively assumes Claude is introspecting and will surely be right, then yeah, they're making a mistake. But Claude could get this right, for the same reasons it gets lots of (non-introspective) things right.

    • It's not too strong. If it answered from its weights, it's pretty meaningless. If it did a web search and found reports of other people saying this, you'd want to know that this is how it answered - and then you'd probably just say that here on HN rather than appealing to claude as an authority on claude.

      They also said it "admitted" this as a major problem, as if it has been compelled to tell an uncomfortable truth.


> And on top of that, if you develop for native macOS, there's no official tooling for visual verification. It's as if 95% of development is web, and LLM providers care only about that.

Thinking out loud here, but you could make an application that's always running and always has screen-sharing permission, and that exposes a lightweight HTTP endpoint on 127.0.0.1 which, when read from, gives your agent the latest frame as a PNG file.

Edit: Hmm, not sure that'd be sufficient, since you'd want to click-around as well.

Maybe a full-on macOS accessibility MCP server? Somebody should build that!
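A minimal sketch of the always-running-endpoint idea in Python, assuming macOS's `screencapture` CLI and that Screen Recording permission has already been granted. The port, paths, and function names here are arbitrary illustrations, not an existing tool:

```python
import os
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

def grab_frame() -> bytes:
    """Capture the main display via macOS's screencapture CLI; return PNG bytes."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "frame.png")
        # -x suppresses the shutter sound; add -l <windowid> to grab one window
        subprocess.run(["screencapture", "-x", path], check=True)
        with open(path, "rb") as f:
            return f.read()

class FrameHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        png = grab_frame()
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        self.send_header("Content-Length", str(len(png)))
        self.end_headers()
        self.wfile.write(png)

if __name__ == "__main__":
    # Bind to loopback only, so nothing off-machine can read your screen
    HTTPServer(("127.0.0.1", 8765), FrameHandler).serve_forever()
```

The agent would then fetch http://127.0.0.1:8765/ whenever it wants to verify the screen; as noted in the edit, clicking around would still need something extra, such as the Accessibility APIs.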

There is a tool called Tidewave that allows you to point and click at an issue and it will pass the DIV or ID or something to the LLM so it knows exactly what you are talking about. Works pretty well.

https://tidewave.ai/

> And on top of that, if you develop for native macOS, there's no official tooling for visual verification. It's as if 95% of development is web, and LLM providers care only about that.

I think this is built into the latest Xcode, IIRC.

Oh, no, I had these grand plans to avoid this issue. I had been running into it happening with various low-effort lifts, but now I'm worried that it will stay a problem.

You can provide the screencapture cli as a tool to Claude and it will take screenshots (of specific windows) to verify things visually.

> It’s like 95% of development is web and LLM providers care only about that.

I've been trying to use it for C++ development and it's maybe not completely useless, but it's like a junior who very confidently spouts C++ keywords in every conversation without knowing what they actually mean. I see that people build their entire companies around it, and it must be just web stuff, right? Claude just doesn't work for C++ development outside of most trivial stuff in my experience.

  • Models are also quite good at Go, Rust, and Python in my experience — also a lot of companies are using TypeScript for many non web related things now. Apparently they're also really good at C, according to the guy who wrote Redis anyway.

  • It's working reasonably well for me. But this is inside a well-established codebase with lots of tests and examples of how we structure code. I also haven't used it much for building brand new features yet, but for making changes to existing areas.

  • GPT models are generally much better at C++, although they sometimes tend to produce correct but overengineered code, and the operator has to keep an eye on that.

I mean, I don't use CC itself, just Claude through Copilot IDE plugin for 'reasons'...

At least there it's more honest than GPT, although at work especially it loves to decide not to use the built-in tools and instead YOLO on the terminal, without realizing it's in PowerShell and not a true *nix terminal. And when it does get that right, there's a 50/50 shot it can actually read the output (i.e. it spirals, repeatedly trying to run commands and read the output).

I have had some success with prompting along the lines of 'document unfinished items in the plan' at least...

  • Codex via codex-cli used to be pretty bad about knowing whether it was in PowerShell. I think they might have changed the system prompt or something, because it's now usually generating PowerShell on the first attempt.

    Sometimes it tries to use shell stuff (especially for redirection), but that’s way less common rn.

Are you sure you're talking about Claude? Because it sounds like you're describing how a lot of people function. They can't seem to follow instructions either.

I guess that's what we get for trying to get LLM to behave human-like.

What if, stay with me here, AI is actually a communist plot to ensorcell corporations into believing they are accelerating value creation when really they are wasting billions more in unproductive chatting which will finally destroy the billionaire capital elite class and bring about the long-awaited workers’ paradise—delivered not by revolution in the streets, but by millions of chats asking an LLM to “implement it.” Wake up sheeple!