Comment by Jaysobel

1 month ago

Author here - some bonus links!

Session transcript using Simon Willison's claude-code-transcripts

https://htmlpreview.github.io/?https://gist.githubuserconten...

Reddit post

https://www.reddit.com/r/ClaudeAI/comments/1q9fen5/claude_co...

OpenRCT2!!

https://github.com/jaysobel/OpenRCT2

Project repo

https://github.com/jaysobel/OpenRCT2

10 comments


theptip 1 month ago

Did you eval using screenshots or some sort of rendered visualization instead of the CLI? I wonder if Claude has better visual intelligence when viewing images (lots of these in its training set) rather than ASCII schematics (probably very few of these in the corpus).

cheema33 1 month ago
Computer use and screenshots are context intensive. Text is not. The more context you give to an LLM, the dumber it gets. Some people think at 40% context utilization, the LLM starts to get into the dumb zone. That is where the limitations are as of today. This is why CLI based tools like Claude Code are so good. And any attempt at computer use has fallen by the wayside.
There are some potential solutions to this problem that come to mind. Use subagents to isolate the interesting bits about a screenshot and only feed that to the main agent with a summary. This will all still have a significantly higher token usage compared to a text based interface, but something like this could potentially keep the LLM out of the dumb zone a little longer.
- fragmede 1 month ago
  
  > And any attempt at computer use has fallen by the wayside.
  You're totally right! I mean, aside from Anthropic launching "Cowork: Claude Code for the rest of your work" 5 days ago. :)
  https://news.ycombinator.com/item?id=46593022
  More to the point though, you should be using Agents in Claude Code to limit context pollution. Agents run with their own context, and then only return salient details. Eg, I have an Agent to run "make" and return the return status and just the first error message if there is one. This means the hundreds/thousands of lines of compilation don't pollute the main Claude Code context, letting me get more builds in before I run out of context there.
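A minimal sketch of the "run make, report only the status and first error" idea described above, written as a plain Python helper rather than an actual Claude Code agent definition. The function names and the "line contains the word error" heuristic are assumptions for illustration, not anything from the thread.

```python
import subprocess
from typing import Optional, Sequence, Tuple


def first_error_line(output: str) -> Optional[str]:
    """Return the first line that looks like an error, or None.

    Heuristic (an assumption): any line containing "error",
    case-insensitively, counts. Real build logs may need a stricter pattern.
    """
    for line in output.splitlines():
        if "error" in line.lower():
            return line.strip()
    return None


def run_and_summarize(cmd: Sequence[str]) -> Tuple[int, Optional[str]]:
    """Run a build command and keep only the exit status and first error.

    The full output is captured but not returned, so the caller
    (for example, a main agent) never sees the bulk of the log.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors appear in order
        text=True,
    )
    return proc.returncode, first_error_line(proc.stdout)
```

Calling `run_and_summarize(["make"])` would hand back something like a two-item summary instead of the whole compilation log, which is the context-saving effect the comment is after.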
  
Jaysobel 1 month ago

I had tried the browser screenshotting feature for agents in Cursor and found it wasn't very reliable - screenshots eat a lot of context, and the agent didn't have a good sense for when to use them. I didn't try it in this project. I bet it would work in some specific cases.
nanapipirara 1 month ago

Claude helped me immensely in getting an image converter to work. Giving it screenshots of the wrong output (lots of layers had unpredictable offsets that were not supposed to be there) alongside the output I expected helped Claude understand the problems, and it fixed the bugs immediately.
deepl_y 1 month ago

I'm not sure if this proves anything, but I saw this article about Opus playing Pokemon. There it was given actual screenshots, and the article still says it navigated visual space pretty poorly despite the advancements: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...

cheschire 1 month ago

Did you intend the last link to link to your project? It’s a copy of the OpenRCT2 project.

philipwhiuk 1 month ago

I think the first one should have been https://github.com/OpenRCT2/OpenRCT2
The actual change to implement CC is: https://github.com/jaysobel/OpenRCT2/commit/5d49dc960fcfc133...

fragmede 1 month ago

> Claude is at a pretty steep visuo-spatial disadvantage,

How hard would it be to use with OpenAI's offerings instead? Particularly, imo, OpenAI's better at "looking" at pictures than Claude.