We put Claude Code in Rollercoaster Tycoon

5 days ago (labs.ramp.com)

> As a mirror to real-world agent design: the limiting factor for general-purpose agents is the legibility of their environments, and the strength of their interfaces. For this reason, we prefer to think of agents as automating diligence, rather than intelligence, for operational challenges.

Author here - some bonus links!

Session transcript using Simon Willison's claude-code-transcripts

https://htmlpreview.github.io/?https://gist.githubuserconten...

Reddit post

https://www.reddit.com/r/ClaudeAI/comments/1q9fen5/claude_co...

OpenRCT2!!

https://github.com/jaysobel/OpenRCT2

Project repo

https://github.com/jaysobel/OpenRCT2

  • Did you eval using screenshots or some sort of rendered visualization instead of the CLI? I wonder if Claude has better visual intelligence when viewing images (lots of these in its training set) rather than ascii schematics (probably very few of these in the corpus).

    • Claude helped me immensely getting an image converter to work. Giving it screenshots of wrong output (lots of layers had an unpredictable offsets that was not supposed to be there) and output as I expected it helped Claude understand the problems and it fixed the bugs immediately.

    • I had tried the browser screenshotting feature for agents in Cursor and found it wasn't very reliable - screenshots eat a lot of context, and the agent didn't have a good sense for when to use them. I didn't try it in this project. I bet it would work in some specific cases.

> The only other notable setback was an accidental use of the word "revert" which Codex took literally, and ran git revert on a file where 1-2 hours of progress had been accumulating.

  • Amazing that these tools don't maintain a replayable log of everything they've done.

    Although git revert is not a destructive operation, so it's surprising that it caused any loss of data. Maybe they meant git reset --hard or something like that. Wild if Codec would run that.

  • Yet another reason to use Jujutsu. And put a `jj status` wrapper in your PS1. ;-)

    • Start with env args like AGENT_ID for indicating which Merkle hash of which model(s) generated which code with which agent(s) and add those attributes to signed (-S) commit messages. For traceability; to find other faulty code generated by the same model and determine whether an agent or a human introduced the fault.

      Then, `git notes` is better for signature metadata because it doesn't change the commit hash to add signatures for the commit.

      And then, you'd need to run a local Rekor log to use Sigstore attestations on every commit.

      Sigstore.dev is SLSA.dev compliant.

      Sigstore grants short-lived release attestation signing keys for CI builds on a build farm to sign artifacts with.

      So, when jujutsu autocommits agent-generated code, what causes there to be an {{AGENT_ID}} in the commit message or git notes? And what stops a user from forging such attestations?

      1 reply →

> We don't know any C++ at all, and we vibe-coded the entire project over a few weeks. The core pieces of the build are…

what a world!

  • First time I am seeing realistic timelines from a vibe-coded project. Usually everyone who vibe codes just says they did in few hours, no matter the project.

    • It’s possible to vibe code certain generic things in a few hours if you’re basically combining common, thoroughly documented, mature building blocks. It’s not going to be production ready or polished but you can get surprisingly far with some things.

      For real work, that phase is like starting from a template or a boilerplate repo. The real work begins after the basics are wired together.

    • Hmm. My experience with it is that a few hours of that will get you a sprint if you're lucky and the prompt hits the happy path. I had… I think two of those, over 5 weeks? I can believe plenty of random people stumble across happy-path examples.

      Exciting when it works, but I think a much more exciting result for people with less experience who may not know that the "works for me" demo is the dreaded "first 90%", and even fairly small projects aren't done until the fifth-to-tenth 90%.

      (That, and that vibe coding in the sense of "no code review" are prone to balls of mud, so you need to be above average at project management to avoid that after a few sprint-equivalents of output).

  • Everyone should read that section. It was really interesting reading about their experiences/challenges getting it all working.

  • I would’ve walked for days to a CompUSA and spent my life savings if there was anything remotely equivalent to this when I was learning C on my Macintosh 4400 in 1997

    People don’t appreciate what they have

    • Did you actually learn C? Be thankful nothing like this existed in 1997.

      A machine generating code you don't understand is not the way to learn a programming language. It's a way to create software without programming.

      These tools can be used as learning assistants, but the vast majority of people don't use them as such. This will lead to a collective degradation of knowledge and skills, and the proliferation of shoddily built software with more issues than anyone relying on these tools will know how to fix. At least people who can actually program will be in demand to fix this mess for years to come.

      18 replies →

I love the interview at the end of the video. The kubectl-inspired CLI, and the feedback for improvements from Claude, as well as the alerts/segmentation feedback.

You could take those, make the tools better, and repeat the experience, and I'd love to see how much better the run would go.

I keep thinking about that when it comes to things like this - the Pokemon thing as well. The quality of the tooling around the AI is only going to become more and more impactful as time goes on. The more you can deterministically figure out on behalf of the AI to provide it with accurate ways of seeing and doing things, the better.

Ditto for humans, of course, that's the great thing about optimizing for AI. It's really just "if a human was using this, what would they need"? Think about it: The whole thing with the paths not being properly connected, a human would have to sit down and really think about it, draw/sketch the layout to visualize and understand what coordinates to do things in. And if you couldn't do that, you too would probably struggle for a while. But if the tool provided you with enough context to understand that a path wasn't connected properly and why, you'd be fine.

  • I see this sentiment of using AI to improve itself a lot but it never seems to work well in practice. At best you end up with a very verbose context that covers all the random edge cases encountered during tasks.

    For this to work the way people expect you’d need to somehow feed this info back into fine tuning rather than just appending to context. Otherwise the model never actually “learns”, you’re just applying heavy handed fudge factors to existing weights through context.

    • I've been playing around with an AI generated knowledge base to grok our code base, I think you need good metrics on how the knowledge base is used. A few things is:

      1. Being systematic. Having a system for adding, improving and maintaining the knoweldge base 2. Having feedback for that system 3. Implementing the feedback into a better system

      I'm pretty happy I have an audit framework and documentation standards. I've refactored the whole knowledge base a few times. In the places where it's overly specific or too narrow in it's scope of use for the retained knowledge, you just have to prune it.

      Any garden has weeds when you lay down fertile soil.

      Sometimes they aren't weeds though, and that's where having a person in the driver's seat is a boon.

Interesting article but it doesn’t actually discuss how well it performs at playing the game. There is in fact a 1.5 hour YouTube video but it woulda been nice for a bit of an outcome postmortem. It’s like “here’s the methods and set up section of a research paper but for the conclusion you need to watch this movie and make your own judgements!”

  • It does discuss that? Basically it has good grasp of finances and often knows what "should" be done, but it struggles with actually building anything beyond placing toilets and hotdog stalls. To be fair, its map interface is not exactly optimal, and a multimodal model might fare quite a bit better at understanding the 2D map (verticality would likely still be a problem).

  • I was told the important part of AI is the generation part, not the verification or quality.

> kept the context above the ~60% remaining level where coding models perform at their absolute best

Maybe this is obvious to Claude users but how do you know your remaining context level? There is UI for this?

Question: There is still a competitive AoE2 community. Will that be destroyed by AI?

  • Dota 2 is a real time strategy game with an arguably more complex micro game (but a far simpler macro game than AoE2, but that's far easier for an AI to master), and OpenAI Five completely destroyed the reigning champions. In 2019. Perfect coordination between units, superhuman mechanical skill, perfect consistency.

    I see no reason why AoE2 would be any different.

    Worth noting that openAI Five was mostly deep reinforcement learning and massive distributed training, it didn't use image to text and an LLM for reasoning about what it sees to make its "decisions". But that wouldn't be a good way to do an AI like that anyway.

    Oh, and humans still play Dota. It's still a highly competitive community. So that wasn't destroyed at all, most teams now use AI to study tactics and strategy.

I corroborate that spatial reasoning is a challenge still. In this case, it's the complexity of the game world, but anyone who has used Codex/Claude with complex UIs in CSS or a native UI library will recognize the shortcomings fairly quickly.

It's been several times that I see ASCII being used initially for these kinds of problems. I think it's because its counter-intuitive, in the sense that for us humans ASCII is text but we tend to forget spacial awareness.

I find this very interesting of us humans interacting with AIs.

Can't wait for someone to let Claude control a runescape character from scratch

  • I've done this! Given the right interface I was surprised at how well it did. Prompted it "You're controlling a character in Old School RuneScape, come up with a goal for yourself, and don't stop working on it until you've achieved it". It decided to fish for and cook 100 lobsters, and it did it pretty much flawlessly!

    Biggest downside was it's inability to see (literally), getting lists of interact-able game objects, NPCs, etc was fine when it decided to do something that didn't require any real-time input. Sailing, or anything that required it to react to what's on screen was pretty much impossible without more tooling to manage the reacting part for it (e.g. tool to navigate automatically to some location).

  • People have been botting on Runescape since the early 2000s. Obviously not quite at the Claude level :). The botting forums were a group of very active and welcoming communities. This is actually what led me to Java programming and computer science more broadly--I wrote custom scripts for my characters.

    I still have some parts of the old Rei-net forum archived on an external somewhere.

This is a cool idea. I wanted to do something like this by adding a Lua API to OpenRCT2 that allows you to manipulate and inspect the game world. Then, you could either provide an LLM agent the ability to write and run scripts in the game, or program a more classic AI using the Lua API. This AI would probably perform much better than an LLM - but an interesting experiment nonetheless to see how a language model can fare in a task it was not trained to do.

"i vibe coded a thing to play video games for me"

i enjoy playing video games my own self. separately, i enjoy writing code for video games. i don't need ai for either of these things.

  • Yeah, but can you use your enjoyment of video games as marketing material to justify a $32B valuation?

  • I actually think it would be pretty fun to code something to play video games for me, it has a lot of overlap with robotics. Separately, I learned about assembly from cheat engine when I was a kid.

  • That’s not the point of this. This was an exercise to measure the strengths and weaknesses of current LLMs in operating a company and managing operations, and the video game was just the simulation engine.

  • That's fine. Tool-assisted speedruns long predate LLMs and they're boring as hell: https://youtu.be/W-MrhVPEqRo

    It's still a neat perspective on how to optimize for super-specific constraints.

  • You do you. I find this exceedingly cool and I think it's a fun new thing to do.

    It's kind of like how people started watching Let's Plays and that turned into Twitch.

    One of the coolest things recently is VTubers in mocap suits using AI performers to do single person improv performances with. It's wild and cool as hell. A single performer creating a vast fantasy world full of characters.

    LLMs and agents playing Pokemon and StarCraft? Also a ton of fun.

Most interesting phrase: "Keeping all four agents busy took a lot of mental bandwidth."

> We don't know any C++ at all, and we vibe-coded the entire project over a few weeks.

And these are the same people that put countless engineers through gauntlets of bizarre interview questions and exotic puzzles to hire engineers.

But when it comes to C++ just vibe it obviously.

  • Oh, I almost didn't realise this is done by a company. I was like this must have costed a lot, didn't realize its just an advertisement for ramp

The opening paragraph I thought was the agent prompt haha

> The park rating is climbing. Your flagship coaster is printing money. Guests are happy, for now. But you know what's coming: the inevitable cascade of breakdowns, the trash piling up by the exits, the queue times spiraling out of control.

Wonder how it would do with Myst.

  • Surely it must have digested plenty of walkthroughs for any game?

    A linear puzzle game like that I would just expect the ai to fly through first time, considering it has probably read 30 years of guides and walkthroughs.

> "Where Claude excels:"

Am I reading a Claude generated summary here?

  • Yes I believe so. Also things like forcing a "key insight" summary after the excels vs struggles section.

    I would take any descriptions like "comprehensive", "sophisticated" etc with a massive grain of salt. But the nuts and bolts of how it was done should be accurate.

  • I thought it sounded more like an ad for Claude written by Anthropic:

    > "This was surprising, but fits with Claude's playful personality and flexible disposition."

    • This sounds as expected to me as a heavy user of Opus. Claude absolutely has a "personality" that is a lot less formal and more willing to "play along" with more creative tasks than Codex. If you want an agent that's prepared to just jump in, it's a plus. If you want an agent that will be careful, considered and plan things out meticulously, it's not always so great - I feel that when you want Claude to do reptitive, tedious tasks, you need to do more work to prevent it from getting "bored" and try to take shortcuts or find something else to do, for example.

      1 reply →

While this seems cool at first, it does not demonstrate superiority over a true custom built AI for rollercoaster tycoon.

It is a curiosity, good for headlines, but the takeaway is if you really need an actual good AI, you are still better off not using an LLM powered solution.

Would a way to take screenshots help? It seems to work for browser testing.

  • I’ve been doing game development and it starts to hallucinate more rapidly when it doesn’t understand things like the direction it placing things or which way the camera is oriented

    Gemini models are a little bit better about spatial reasoning, but we’re still not there yet because these models were not designed to do spatial reasoning they were designed to process text

    In my development, I also use the ascii matrix technique.

    • Spatial awareness was also a huge limitation to Claude playing pokemon.

      It really seems to me that the first AI company getting to implement "spatial awareness" vector tokens and integrating them neatly with the other conventional text, image and sound tokens will be reaping huge rewards. Some are already partnering with robot companies, it's only a matter of time before one of those gets there.

      1 reply →

    • I disagree. With opus I'll screenshot an app and draw all over it like a child with me paint and paste it into the chat - it seems to reasonably understand what I'm asking with my chicken scratch and dimensions.

      As far as 3d I don't have experience however it could be quite awful at that

this is cute but i imagined prompting the ai for a loop-di-loop roller coaster. If this could build complex ride it would be a game changer.

  • yeah I was expecting it to... do something in the game? like build a ride

    not just make up bullshit about events

Interesting this is on the ramp.com domain? I'm surprised in this tech market they can pay devs to hack on Rollercoaster Tycoon. Maybe there's some crossover I'm missing but seems like a sweet gig honestly.

  • yeah really - ramp.com is a credit card/expense platform that surely loses money right now...

    pretty heavy/slow javascript but pretty functional nonetheless...

    • This is brilliant SEO work, I doubt that they loose money with it. With 40h and some additional for the landingpage it might be an expensive link bait, but definitely worth it. Kudos!

      If not for SEO, it’s building quite a good reputation for this company, they got a lot of open positions.

      I’m a big fan of transport tycoon, used to play it for hours as a kid and with Open Transport Tycoon it also might have been a good choice, but maybe not B2C?

This was an interesting application of AI, but I don't really think this is what LLMs excel at. Correct me if I'm wrong.

It was interesting that the poster vibe-coded (I'm assuming) the CTL from scratch; Claude was probably pretty good at doing that, and that task could likely have been completed in an afternoon.

Pairing the CTL with the CLI makes sense, as that's the only way to gain feedback from the game. Claude can't easily do spatial recognition (yet).

A project like this would entirely depend on the game being open source. I've seen some very impressive applications of AI online with closed-source games and entire algorithms dedicated to visual reasoning.

I'm still trying to figure out how this guy: https://www.youtube.com/watch?v=Doec5gxhT_U

Was able to have AI learn to play Mario Kart nearly perfectly. I find his work to be very impressive.

I guess because RCT2 is more data-driven than visually challenging, this solution works well, but having an LLM try to play a racing game sounds like it would be disastrous.

  • Not sure if you clocked this, but the Mario Kart AI is not an LLM. It's a randomized neural net that was trained with reinforcement learning. Apologies if I misread.

next up: Crusader Kings III

  • Crusader Kings is a franchise I really could see LLMs shine. One of the current main criticisms on the game is that there's a lack of events, and that they often don't really feel relevant to your character.

    An LLM could potentially make events far more aimed at your character, and could actually respond to things happening in the world far more than what the game currently does. It could really create some cool emerging gameplay.

    • In general you are right, I expect something like this to appear in the future and it would be cool.

      But isn't the criticism rather that there are too many (as you say repetitive, not relevant) events - its not like there are cool stories emerging from the underlying game mechanics anymore ("grand strategy") but players have to click through these boring predetermined events again and again.

      2 replies →

  • > You’re right, I did accidentally slaughter all the residents of Béziers. I won’t do that again. But I think that you’ll find God knows his own.