Comment by grey-area

6 days ago

I tried this a while ago and haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data. After a bit of tidying up I ended up with a fairly decent game in HTML and JS, though it felt like several coding paradigms smooshed together rather than a coherent whole. It mostly worked. Not something I’d want to maintain, but it was impressive at the time.

They were able to one-shot famous games (like Asteroids or Pong), I suspect because they had been trained on multiple versions of those games. So, like reproducing Harry Potter, with the right prompt they were able to produce a license-stripped version of code they had seen. I tried another arcade game like Frogger, and it failed really badly and took a lot longer; I never got it working.

The whole exercise left me feeling they have a long way to go. Even now, I don’t see how anyone could think they would replace SWEs unless they didn’t look at the code produced.


vessenes 6 days ago

Out of curiosity: what harness did you use, and what model? And how are you prompting? In my mind, a prompt like:

“You’re going to make frogger in javascript. I want a complete clone of functionality for level 1, with amazing 80s era pixel art sprites. I’m super lazy, so you’re going to have to test everything, right from the start. Pick a test harness, write the tests, including tests for having amazing graphics, gameplay, input, UI, sounds, etc, and write a full workplan, then work through that workplan, in parallel where you can. The workplan should emphasize getting a stripped down version up immediately and have workstreams for all the major requirements after that. Add a final test that assesses how fun the game is by reviewing a real video of a test run. Loop on that final test until you can’t improve things any more.”

should produce something playable with no further input. As you say, I’m not sure it would produce a codebase we’d want to look at or work on. But I’d be surprised if this weren’t successful.

grey-area 6 days ago
Sure, give it a go. Perhaps it will work better now with frontier models; I haven't tried it in a while (this was a year ago, and things have improved since then). I'm not sure what tests for having amazing graphics, gameplay, input, UI, sounds, etc. would look like, but it would be interesting to see the results!
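For the gameplay and input part at least, one plausible shape is a plain unit test against pure game-logic functions. This is only a minimal sketch under my own assumptions: `GRID`, `stepFrog`, `isHit`, and `runChecks` are hypothetical names, the 13x13 grid is made up, and nothing here is taken from what either agent actually generated.

```javascript
// Hypothetical game-logic sketch: none of these names come from the thread.
const GRID = { cols: 13, rows: 13 }; // assumed grid size

// Move the frog one cell in response to a key, clamped to the grid.
function stepFrog(frog, key) {
  const delta = {
    ArrowUp: [0, -1],
    ArrowDown: [0, 1],
    ArrowLeft: [-1, 0],
    ArrowRight: [1, 0],
  }[key] || [0, 0]; // unknown keys leave the frog where it is
  const clamp = (v, max) => Math.min(Math.max(v, 0), max - 1);
  return {
    x: clamp(frog.x + delta[0], GRID.cols),
    y: clamp(frog.y + delta[1], GRID.rows),
  };
}

// A car occupies cells [x, x + length) on its row; the frog is hit if it shares one.
function isHit(frog, cars) {
  return cars.some(
    (car) => car.row === frog.y && frog.x >= car.x && frog.x < car.x + car.length
  );
}

// Tiny test runner: returns the names of failed checks (empty array means pass).
function runChecks() {
  const failures = [];
  const check = (name, ok) => {
    if (!ok) failures.push(name);
  };

  const start = { x: 6, y: 12 };
  const up = stepFrog(start, "ArrowUp");
  check("up moves one row", up.x === 6 && up.y === 11);
  check("cannot leave bottom edge", stepFrog(start, "ArrowDown").y === 12);
  check("unknown key is ignored", stepFrog(start, "KeyQ").y === 12);

  const cars = [{ row: 11, x: 5, length: 2 }];
  check("frog under car is hit", isHit(up, cars));
  check("frog beside car is safe", !isHit({ x: 7, y: 11 }, cars));
  return failures;
}

console.log(runChecks());
```

Checks like these only cover movement and collision rules. Tests for "amazing graphics" or "fun" would need something else entirely (screenshots, video review, a scoring rubric), and this sketch makes no attempt at that.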
- vessenes 6 days ago
  
  okay hold my beer. both claude and codex running now.
  EDIT: both agents took about 20 minutes. I used that exact prompt in a clean directory for each, and then said "deploy to netlify" - so a total of two prompts.
  Codex: https://astounding-bavarois-27b5a2.netlify.app
  Claude: http://strong-hotteok-91dfb0.netlify.app
  Netlify is having trouble claiming the Claude project, so if you need a password it's "My-Drop-Site"
  FYI, Claude rated itself 7.7/10 for fun, and Codex 98/100 during the fun test loop. As you'll see if you poke at them, Claude needs a physics bug fix round. But I think these both did about what I would have expected.
  