Comment by danpalmer
10 days ago
Good post. I recently built a choose-your-own-adventure style educational game at work for a hackathon.
Prompting an LLM to generate and run a game like this gave immediately impressive results: 10 minutes after starting, we had something that looked great. The problem was that the game sucked. It always ended after 3-4 rounds of input, regardless of what the player did. It constantly gave the game away, because it had all the knowledge in its context, and it just didn't have the right flow at all.
What we ended up with after the ~2 days was a whole bunch of Python orchestrating 11 different prompts: no cases where the user could directly interact with the LLM, only one case where we re-used context across multiple queries, and a bunch of (basic) RAG to hide game state from the LLM until the user caused it to be revealed through their actions.
LLMs are best used as small cogs in a bigger machine. Very capable, nearly magic cogs, but orchestrated by a lot of regular engineering work.
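To give a flavour of the final shape, here's a heavily simplified sketch. This is not our actual code; the prompt wording and helper names are invented for illustration:

    # Heavily simplified sketch; prompt text and helper names are invented.
    # Plain Python owns the rules and the full game state, not the model.
    # Expected state shape: {"facts": [...], "discovered": set(), ...}

    def call_llm(prompt: str) -> str:
        """Stand-in for one stateless completion call (no shared chat history)."""
        raise NotImplementedError("wire up an LLM client here")

    def classify_action(player_input: str) -> str:
        # One narrow prompt whose only job is to map free text to a known verb.
        return call_llm(
            "Classify this player input as one of: move, inspect, talk, use.\n"
            f"Input: {player_input}\nAnswer with a single word."
        )

    def apply_rules(state: dict, action: str) -> tuple[str, dict]:
        """Deterministic game logic lives in ordinary code, not in a prompt."""
        if action == "inspect":
            state["discovered"].add("clue_1")  # inspecting unlocks a fact
            return "You find something tucked under the desk.", state
        return "Nothing happens.", state

    def run_turn(state: dict, player_input: str) -> dict:
        action = classify_action(player_input)
        outcome, state = apply_rules(state, action)

        # Expose only facts the player has unlocked, so the narrator prompt
        # physically cannot give the game away.
        visible = [f["text"] for f in state["facts"] if f["id"] in state["discovered"]]

        state["last_scene"] = call_llm(
            "Narrate this outcome in second person, two short paragraphs.\n"
            f"Outcome: {outcome}\nKnown facts: {visible}"
        )
        return state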
I'm confused. Did you ask the LLM to write the game in code? Or did the LLM run the entire game via inference?
Why would you expect the LLM to generate the entire game from a few prompts and have it work exactly the way you want? Did your prompt specify the exact conditions for the game?
> Or did the LLM run the entire game via inference?
This. This was our 10-minute prototype, with a prompt along the lines of "You're running a CYOA game about this scenario...".
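In other words, more or less the whole prototype was one system prompt and a chat loop, something like this (paraphrased; `chat` stands in for whichever client you use):

    # The entire 10-minute prototype, more or less: one system prompt, one
    # loop, the model holding all state and all secrets in its own history.

    def chat(messages: list[dict]) -> str:
        raise NotImplementedError("one LLM chat-completion call")

    messages = [{"role": "system",
                 "content": "You're running a CYOA game about this scenario..."}]
    while True:
        reply = chat(messages)  # full history, full spoilers, every turn
        print(reply)
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": input("> ")})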
> Why would you expect the LLM to generate the entire game from a few prompts
I did not expect it to work, and indeed it didn't. However, why it didn't work wasn't obvious to the whole group, and much of the iteration process during the hackathon was breaking things down into smaller components so that we could retain more control over the gameplay.
One surprising thing I hinted at there was using RAG not for its ability to expose more info to the model than fits in context, but for its ability to hide info from the model until it's "discovered" in some way. I hadn't considered that before, and it was fun to figure out.
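In sketch form it looks something like this. The corpus and trigger tags below are made up, and a real version would also rank by similarity to the query; the filter on discovery is the interesting part:

    # RAG used for hiding rather than recall: facts only become retrievable
    # once the player has fired their discovery trigger. (Illustrative data.)

    FACTS = [
        {"id": "f1", "text": "The butler has a key to the cellar.", "trigger": "talk_butler"},
        {"id": "f2", "text": "The cellar hides the stolen painting.", "trigger": "open_cellar"},
        {"id": "f3", "text": "The manor was built in 1882.", "trigger": None},  # public
    ]

    def retrieve(query: str, events: set[str]) -> list[str]:
        """Return only facts whose trigger the player has already fired."""
        unlocked = [f for f in FACTS if f["trigger"] is None or f["trigger"] in events]
        # (A real version would rank `unlocked` by similarity to `query`.)
        return [f["text"] for f in unlocked]

    # Early game: the model literally cannot spoil the cellar.
    print(retrieve("what do I know?", events=set()))
    # After talking to the butler, the key fact becomes retrievable.
    print(retrieve("what do I know?", events={"talk_butler"}))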
> using RAG not for its ability to expose more info to the model than fits in context, but for its ability to hide info from the model until it's "discovered" in some way
Would you be willing to expand on this?
I've run numerous interactive text adventures through ChatGPT as well, and while it's great at coming up with scenarios and taking the story in surprising directions, it sucks at maintaining a coherent narrative. The stories are fraught with continuity errors: the time of day seems to be decided at random, and it frequently forgets important things I did or items I picked up earlier. It also needs to be constantly reminded of rules I gave it in the initial prompt. Basically, the stuff the article refers to as "maintaining state."
I've become wary of trusting it with any task that takes more than 5-10 prompts to achieve. The more I need to prompt it, the more frequently it hallucinates.
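My understanding of the fix the article is gesturing at: keep the state in ordinary code and re-send a compact summary of it every turn, rather than hoping the model remembers. A minimal sketch, with all names invented:

    # The program, not the chat history, is the source of truth. Each turn we
    # rebuild a compact state block and put it at the top of the prompt.

    state = {
        "time_of_day": "dusk",
        "inventory": ["lantern", "rope"],
        "rules": ["no magic", "the door stays locked without the key"],
    }

    def build_prompt(state: dict, player_input: str) -> str:
        return (
            f"Time: {state['time_of_day']}. "
            f"Inventory: {', '.join(state['inventory'])}. "
            f"Rules: {'; '.join(state['rules'])}.\n"
            f"Player: {player_input}\n"
            "Continue the story without contradicting the facts above."
        )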
> What we ended up with after the ~2 days was a whole bunch of Python orchestrating 11 different prompts: no cases where the user could directly interact with the LLM, only one case where we re-used context across multiple queries, and a bunch of (basic) RAG to hide game state from the LLM until the user caused it to be revealed through their actions.
Super cool! I'm the author of the article. Send me an email if you ever just wanna chat about this on a call.