
Comment by coder543

4 days ago

> My conversation quickly began to approach the context window for the LLM and some RAG engineering is very necessary to keep the LLM informed about the key parts of your history

Assuming we're talking about GPT-4o, that 128k context window theoretically corresponds to somewhere around 73,000 words. People talk at around 100 words per minute in conversation, so that would be about 730 minutes of context, or about 12 hours. The Gemini models can do up to 2 million tokens of context... extrapolating the same way, that's about 11,400 minutes of context (190 hours), which might be enough?
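Spelled out in Python, with the words-per-token ratio back-solved from that 73,000-word figure (rough planning numbers, nothing more):

```python
# Back-of-the-envelope: how much spoken conversation fits in a context window?
WORDS_PER_TOKEN = 73_000 / 128_000  # ~0.57, back-solved from the estimate above
WORDS_PER_MINUTE = 100              # typical conversational pace

def minutes_of_context(tokens: int) -> float:
    return tokens * WORDS_PER_TOKEN / WORDS_PER_MINUTE

print(minutes_of_context(128_000))    # ~730 minutes, about 12 hours
print(minutes_of_context(2_000_000))  # ~11,400 minutes, about 190 hours
```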

I would say GPT-4o was only good up to about 64k tokens the last time I really tested large context stuff, so let's call that 6 hours of context. In my experience, Gemini's massive context windows are actually able to retain a lot of information... it's not like there's only 64k usable or something. Google has some kind of secret sauce there.

One could imagine architecting the app to use Gemini's Context Caching[0] to keep response times low, since it wouldn't need to re-process the entire session for every response. The application would just spin up a new context cache in the background every 10 minutes or so and delete the old one, reducing the amount of recent conversation that would have to be re-processed each time to generate a response.
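A rough sketch of that rolling-cache loop, using the `google-generativeai` Python library's caching interface from the docs at [0] (the exact API surface may have shifted since, so treat the call signatures as approximate):

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

def refresh_cache(old_cache, transcript_so_far: str):
    """Cache the full transcript so far, then drop the previous cache."""
    new_cache = caching.CachedContent.create(
        model="models/gemini-1.5-pro-001",
        system_instruction="You are the DM for this tabletop session.",
        contents=[transcript_so_far],
        ttl=datetime.timedelta(minutes=15),  # a bit longer than the refresh interval
    )
    if old_cache is not None:
        old_cache.delete()
    return new_cache

def respond(cache, recent_turns: str) -> str:
    """Only the conversation since the last cache refresh gets re-processed."""
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    return model.generate_content(recent_turns).text
```

Run `refresh_cache` on a ~10-minute timer in the background; each response then only pays to process the uncached tail of the session.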

I've just never seen RAG work particularly well... and fitting everything into the context is very nice by comparison.

But, one alternative to RAG would be a form of context compression... you could give the LLM several tools/functions for managing the context. The LLM would be instructed to use these tools to record (and update) the names and details of the characters, places, and items that the players encounter, the important events that have occurred during the game, and information about who the current players are and what items and abilities they have. This "memory" would then be provided in the context in place of a complete conversational record, so the LLM would only receive (for example) the most recent 15 or 30 minutes of conversation, in addition to that memory.
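A minimal sketch of what those memory tools might look like, as plain Python plus an OpenAI-style tool schema (the names `remember`, `forget`, and `memory_snapshot` are made up for illustration):

```python
import json

# The compressed "memory" that replaces the full conversational record.
memory: dict[str, dict] = {}

def remember(name: str, kind: str, facts: str) -> str:
    """Tool the LLM calls to record or update an entity or event."""
    memory[name] = {"kind": kind, "facts": facts}
    return f"Recorded {name}."

def forget(name: str) -> str:
    """Tool the LLM calls to drop something no longer relevant."""
    memory.pop(name, None)
    return f"Forgot {name}."

def memory_snapshot() -> str:
    """Injected into the context each turn alongside the recent transcript."""
    return json.dumps(memory, indent=2)

# Schema advertising `remember` to the model, in OpenAI function-calling style:
REMEMBER_TOOL = {
    "type": "function",
    "function": {
        "name": "remember",
        "description": "Record or update facts about a character, place, "
                       "item, event, or player.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "kind": {"type": "string",
                         "enum": ["character", "place", "item", "event", "player"]},
                "facts": {"type": "string"},
            },
            "required": ["name", "kind", "facts"],
        },
    },
}
```

Each request would then be: system prompt + `memory_snapshot()` + the last 15-30 minutes of transcript.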

> I found the LLM to be too pliable as a DM.

I haven't tried using an LLM as a DM, but in my experience, GPT-4o is happy to hold its ground on things. This isn't like the GPT-3.5 days, where it was a total pushover for anything and everything. I believe the big Gemini models are also stronger in this regard than the old models were. Maybe you just need a stricter prompt that tells the LLM how to behave?

I also think the new trend of "reasoning" models could be very interesting for use cases like this. The model could try to (privately) develop a more cohesive picture of the situation before responding to new developments. You could already do this to some extent by making multiple calls to the LLM, one for the LLM to "think", and then another for the LLM to provide a response that would actually go to the players.
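A sketch of that two-call version with the OpenAI Python client (the prompts here are illustrative, not tuned):

```python
from openai import OpenAI

client = OpenAI()

def dm_respond(transcript: str) -> str:
    # Call 1: private reasoning the players never see.
    thoughts = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a DM. Privately analyze the "
             "situation: motives, consequences, rules implications. Do not "
             "address the players."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # Call 2: the player-facing response, grounded in the hidden analysis.
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a DM. Respond to the players, "
             "staying consistent with your private analysis."},
            {"role": "user", "content": f"Private analysis:\n{thoughts}\n\n"
             f"Recent transcript:\n{transcript}"},
        ],
    ).choices[0].message.content
```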

One could also imagine giving the LLM access to other functions that it could call, such as the ability to play music and sound effects from a pre-defined library of sounds, or to roll the dice using an external random number generator.
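The dice tool is the easy part; the sound side depends entirely on your audio setup, so only its schema is sketched here (the clip names are placeholders):

```python
import random

def roll_dice(count: int, sides: int) -> list[int]:
    """External RNG, so the model can't fudge or hallucinate rolls."""
    return [random.randint(1, sides) for _ in range(count)]

# Schema for the sound tool, in the same style as the memory tools above:
PLAY_SOUND_TOOL = {
    "type": "function",
    "function": {
        "name": "play_sound",
        "description": "Play a clip from the pre-defined sound library.",
        "parameters": {
            "type": "object",
            "properties": {
                "clip": {"type": "string",
                         "enum": ["tavern_music", "sword_clash", "thunder"]},
            },
            "required": ["clip"],
        },
    },
}
```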

> 4. Most importantly, I found that I most enjoy the human connection that I get through DnD and an LLM with a voice doesn't really satisfy that.

Sure, maybe it's not something people actually want... who knows. But, I think it looks pretty fun.[1]

One of the harder things with this would be helping the LLM learn when to speak and when to let the players talk amongst themselves. A simple solution would be a button that the players can press whenever they want, triggering the LLM to respond to what's been said recently, but it would be cool to have a natural flow.
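The button version really is simple; assuming some speech-to-text pipeline feeding utterances in, it's little more than this (`dm_respond` is the two-call helper sketched earlier):

```python
import collections
import time

# Rolling buffer of (timestamp, utterance) pairs from speech-to-text.
recent_turns: collections.deque = collections.deque(maxlen=200)

def on_new_utterance(text: str) -> None:
    recent_turns.append((time.time(), text))

def on_button_press(window_seconds: float = 120.0) -> str:
    """Players hit the 'DM, respond' button: send only the recent window."""
    cutoff = time.time() - window_seconds
    window = "\n".join(text for ts, text in recent_turns if ts >= cutoff)
    return dm_respond(window)
```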

[0]: https://ai.google.dev/gemini-api/docs/caching

[1]: https://www.youtube.com/watch?v=9oBdLUEayGI