
Comment by mnky9800n

1 day ago

This is why I made Zork bench. Zork, the text adventure game, is in the training data for LLMs. It's also deterministic. So it should be easy for an LLM to play and complete, yet they fail to. Understanding why is the goal of Zork bench.

https://github.com/mnky9800n/zork-bench
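For the curious, the core loop is conceptually simple: read the game's output, hand the transcript to the model, send the model's reply back as a command, repeat. A minimal sketch of that loop (not the actual zork-bench code; dfrotz, zork1.z5, and llm_next_command are all stand-ins here):

    # Minimal harness sketch, not the real zork-bench code.
    # Assumes the dumb-terminal interpreter dfrotz and a story file
    # zork1.z5 are installed; llm_next_command is a hypothetical model call.
    import pexpect

    def llm_next_command(transcript: str) -> str:
        """Hypothetical: ask the model for its next command given the transcript."""
        raise NotImplementedError

    def play(max_turns: int = 200) -> str:
        game = pexpect.spawn("dfrotz zork1.z5", encoding="utf-8", timeout=10)
        transcript = ""
        for _ in range(max_turns):
            game.expect(">")            # wait for the game's command prompt
            transcript += game.before   # everything the game printed since last turn
            command = llm_next_command(transcript)
            transcript += "> " + command + "\n"
            game.sendline(command)
        game.close()
        return transcript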

I have worked on similar problems. See e.g. [1].

The LLMs I have tested have terrible world models and poor intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary. (One way to quantify the world-model part is sketched below.)

[1]: https://entropicthoughts.com/updated-llm-benchmark

(more detailed descriptions are available in the earlier evaluations referenced from there)
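One way to put a number on the world-model complaint: before executing each command, ask the model to predict what the game will print, then score the prediction against what actually happens. A hedged sketch (predict_outcome is hypothetical, not from zork-bench or [1], and the overlap score is deliberately crude):

    # World-model probe sketch; function names are illustrative only.
    def predict_outcome(transcript: str, command: str) -> str:
        """Hypothetical: ask the model what the game will print after `command`."""
        raise NotImplementedError

    def similarity(a: str, b: str) -> float:
        """Crude Jaccard word overlap between predicted and actual responses."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)

    def world_model_score(transcript: str, command: str, actual: str) -> float:
        predicted = predict_outcome(transcript, command)
        return similarity(predicted, actual)

A model with a decent world model should score high on mundane actions ("go north", "take lamp") long before it can finish the game.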

  • I'm going to ignore all that and tell my developers working in complicated codebases that they have to use AI. I'm sure comprehending side effects in a world-building text adventure is completely different from understanding spaghetti code.

The open models only give the SOTA models a run for their money on gameable benchmarks. On the semi-private ARC-AGI-2 sets they do absolutely awfully (<10%, while SOTA is at ~80%).

It might be too expensive, but I would be interested in benchmark results for the current crop of SOTA models.

  • Have the open models been tried? When I look at the leaderboard [0], the only Qwen model I see is 235B-A22B. I wouldn't expect an MoE model to do particularly well: from what I've seen (thinking mainly of a leaderboard trying to measure EQ [1]), MoE models are at a distinct disadvantage to dense models when it comes to complex tasks that aren't software benchmark targets.

    [0] https://arcprize.org/leaderboard

    [1] https://eqbench.com/index.html

Actually, the Zorks weren't deterministic, especially Zork II. The Wizard could F you over pretty badly if he appeared at an inopportune time.
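For benchmarking purposes that's fixable, though: Z-machine interpreters generally let you fix the random seed (frotz's man page documents an -s option; verify against your build), so even the Wizard's appearances become reproducible. Reusing the hypothetical harness above:

    # Seeding the interpreter's RNG makes Zork II runs reproducible;
    # -s is frotz's documented seed option (check your interpreter's docs).
    import pexpect
    game = pexpect.spawn("dfrotz -s 42 zork2.z5", encoding="utf-8", timeout=10)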