Comment by sigmoid10

2 months ago

Sooo... it can play Pokemon. Feels like they had to throw that in after Google IO yesterday. But the real question is now can it beat the game including the Elite Four and the Champion. That was pretty impressive for the new Gemini model.

minimaxir 2 months ago

That Google IO slide was somewhat misleading as the maintainer of Gemini Plays Pokemon had a much better agentic harness that was constantly iterated upon throughout the runtime (e.g. the maintainer had to give specific instructions on how to use Strength to get past Victory Road), unlike Claude Plays Pokemon.

The Elite Four/Champion was a non-issue in comparison especially when you have a lv. 81 Blastoise.

fourier456 2 months ago

Okay, wait though like I want to know the full transcript because that actually is a better / softer benchmark if you measure in terms of the necessary human input.

archon1410 2 months ago

Claude Plays Pokemon was the original concept and inspiration behind "Gemini Plays Pokemon". Gemini arguably only did better because it had access to a much better agent harness and was being actively developed during the run.

See: https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-...

montebicyclelo 2 months ago
Not sure "original concept" is quite right, given it had been tried earlier, e.g. here's a 2023 attempt to get gpt-4-vision to play pokemon, (it didn't really work, but it's clearly "the concept")
https://x.com/sidradcliffe/status/1722355983643525427
- archon1410 2 months ago
  
  I see, I wasn't aware of that. The earliest attempt I knew of was from May 2024,[1] while this gpt-4-vision attempt is from November 2023. I guess Claude Plays Pokemon was the first attempt that had any real success (won a badge), and got a lot of attention over its entertaining "chain-of-thought".
  [1] https://community.aws/content/2gbBSofaMK7IDUev2wcUbqQXTK6/ca...
silvr 2 months ago

I disagree - this is all an homage to Twitch Plays Pokemon, which was a noteworthy moment in internet culture/history.
https://en.wikipedia.org/wiki/Twitch_Plays_Pok%C3%A9mon

throwaway314155 2 months ago

Gemini can beat the game?

mxwsn 2 months ago
Gemini has already beaten it, but using a different and notably more helpful harness. The creator has said they think harness design is the most important factor right now, and that the results don't mean much for comparing Claude to Gemini.
- throwaway314155 2 months ago
  
  Way offtopic to TFA now, but isn't using an improved harness a bit like saying "I'm going to hardcode as many priors as possible into this thing so it succeeds regardless of its ability to strategize, plan and execute"?
  
klohto 2 months ago

2 weeks ago

hansmayer 2 months ago

Right, but on the other hand... how is it even useful? Let's say it can beat the game, so what? So these models can (kind of) summarise or write my emails, which is something I neither want nor need; they produce mountains of sloppy code, which I would end up having to fix; and finally they can play a game? Where is the killer app? The gaming approach was exactly the premise of the original AI efforts in the 1960s: that teaching computers to play chess and other 'brainy' games would somehow lead to the development of real AI. It ended, as we know, in the AI nuclear winter.

samrus 2 months ago
from a foundational research perspective, the pokemon benchmark is one of the most important ones.
these models are trained on a static task, text generation, which is to say the state they are operating in does not change as they operate. but now that they are out we are implicitly demanding they do dynamic tasks like coding, navigation, operating in a market, or playing games. these are tasks where your state changes as you operate
an example would be that as these models predict the next word, the ground truth of any further words doesn't change. if a model misinterprets the word "bank" in the sentence "i went to the bank" as a river bank rather than a financial bank, the later ground truth won't change: if the text was talking about a visit to the financial bank before, it will still be talking about that regardless of the model's misinterpretation. But if a model takes a wrong turn on the road, or makes a weird buy in the stock market, the environment will react and change, and suddenly what it should have done as the (n+1)th move before isn't the right move anymore: it needs to figure out a route off the freeway first, or deal with the FOMO bull rush it caused by mistakenly buying a lot of stock
we need to push against these limits to set the stage for the next evolution of AI: RL-based models that are trained in dynamic, reactive environments in the first place
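A toy sketch of the static-versus-dynamic distinction described above. It is not from the original comment; the function names, the five-word sentence, and the number-line "environment" are all made up for illustration, and real next-token training and RL environments are far more involved.

```python
# Hypothetical toy example: every name and value here is an illustrative assumption.

def static_errors(predictions, targets):
    """Next-word style scoring: every target is fixed in advance,
    so one wrong prediction does not change any later target."""
    return sum(p != t for p, t in zip(predictions, targets))


def dynamic_rollout(actions, start=0, goal=3):
    """Walk along a number line. Each action (+1 or -1) changes the state,
    so the right next action depends on every earlier action."""
    position = start
    for step, action in enumerate(actions, start=1):
        position += action
        if position == goal:
            return step  # number of steps taken to reach the goal
    return None  # goal never reached with these actions


targets = ["i", "went", "to", "the", "bank"]
one_mistake = ["i", "went", "at", "the", "bank"]
static_mistakes = static_errors(one_mistake, targets)

planned = [+1, +1, +1]
with_wrong_turn = [+1, -1, +1]      # same plan with one wrong move, no replanning
recovered = [+1, -1, +1, +1, +1]    # replanned after the wrong move

print(static_mistakes)                   # 1: the error stays local
print(dynamic_rollout(planned))          # 3
print(dynamic_rollout(with_wrong_turn))  # None: the old plan no longer reaches the goal
print(dynamic_rollout(recovered))        # 5
```

In the static case the single mistake costs exactly one error and the remaining targets are unaffected. In the dynamic case the same single mistake invalidates the rest of the plan, and the agent only reaches the goal if it notices the changed state and replans.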
- hansmayer 2 months ago
  
  Honestly I have no idea what this is supposed to mean, and the high verbosity of whatever it is trying to prove is not helping it. To repeat: We already tried making computers play games. Ever heard of Deep Blue, and ever heard of it again since the early 2000s?
  
j_maffe 2 months ago
> Where is the killer app?
My man, ChatGPT is the sixth most visited website in the world right now.
- hansmayer 2 months ago
  
  But I did not ask "what is the sixth most visited website in the world right now?", did I? I asked what the killer app was here. I am afraid vague and unrelated KPIs will not help here; otherwise we may as well compare ChatGPT and PornHub based on the number of visits, as you seem to suggest.
  
minimaxir 2 months ago

It's a fun benchmark, like simonw's pelican riding a bike. Sometimes fun is the best metric.
lechatonnoir 2 months ago
This is a weirdly cherry-picked example. The gaming approach was also the premise of DeepMind's AI efforts in 2016, which was nine years ago. Regardless of what you think about the utility of text (code), video, audio, and image generation, surely you think that their progress on the protein-folding problem and weather prediction has been useful to society?
What counts as a killer app to you? Can you name one?
- hansmayer 2 months ago
  
  Well the example came from their own press release, so who cherry-picked it? Why should I name the next killer app? Isn't that something that we just recognise the moment it shows up, like we did with the www and e-commerce? It's not something a committee staffed by a bunch of MBAs defines ahead of time, as is currently the case with the use cases that are being pushed into our faces every day. I would applaud and cheer if their efforts were focused on the scientific problems that you mentioned. Unfortunately for us, this is not what the bean-counters heading all major tech corps see as useful. Do you honestly think any one of them has the benefit of society at heart? No, they want to make money by selling you bullshit products like e-mail summarising and such. Perhaps in the process also to get rid of software developers altogether. Then once we as a society lose the ability to do anything on our own, relying on these bullshit machines, they gain not only the ability to enshittify their products and squeeze out that extra buck, but also a "world of possibilities" (for the rich) in terms of societal control. But sure, at least you will still have your, what is it now, two-day delivery from Amazon and a handholding tool to help you speak, write and do anything meaningful as a human being.
  
- rxtexit 2 months ago
  
  The whole idea of a "killer app" is stupid.
  It is a dismissive rhetorical device to prove a wrong point on an internet forum such as this that has nothing to do with reality.
  