Comment by demosthanos

19 hours ago

I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.

Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...

10 comments


diggan 18 hours ago

Yeah, this is the problem with benchmarks where the questions/problems are public. They're valuable for some months, until they bleed into the training set. I'm certain a lot of the "improvements" we're seeing are just benchmarks leaking into the training set.

travisgriggs 17 hours ago
That’s ok, once bicycle “riding” pelicans become normative, we can ask it for images of pelicans humping bicycles.
The number of subject-verb-objects is near infinite. All are imaginable, but most are not plausible. A plausibility machine (LLM) will struggle with the implausible, until it can abstract well.
- zahlman 15 hours ago
  
  I can't fathom this working, simply because building a model that relates the word "ride" to "hump" seems like something that would be orders of magnitude easier for an LLM than visualizing the result of SVG rendering.
- diggan 16 hours ago
  
  > The number of subject-verb-objects is near infinite. All are imaginable, but most are not plausible
  Until there are enough unique/new subject-verb-object examples/benchmarks that the trained model actually generalizes, just like you did. (Public) benchmarks need to constantly evolve; otherwise they stop being useful.
  

throwaway31131 16 hours ago

I’d say it doesn’t really matter. There is no universally good benchmark and really they should only be used to answer very specific questions which may or may not be relevant to you.

Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.

6LLvveMx2koXfwn 18 hours ago

I would definitely say he had no intention of doing that and was doubling down on the original joke.

colecut 18 hours ago

The road to hell is paved with the best intentions.
Clarification: I enjoyed the pelican on a bike and don't think it's that bad =p

telotortium 9 hours ago

Yeah, Simon needs to release a new benchmark under a pen name, like Stephen King did with Richard Bachman.