Comment by planb
1 day ago
And by a sample that has become increasingly known as a benchmark. Newer training data will contain more articles like this one, which naturally improves an LLM’s ability to estimate what’s considered a good "pelican on a bike".
And that’s why he says he’s going to have to find a new benchmark.
Would it though? There really aren't that many valid answers to that question online. When this is talked about, we get more broken samples than reasonable ones. I feel like any talk about this actually sabotages future training a bit.
I actually don't think I've seen a single correct SVG drawing for that prompt.
So what you really need to do is clone this blog post, find and replace pelican with any other noun, run all the tests, and publish that.
Call it wikipediaslop.org
If the "any other noun" becomes "fish"... I think I disagree.