Comment by parpfish
2 months ago
Project manager: “great news! Our model can count Rs in strawberry!”
Dev: “What about Bs in blueberry?”
PM: “you’ll need to open a new jira ticket”
This is likely literally what happens at these companies, i.e., they have teams that monitor Twitter/social media for failures and fix them with data patches.
This is why I don't trust any of the benchmarks LLM enthusiasts point to when they say "see, the model is getting better". I have zero confidence that the AI companies are trying to make the system genuinely better rather than using the measure as a target.
That reminds me of the time I found thread-safety-breaking changes in Intel's custom Android framework that were clearly designed to cheat benchmarks.