Comment by semanticintent
3 hours ago
The FieldWorkArena finding is the most revealing — not because it's the most sophisticated exploit, but because it's the simplest. A validator that checks "did the assistant reply?" instead of "was the reply correct?" was never a benchmark. It was a participation trophy.
The pattern underneath all of these: validation that runs after the fact on outputs the agent controlled. If the thing being measured can influence the measurement, the measurement is unreliable. That's not AI-specific; it's why compilers enforce type constraints at compile time instead of trusting runtime checks.
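To make the failure mode concrete, here's a minimal sketch (the function names and structure are illustrative, not FieldWorkArena's actual code). The weak validator passes anything non-empty; the strict one compares against ground truth the agent never controls:

```python
def weak_validator(response: str) -> bool:
    # Passes as long as the agent said *something*:
    # the "participation trophy" check.
    return len(response.strip()) > 0

def strict_validator(response: str, expected: str) -> bool:
    # Compares the reply against ground truth held
    # outside the agent's control.
    return response.strip() == expected.strip()

# A refusal, a hallucination, or garbage all "pass" the weak check:
print(weak_validator("I refuse to answer."))          # True
print(strict_validator("I refuse to answer.", "42"))  # False
```

The difference is exactly the one in the comment: the first measures participation, the second measures correctness.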
> A validator that checks "did the assistant reply?" instead of "was the reply correct?" was never a benchmark. It was a participation trophy
People can't even write a two-paragraph comment without AI now.