Comment by theyCallMeSwift
5 hours ago
I love this idea, but my hypothesis is that 90% of the agents people actually use today would inadvertently fail this test (a false negative).
Industry best practice and the standard implementation for most agents right now is to do web browsing/fetching via subagents: their output is summarized by a cheaper model and then passed back to the parent. Unless the actual content the subagent saw is preserved, it's very unlikely the `CANARY-` strings would survive into the output.
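The failure mode described above can be sketched as follows. This is a toy, hypothetical illustration: `fetch_page` and `summarize_for_parent` are made-up names standing in for a subagent's web fetch and the cheaper summarization model, and the "summarizer" is just a keyword-based extractive filter.

```python
# Hypothetical sketch of the fetch-via-subagent pattern.
# All names and content here are illustrative, not any real agent API.

PAGE = (
    "The company announced Q3 results today. "
    "Revenue grew 12 percent year over year. "
    "CANARY-7f3a9c1e appears in the page body. "
    "The CEO credited strong enterprise demand."
)

def fetch_page(url: str) -> str:
    """Stand-in for the subagent's raw web fetch."""
    return PAGE

def summarize_for_parent(content: str, query: str) -> str:
    """Toy extractive 'cheaper model': keep only sentences that share a
    keyword with the query. Query-irrelevant text is silently dropped."""
    keywords = {w.lower().strip(".?") for w in query.split()}
    kept = [
        s for s in content.split(". ")
        if keywords & {w.lower().strip(".") for w in s.split()}
    ]
    return ". ".join(kept)

raw = fetch_page("https://example.com/q3-results")
summary = summarize_for_parent(raw, "What was revenue growth?")

print("CANARY-7f3a9c1e" in raw)      # True  -> the subagent saw it
print("CANARY-7f3a9c1e" in summary)  # False -> the parent never does
```

The canary is present in what the subagent fetched, but because it is irrelevant to the query, the summarization step strips it before anything reaches the parent model.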
Any thoughts on how you'd change the test structure with this in mind?
Hey there - I'm the test author, and you've hit on one of the main points. Summarization/relevance-based content return is indeed a consideration on some of the agent platforms (although I've found others actually do better here than I expected!), and that's part of what I'm trying to drive home to folks who aren't as familiar with these systems.
I structured it this way intentionally, because this is the finding. Most people are surprised that agents aren't 'seeing' everything that's there, and get frustrated when an agent says something isn't there when it clearly is. Raising awareness of this is, to me, one of the main points of the exercise.
This isn't best practice. It's certainly not industry best practice. It would fail some pretty basic tests, like these, resulting in poor UX and poor reviews.
I think it describes how we can generally picture Claude and OpenAI working, but it neglects implementation details that are hard to see from their blog posts, e.g. a web search tool vs. a web get tool.
(source: maintained a multi-provider x llama.cpp LLM client for 2.5+ years and counting)
Yeah, my colleague and I have been seeing in testing how much of a problem this actually is in practice. It has been surprising, and a little dismaying, to see how badly this degrades content retrieval and how poor the resulting UX is.