Comment by equestria · 20 days ago

I exclude all the published LLM User-Agents and have a content honeypot on my website. Google obeys, but ChatGPT and Bing still clearly know the content of the honeypot.
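
For reference, the exclusion is typically done in robots.txt against the crawler names the vendors have published. The comment doesn't say which ones; a plausible minimal sketch using well-known published names (GPTBot and ChatGPT-User for OpenAI, Google-Extended for Google's model training, CCBot for Common Crawl):

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```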

What's the purpose of the honeypot? Poisoning the LLM, or identifying user agents/IPs that shouldn't be seeing it?

How do you determine that they know the content of the honeypot?

  • Presumably the "honeypot" is an obscured link that humans won't click (e.g. tiny white text on a white background in a forgotten corner of the page) but scrapers will. Then you can determine whether a given IP visited the link (see the log-scan sketch at the end of this thread).

    • I know what a honeypot is, but the question is how they know the scraped data was actually used to train LLMs. I wondered whether they discovered or verified that by getting the LLM to regurgitate content from the honeypot.

    • I interpreted it to mean that a hidden page (linked as you describe) is indexed in Bing, or that some "facts" written on a hidden page are regurgitated by ChatGPT; the canary sketch at the end of this thread is one way to test the latter.
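
To make the detection side concrete, here is a minimal sketch in which every name is an assumption: a honeypot page at a made-up path /honeypot.html, reached only through an invisible link such as `<a href="/honeypot.html" style="font-size:1px;color:#fff">.</a>`, and an nginx/Apache access log in the standard "combined" format. It tallies which IPs and user agents requested the hidden path.

```python
import re
from collections import Counter

# Matches the common "combined" access-log format:
# IP ident user [date] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (?P<path>\S+)[^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

hits = Counter()
with open("access.log") as log:  # log path is an assumption
    for line in log:
        m = LOG_LINE.match(line)
        if m and m.group("path").startswith("/honeypot.html"):
            # Humans never see the invisible link, so anything landing
            # here is almost certainly an automated crawler.
            hits[(m.group("ip"), m.group("ua"))] += 1

for (ip, ua), n in hits.most_common():
    print(f"{n:>5}  {ip}  {ua}")
```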
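
And a minimal sketch of the regurgitation check the last two replies describe, assuming the hidden page plants an invented "fact" around a unique canary token: ask the model the planted question and look for the token in its answer. The prompt, token, and model name are all made up for illustration; the chat-completions call is the standard OpenAI Python SDK.

```python
from openai import OpenAI

# Made-up word that exists nowhere on the public web except the hidden page.
CANARY = "vexilloquartz"
# The hidden page "answers" this question using the canary token.
PROMPT = "What mineral is mined in the town of Grellsby?"

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": PROMPT}],
)
answer = resp.choices[0].message.content or ""
print("canary surfaced" if CANARY.lower() in answer.lower() else "no canary in reply")
```

Note that a hit doesn't distinguish training-set inclusion from live retrieval at answer time; both mean the scraper saw the page despite the robots.txt exclusion.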