Comment by Ukv

20 days ago

Are these IPs actually from OpenAI/etc. (https://openai.com/gptbot.json), or is it possibly something else masquerading as these bots? The real GPTBot/Amazonbot/etc. claim to obey robots.txt, and switching to a non-bot UA string seems extra questionable behaviour.

I exclude all the published LLM User-Agents and have a content honeypot on my website. Google obeys, but ChatGPT and Bing still clearly know the content of the honeypot.

  • What's the purpose of the honeypot? Poisoning the LLM or identifying useragents/IPs that shouldn't be seeing it?

  • how do you determine that they know the content of the honeypot?

    • Presumably the "honeypot" is an obscured link that humans won't click (e.g. tiny white text on a white background in a forgotten corner of the page) but scrapers will. Then you can determine whether a given IP visited the link.

      2 replies →