Comment by Ukv
20 days ago
Are these IPs actually from OpenAI/etc. (https://openai.com/gptbot.json), or is it possibly something else masquerading as these bots? The real GPTBot/Amazonbot/etc. claim to obey robots.txt, and switching to a non-bot UA string seems like especially questionable behaviour.
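For anyone wanting to check this on their own logs, here is a rough sketch of comparing a client IP against a list of published CIDR ranges. The helper name `ip_in_published_ranges` is made up for illustration, and the ranges in the usage example are reserved documentation addresses, not real OpenAI ranges; you would substitute whatever the operator actually publishes.

```python
import ipaddress

def ip_in_published_ranges(ip, cidr_ranges):
    """Return True if `ip` falls inside any of the given CIDR ranges.

    `cidr_ranges` is assumed to be a plain list of CIDR strings that you
    have already extracted from the operator's published list.
    """
    addr = ipaddress.ip_address(ip)
    for cidr in cidr_ranges:
        net = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            return True
    return False

# Placeholder ranges (RFC 5737 / RFC 3849 documentation blocks), not real bot ranges.
example_ranges = ["192.0.2.0/24", "2001:db8::/32"]
claims_to_be_bot = ip_in_published_ranges("192.0.2.10", example_ranges)
```

A request that presents a bot User-Agent but fails this check is likely something else masquerading as the bot; a request from inside the ranges with a non-bot User-Agent would be the more worrying case raised above.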
I exclude all the published LLM User-Agents and have a content honeypot on my website. Google obeys, but ChatGPT and Bing still clearly know the content of the honeypot.
What's the purpose of the honeypot? Poisoning the LLM or identifying useragents/IPs that shouldn't be seeing it?
How do you determine that they know the content of the honeypot?
Presumably the "honeypot" is an obscured link that humans won't click (e.g. tiny white text on a white background in a forgotten corner of the page) but scrapers will. Then you can determine whether a given IP visited the link.
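If the honeypot works that way, finding who visited it is a matter of scanning the access log. A minimal sketch, assuming the server writes the common "combined" log format; the function name and the honeypot path are hypothetical:

```python
import re
from collections import defaultdict

# Matches the "combined" access log format (an assumption about the server setup).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

def honeypot_visitors(log_lines, honeypot_path):
    """Map each client IP that requested `honeypot_path` to the User-Agents it sent."""
    hits = defaultdict(set)
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        # Ignore any query string when comparing paths.
        if m.group("path").split("?", 1)[0] == honeypot_path:
            hits[m.group("ip")].add(m.group("ua"))
    return dict(hits)

# Hypothetical log lines using documentation IP addresses.
sample_log = [
    '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /hidden-bio HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Jan/2025:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "ExampleBot/1.0"',
]
visitors = honeypot_visitors(sample_log, "/hidden-bio")
```

This only shows who fetched the page. It says nothing about whether the content later ended up in a model, which is a separate question.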
Interesting - do you have a link?
Of course, but I'd rather not share it for obvious reasons. It is a nonsensical biography of a non-existing person.
I don't trust OpenAI, and I don't know why anyone else would at this point.