Comment by dmsnell
6 months ago
Unicode has a range of Tag Characters (U+E0000 to U+E007F), created for marking regions of text as coming from another language. These were deprecated for this purpose in favor of higher-level markup (such as HTML tags), but the characters still exist.
They are special because they are invisible and sequences of them behave as a single character for cursor movement.
They mirror ASCII so you can encode arbitrary JSON or other data inside them. Quite suitable for marking LLM-generated spans, as long as you don’t mind annoying people with hidden data or deprecated usage.
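To make the "mirror ASCII" point concrete, here is a minimal sketch of the idea, assuming the payload is JSON and that it is simply appended after the visible text. The names `to_tags`, `from_tags`, and `mark_span` are made up for illustration; nothing here is a standard API or an established convention.

```python
import json

TAG_BASE = 0xE0000  # tag character = TAG_BASE + ASCII code point


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII (0x20..0x7E) onto the invisible tag characters."""
    out = []
    for ch in ascii_text:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(TAG_BASE + cp))
    return "".join(out)


def from_tags(text: str) -> str:
    """Recover the ASCII hidden in any tag characters found in text."""
    return "".join(
        chr(ord(ch) - TAG_BASE)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


def mark_span(visible: str, metadata: dict) -> str:
    """Append metadata as hidden JSON (assumed layout: payload after the text)."""
    # ensure_ascii=True escapes everything outside printable ASCII.
    payload = json.dumps(metadata, ensure_ascii=True, separators=(",", ":"))
    return visible + to_tags(payload)
```

Something like `mark_span("Hello", {"source": "llm"})` should render as plain "Hello" in most environments, while `json.loads(from_tags(...))` on the result gives the metadata back.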
Can't I get around this by starting my text selection one character after the start of some AI-generated text and ending it one character before the end, Ctrl-C, Ctrl-V?
Yes, that’s correct. All of these measures, of course, stand as a courtesy and are trivial to bypass, as ema notes.
Finding cryptographic-strength measures to identify LLM-generated content is a few orders of magnitude harder than optimistically marking it. Marking also relies on the content producer adding those indicators in the first place, so producers who skip that step will remain a major source of missing metadata.
But sometimes lossy mechanisms are still helpful: people without malicious intent might copy and paste without being aware that the content is generated, and an auditor (anyone who inspects one level deeper) can then discover the source of the content in some (most?) cases.
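As a rough illustration of what "inspecting one level deeper" could look like, the sketch below scans a string for runs of tag characters and reports where each run starts and what ASCII it hides. `find_hidden_payloads` is a hypothetical helper name, and the sketch assumes the payload uses only the printable part of the tag range.

```python
import re

# Runs of tag characters that mirror printable ASCII (U+E0020..U+E007E).
TAG_RUN = re.compile("[\U000E0020-\U000E007E]+")


def find_hidden_payloads(text: str) -> list[tuple[int, str]]:
    """Return (offset, decoded ASCII) for every run of tag characters."""
    return [
        (m.start(), "".join(chr(ord(c) - 0xE0000) for c in m.group()))
        for m in TAG_RUN.finditer(text)
    ]
```

An empty result only means no tags survived; it says nothing about whether the text was generated.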
There are many ways to get around this since it is trivial to write code that strips those tags.
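For a sense of how little code that takes, here is a sketch that removes every code point in the Tags block. `strip_tags` is just an illustrative name.

```python
import re

# The whole Tags block, U+E0000..U+E007F.
TAG_CHARS = re.compile("[\U000E0000-\U000E007F]")


def strip_tags(text: str) -> str:
    """Drop all tag characters, leaving the visible text untouched."""
    return TAG_CHARS.sub("", text)
```

Note that this would also strip the tag sequences used by some emoji flags, so a careful tool might want to be more selective.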