Comment by Terr_
2 hours ago
What we have is a machine trained on many old documents that takes one new document and dreams up stuff to append. The LLM algorithm cannot specially recognize contents as "instructions" to itself-the-author.
Even if special tokens are used absolutely perfectly (somehow avoiding escapes or ambiguities or reflected attacks) they are ultimately the same as highlighting all the parts of the document in different colors. You've saved the signal, but there's no mind to receive the intended meaning.
This means that your markers--while far more exclusive--ultimately exist on the same data-level as punctuation and using ? to indicate a question.
> you can prevent the output tokens from ever using instruction formatting
The right words may still outweigh the formatting around them, the same way that they can already outweigh other words around them.
No comments yet
Contribute on Hacker News ↗