Comment by simonw
2 years ago
Something I never understood about ChatML: were those "<|im_start|>" things reserved sequences of text that mapped to specific integer tokens, but were not things you could include in your own text that you submitted to their API (or if you did try they would be tokenized differently)?
OpenAI presumably adds them as special tokens on top of the cl100k_base tokenizer, as demonstrated in the tiktoken documentation: https://github.com/openai/tiktoken#extending-tiktoken
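In theory the same strings could be included in normal input, but it's possible OpenAI has safeguards against that, for example by encoding user-supplied text so that those strings never map to the reserved token IDs.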
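
To make the distinction concrete, here is a minimal toy sketch in plain Python. It is not tiktoken and not OpenAI's actual implementation: the byte-level "tokenizer", the `allow_special` flag, and the function names are hypothetical, and the two IDs are only illustrative. The idea it shows is that the same string can either map to one reserved integer or be broken up like ordinary text, depending on whether the caller is trusted to emit special tokens.

```python
import re

# Illustrative reserved IDs (assumption: real values depend on the tokenizer).
SPECIAL_TOKENS = {"<|im_start|>": 100264, "<|im_end|>": 100265}
_SPECIAL_RE = re.compile("|".join(re.escape(t) for t in SPECIAL_TOKENS))


def encode_ordinary(text):
    """Toy stand-in for BPE: one token per UTF-8 byte (IDs 0-255)."""
    return list(text.encode("utf-8"))


def encode(text, allow_special=False):
    """Encode text; map special strings to reserved IDs only if allowed."""
    if not allow_special:
        # Untrusted path: special strings are treated as plain text.
        return encode_ordinary(text)
    ids = []
    pos = 0
    for match in _SPECIAL_RE.finditer(text):
        ids.extend(encode_ordinary(text[pos:match.start()]))
        ids.append(SPECIAL_TOKENS[match.group()])
        pos = match.end()
    ids.extend(encode_ordinary(text[pos:]))
    return ids


message = "<|im_start|>user\nhi<|im_end|>"
trusted = encode(message, allow_special=True)   # starts/ends with reserved IDs
untrusted = encode(message)                      # every ID is an ordinary byte
```

For comparison, tiktoken's `encode` takes `allowed_special` and `disallowed_special` arguments and by default raises an error when the input contains special-token text, while `encode_ordinary` ignores special tokens entirely. Whether OpenAI's API does something similar on the server side is not something this sketch can confirm.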