Comment by Leynos
1 day ago
They don't see runs of spaces very well, so most of them are terrible at ASCII art. (They'll often regurgitate something from their training data rather than try themselves.)
And unless their terminal details are included in the context, they'll just have to guess.
Runs of spaces of many different lengths are each encoded as a single token, so it's not actually inefficient.
In fact, everything from a single space up to a run of 79 spaces has its own single token in the OpenAI GPT-4 tokenizer. Sometimes a run of spaces followed by '\n' is also assigned a single token.
You might ask why they do this: it's to make programming use fewer tokens. All the whitespace before a line of code gets jammed into a single token, and entire empty lines also get turned into single tokens.
There are actually lots of interesting hand-crafted token features like this that don't get discussed much.