Comment by curioussquirrel

14 hours ago

Even GPT 3.5 is okay (but far from great) at Base64, especially with shorter sequences of English or JSON data. Newer models might be post-trained on Base64-specific data, but I don't believe that was the case for 3.5. My guess is that, as you say, given the abundance of examples on the internet, it became one of the model's emergent capabilities, in spite of its design.

No one does RL for better base64 performance. LLMs are just superhuman at base64, as a natural capability.

If an LLM wants a message to be read only by another LLM? Base64 is occasionally the obfuscation method of choice. Which is weird for a number of reasons.

  • Why are you so confident about this? I'm honestly curious whether you were part of any LLM training-data collection team, because that's the only way to be this certain.

    It's trivial to generate a full mapping from all 4-character Base64 sequences to all 3-byte 8-bit sequences (there are only 256^3 = 16,777,216 distinct 3-byte blocks), and especially for the blocks that come out as ASCII (obviously far fewer). A rough sketch of that enumeration is at the end of this comment. If I was building a training set, I would include the mapping in multiple shapes and formats, because why not?

    If it's an emergent "property", have you tried asking an LLM to do base48, for instance? Or maybe even something crazier like base55 (keeping the alphabet a subset of the Base64 set)? The second sketch below shows one way such a "baseN" encoding could even be defined.
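
    To be concrete about the mapping idea: here's a rough Python sketch of the enumeration I mean. The printable-ASCII restriction and the function name are just for illustration; the unrestricted table would have 256^3 = 16,777,216 entries.

        import base64
        from itertools import product

        # Printable ASCII bytes (32..126): 95 values, so the restricted
        # table has 95**3 = 857,375 entries. The full 8-bit table has
        # 256**3 = 16,777,216 -- still trivial to enumerate.
        printable = bytes(range(32, 127))

        def ascii_triplet_table():
            # Yield (3-byte ASCII block, its 4-character Base64 encoding).
            for a, b, c in product(printable, repeat=3):
                block = bytes((a, b, c))
                yield block, base64.b64encode(block).decode("ascii")

        # Peek at a few entries instead of materialising all ~857k.
        pairs = ascii_triplet_table()
        for _ in range(3):
            block, encoded = next(pairs)
            print(block, "->", encoded)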
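
    And to be concrete about base48/base55: there is no standard spec for those, so this is just one reasonable interpretation, treating the input as a big integer and repeatedly dividing by the alphabet size. The alphabet is a prefix of the standard Base64 set, as suggested above.

        # First n symbols of the standard Base64 alphabet.
        STD_B64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   "abcdefghijklmnopqrstuvwxyz"
                   "0123456789+/")

        def encode_base_n(data: bytes, n: int) -> str:
            # Treat the bytes as one big integer and take digits base n.
            # (Leading zero bytes are not preserved -- fine for a sketch.)
            alphabet = STD_B64[:n]
            num = int.from_bytes(data, "big")
            if num == 0:
                return alphabet[0]
            out = []
            while num:
                num, rem = divmod(num, n)
                out.append(alphabet[rem])
            return "".join(reversed(out))

        print(encode_base_n(b"hello", 48))  # hypothetical "base48"
        print(encode_base_n(b"hello", 55))  # hypothetical "base55"
        print(encode_base_n(b"hello", 64))  # note: not standard Base64,
                                            # which is bit-oriented, not big-integer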