Comment by necovek

13 hours ago

Why are you so confident about this? I am honestly interested in whether you were part of an LLM training data collection team, because that's the only way to be so certain.

It's trivial to generate a full mapping of all 4-character base64 sequences to the 3-byte sequences they encode (there are only 64^4 = 256^3 of them, about 16.8 million), and especially of the sequences that decode to ASCII (obviously even fewer). If I were building a training set, I would include the mapping in multiple shapes and formats, because why not?
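To make the size of that mapping concrete, here is a minimal sketch using Python's standard `base64` module. The helper name `ascii_triple_map` and the lowercase-only demo alphabet are illustrative assumptions, not anything a real training pipeline is known to use.

```python
import base64
from itertools import product

# Every 3-byte group corresponds to exactly one 4-character base64 group.
n_byte_triples = 256 ** 3
n_b64_quads = 64 ** 4
assert n_byte_triples == n_b64_quads == 16_777_216

def ascii_triple_map(chars):
    """Map every 3-character string over `chars` to its 4-character base64 form.

    Hypothetical helper for illustration only.
    """
    table = {}
    for triple in product(chars, repeat=3):
        text = "".join(triple)
        table[text] = base64.b64encode(text.encode("ascii")).decode("ascii")
    return table

# Small demo: lowercase letters only, so 26^3 entries instead of 256^3.
lowercase_map = ascii_triple_map("abcdefghijklmnopqrstuvwxyz")
print(len(lowercase_map))    # 17576
print(lowercase_map["cat"])  # Y2F0
```

Swapping in all printable ASCII characters gives a larger table that is still cheap to enumerate.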

If it's an emergent "property", have you tried asking an LLM to do base48, for instance? Or maybe even something crazier like base55 (keeping the alphabet a subset of the base64 set)?
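There is no standard "base55", so anyone running this test needs a reference to compare against. One possible definition, sketched below as an assumption, treats the input as a single big integer and writes it in base N using the first N characters of the base64 alphabet. The function names are made up for this example, and this scheme drops leading zero bytes.

```python
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def encode_base_n(data: bytes, n: int) -> str:
    """Encode bytes as a base-n numeral over the first n base64 characters."""
    if not 2 <= n <= len(B64_ALPHABET):
        raise ValueError("n must be between 2 and 64")
    alphabet = B64_ALPHABET[:n]
    value = int.from_bytes(data, "big")
    if value == 0:
        return alphabet[0]
    digits = []
    while value:
        value, rem = divmod(value, n)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))

def decode_base_n(text: str, n: int) -> bytes:
    """Inverse of encode_base_n; raises ValueError on out-of-alphabet characters."""
    alphabet = B64_ALPHABET[:n]
    value = 0
    for ch in text:
        value = value * n + alphabet.index(ch)
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, "big")

encoded = encode_base_n(b"hello", 55)
print(encoded)
print(decode_base_n(encoded, 55))  # b'hello'
```

Because 55 is not a power of two, there is no fixed "k characters -> m bytes" block map here, which is part of what makes it a harder test than base64.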

The conventional wisdom is that real world text is the most valuable pre-training data.

There is some experimentation with algorithmically generated synthetic data in pre-training, as well as some intentional inclusion of "weird" data - like CSV logs of weather readings. But generally, it's seen as computationally inefficient compared to "normal" pre-training done on natural data.

In a world where compute is much cheaper and getting new data is much more expensive, I would expect this kind of thing to be pursued more. We're heading for that world. But we aren't there yet.

I haven't experimented with baseN encodings myself, no. But if I were to write down my expectations in advance:

1. Base64 is by far the best-known baseN encoding in LLMs.

2. This is driven mainly by how well represented meaningful base64 strings are in the natural "scraped web" datasets. LLMs learn base64 the way they learn languages.

3. Every LLM pre-trained on "scraped web" data will be somewhat capable of reading and writing base64.

4. Base64-encoded text is easier to read for an LLM than encoded non-text binary data.

5. The existence of a strict, learnable "4 characters -> 3 bytes" map is quite beneficial, but not vital.
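The "4 characters -> 3 bytes" map in point 5 can be checked directly: when the input length is a multiple of 3 (so there is no padding), each 4-character group decodes on its own, with no dependence on its neighbours. A minimal sketch with the standard `base64` module, using an arbitrary sample string:

```python
import base64

message = b"Hello, world"  # 12 bytes, so the encoding needs no '=' padding
encoded = base64.b64encode(message).decode("ascii")

# Split the encoded text into independent 4-character groups.
chunks = [encoded[i:i + 4] for i in range(0, len(encoded), 4)]

# Decode each group separately and stitch the 3-byte results together.
decoded = b"".join(base64.b64decode(chunk) for chunk in chunks)

print(chunks)
print(decoded == message)  # True
```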

For kicks, I've tried this out with ChatGPT5: it nicely explained how it will use A-Za-z0123 as the alphabet for base55, and then duly went and produced a string with a 4 in it. It's not even base64, so it's all sorts of messy :)
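A quick way to check output like that mechanically, as a rough sketch: expand the alphabet the model announced and list any characters that fall outside it. The sample string below is made up for illustration, not the actual model output.

```python
import string

# The alphabet as described: A-Z, a-z, then the digits 0-3.
claimed_alphabet = string.ascii_uppercase + string.ascii_lowercase + "0123"
print(len(claimed_alphabet))  # 56, one more than base55 needs

def out_of_alphabet(text, alphabet):
    """Return (index, character) pairs for characters not in the alphabet."""
    allowed = set(alphabet)
    return [(i, ch) for i, ch in enumerate(text) if ch not in allowed]

sample = "TWFu4Hk0"  # hypothetical output, not the real transcript
print(out_of_alphabet(sample, claimed_alphabet))  # [(4, '4')]
```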