Comment by ACCount37
15 hours ago
The conventional wisdom is that real world text is the most valuable pre-training data.
There is some experimentation with using algorithmically generated synthetic data in pre-training, as well as some intentional inclusion of "weird" data - like CSV logs of weather readings. But generally, it's seen as computationally inefficient compared to "normal" pre-training on natural data.
In a world where compute is much cheaper and getting new data is much more expensive, I would expect this kind of thing to be pursued more. We're heading for that world. But we aren't there yet.
I haven't experimented with baseN encodings myself, no. But if I were to set down my expectations in advance:
1. Base64 is by far the best-known baseN encoding in LLMs.
2. This is driven mainly by how well represented meaningful base64 strings are in the natural "scraped web" datasets. LLMs learn base64 the way they learn languages.
3. Every LLM pre-trained on "scraped web" data will be somewhat capable of reading and writing base64.
4. Base64-encoded text is easier for an LLM to read than base64-encoded non-text binary data.
5. The existence of a strict, learnable "4 characters -> 3 bytes" map is quite beneficial, but not vital.
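For concreteness, here's a minimal Python sketch (not from the original comment) of the fixed mapping referenced in point 5, and of the text-vs-binary distinction in point 4; the example strings are purely illustrative.

```python
import base64
import os

# Base64 maps every 3 input bytes to 4 output characters,
# padding with '=' when the input length isn't a multiple of 3.
text = b"Hello, world!"
encoded = base64.b64encode(text)

print(encoded)                                   # b'SGVsbG8sIHdvcmxkIQ=='
print(len(text), "bytes ->", len(encoded), "characters")

# Decoding inverts the same fixed mapping: 4 characters -> 3 bytes.
decoded = base64.b64decode(encoded)
assert decoded == text

# Encoded natural-language text retains the statistical regularities of the
# underlying text, whereas encoded random bytes carry no such structure -
# which is why point 4 expects the former to be easier for an LLM to read.
print(base64.b64encode(os.urandom(12)))
```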