Comment by canpan

14 hours ago

I wondered: how is training data balanced? If you put in too much Wikipedia, does your model end up sounding like a walking encyclopedia?

After doing the Karpathy tutorials, I tried training my model on the TinyStories dataset. I soon noticed that it kept using the same name for its story characters. That name appears disproportionately often in the dataset.
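This kind of imbalance is easy to check before training. A minimal sketch (the story strings here are a made-up stand-in for actual TinyStories entries, and the capitalized-word heuristic is a crude proxy for character names):

```python
import re
from collections import Counter

# Hypothetical sample standing in for TinyStories entries; in practice you
# would iterate over the real dataset.
stories = [
    "Lily went to the park. There Lily saw a dog named Max.",
    "Once upon a time, Lily found a shiny stone.",
    "Tom and Lily played in the garden with Max.",
]

# Crude heuristic: count capitalized words (minus common sentence starters)
# as candidate character names across the corpus.
skip = {"The", "A", "Once", "There", "Then", "One"}
counts = Counter(
    w
    for story in stories
    for w in re.findall(r"\b[A-Z][a-z]+\b", story)
    if w not in skip
)

print(counts.most_common(3))  # [('Lily', 4), ('Max', 2), ('Tom', 1)]
```

If one name dominates the counts, the model will reproduce that skew unless you rebalance or augment the data.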

At that scale, this kind of thing is not really a problem: you just dump all the data you can find into the model (pre-training)1. The pre-training data does influence the model, of course, but reinforcement learning is what really determines the model’s writing style and, more generally, how it “thinks” (post-training).

1 This data is still heavily filtered and cleaned.