Comment by whinvik

3 days ago

Looks very interesting. Can you comment on why you think this model can give comparable performance with less training data?

We train the model with `explanations`. Most language-model training asks the model only to predict the next token or group of tokens. Our training says: predict the next group of tokens (causal diffusion), but also predict that these tokens should be about {sports/art/coding/etc.}. So in addition to token-level supervision, the model gets concept-level supervision, which forces it to learn these high-level concepts more quickly.
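To make the idea concrete, here is a minimal sketch of what a combined objective could look like: a standard token-level cross-entropy plus a concept-level cross-entropy over a coarse label for the token group. The shapes, the `alpha` weighting, and the function names are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def combined_loss(token_logits, token_targets,
                  concept_logits, concept_target, alpha=0.5):
    """Sketch of dual supervision (shapes/alpha are assumptions):

    token_logits:   (group_len, vocab)   logits for each token in the group
    token_targets:  (group_len,)         gold token ids
    concept_logits: (n_concepts,)        logits over concept labels
    concept_target: int                  gold concept id (e.g. "sports")
    """
    # Token-level supervision: cross-entropy on the next token group.
    tok_probs = softmax(token_logits)
    tok_loss = -np.mean(np.log(
        tok_probs[np.arange(len(token_targets)), token_targets] + 1e-12))

    # Concept-level supervision: cross-entropy on the group's concept label.
    con_probs = softmax(concept_logits)
    con_loss = -np.log(con_probs[concept_target] + 1e-12)

    return tok_loss + alpha * con_loss
```

The extra term gives the model a training signal at a higher level of abstraction than individual tokens, which is the intuition behind needing less data.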