Comment by Jabrov
8 hours ago
They absolutely are. The “maximum context window” of a model is a byproduct of the context length it was trained on.
If your model only ever sees 8K-token samples during training, it won't be as good at a 128K context length as it would be if you had trained on samples ranging from 8K to 128K tokens.
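One simple way to get that mix is to draw each training sample's length from the full target range when packing the token stream into batches. A minimal sketch (the function name, the uniform distribution, and the 8K/128K bounds are illustrative assumptions, not any particular lab's recipe):

```python
import random

def sample_lengths(total_tokens, min_len=8_192, max_len=131_072, seed=0):
    """Split a token budget into training-sample lengths drawn
    uniformly from [min_len, max_len], so the model sees short
    and long contexts alike during training."""
    rng = random.Random(seed)
    lengths = []
    remaining = total_tokens
    while remaining >= min_len:
        # Cap the drawn length at whatever budget is left.
        n = min(rng.randint(min_len, max_len), remaining)
        lengths.append(n)
        remaining -= n
    return lengths

lengths = sample_lengths(1_000_000)
```

In practice the length distribution is usually skewed toward shorter samples (long documents are rare and long-attention steps are expensive), but the principle is the same: the model only learns to use context lengths it actually encounters.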