Comment by SeanAnderson
7 hours ago
yes to both.
absolutely requires longer training time and more compute.
once trained, predictions need to hold through many more steps because each step processes one token. if a token early in a sentence heavily implies that another token will occur later in the sentence, that awareness needs to be maintained while processing each intermediate token, and each step is a bit lossy. the fewer steps you need to take before leveraging that knowledge, the better the prediction.
if you had infinite compute and data for training then performance would be equivalent though, i think.
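the "each step is a bit lossy" point can be sketched numerically: if each sequential step retains only some fraction of the earlier context, the usable signal decays exponentially with the number of intervening tokens. this toy model and the `retention` parameter are my own illustration, not anything from the comment:

```python
# toy sketch (illustrative assumption, not a real model): suppose each
# per-token step preserves only a fraction `retention` of information
# carried forward from earlier in the sequence. the surviving signal
# after `steps` intervening tokens then decays geometrically.
def remaining_signal(retention: float, steps: int) -> float:
    return retention ** steps

# even a mild 1% loss per step compounds quickly:
print(remaining_signal(0.99, 10))   # ~0.90 after 10 tokens
print(remaining_signal(0.99, 100))  # ~0.37 after 100 tokens
```

which is one way to see why fewer steps between the implying token and the implied token gives a better prediction.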