Comment by SeanAnderson

10 hours ago

Yes to both.

It absolutely requires longer training time and more compute.

Once trained, predictions also need to hold through many more steps, because each step processes one token. If a token early in a sentence heavily implies that another token will occur later in the sentence, that awareness has to be maintained while processing each intermediary token, and each step is a bit lossy. The fewer steps you need to take before leveraging that knowledge, the better the prediction.
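A toy sketch of that intuition (the `retention` number is made up, purely illustrative): if each step keeps only some fraction of an earlier token's signal, the usable signal decays geometrically with the number of intermediary steps.

```python
def surviving_signal(retention: float, steps: int) -> float:
    """Fraction of an early token's signal left after `steps` lossy steps.

    Hypothetical model: each step multiplies the signal by `retention`,
    so the result is simply retention ** steps.
    """
    signal = 1.0
    for _ in range(steps):
        signal *= retention  # each step is a bit lossy
    return signal

# Fewer intermediary steps -> more of the early token's signal survives.
print(surviving_signal(0.95, 5))   # ~0.77 after 5 steps
print(surviving_signal(0.95, 50))  # ~0.08 after 50 steps
```

Under this (very simplified) model, halving the distance between the hinting token and the predicted token does much more than halve the information lost along the way.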

If you had infinite compute and data for training, though, I think performance would be equivalent.