Comment by jerpint
2 months ago
Batching can lead to variance with things like batchnorm, but most transformers use layer norm, which avoids this problem.
Batchnorm can only make one example's output depend on the rest of the batch during training, not at inference, where it uses fixed running statistics.
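
Roughly, a minimal sketch of the difference (PyTorch is assumed here, purely for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)   # a batch of 4 examples, 8 features each

ln = nn.LayerNorm(8)
bn = nn.BatchNorm1d(8)

# LayerNorm normalizes each example on its own, so example 0's output
# is the same whether it sits in a batch of 4 or a batch of 2.
print(torch.allclose(ln(x)[0], ln(x[:2])[0]))   # True

# BatchNorm in training mode normalizes with batch statistics, so
# example 0's output changes when its batch-mates change.
bn.train()
print(torch.allclose(bn(x)[0], bn(x[:2])[0]))   # False

# BatchNorm in eval mode uses fixed running statistics, so batch
# composition no longer has any effect at inference time.
bn.eval()
print(torch.allclose(bn(x)[0], bn(x[:2])[0]))   # True
```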