Comment by ACCount37

16 hours ago

Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there.
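
For a rough sense of scale, here is a back-of-envelope sketch. The vocabulary size, hidden size, and dtype widths are hypothetical numbers picked for illustration, not taken from any particular model.

```python
# Hypothetical sizes, for illustration only.
VOCAB_SIZE = 128_000   # assumed vocabulary size
HIDDEN_SIZE = 4_096    # assumed hidden-state width
BYTES_PER_VALUE = 2    # assumed fp16 storage
BYTES_PER_TOKEN_ID = 4 # assumed int32 token ids
TOP_K = 10

def bytes_per_position(num_values, bytes_per_value=BYTES_PER_VALUE):
    """Storage needed for one sequence position."""
    return num_values * bytes_per_value

full_dist_bytes = bytes_per_position(VOCAB_SIZE)   # one logit per vocab entry
hidden_bytes = bytes_per_position(HIDDEN_SIZE)     # one final hidden state
# Top-k needs both the logit and the id of the token it belongs to.
top_k_bytes = TOP_K * (BYTES_PER_VALUE + BYTES_PER_TOKEN_ID)

print(full_dist_bytes, hidden_bytes, top_k_bytes)
```

Under these assumed sizes the hidden state is a small fraction of the full distribution per position, and the full distribution can be recomputed from it if you still have the output projection.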

To the previous poster's point: soft distributions are useful. Even saving just the top 10 logits gives significantly more training signal than the final token alone.
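
A minimal sketch of what that could look like, in plain Python. The function names are made up for illustration, and renormalizing the teacher distribution over only the saved top-k tokens is an assumption of this sketch, not the only way to handle the truncated tail.

```python
import math

def top_k_logits(logits, k=10):
    """Return the k largest logits as (token_id, logit) pairs, highest first."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, saved_top_k):
    """Cross-entropy of the student against the saved teacher top-k.

    Assumption: the teacher distribution is renormalized over the saved
    tokens only; probability mass outside the top-k is dropped.
    """
    token_ids = [token_id for token_id, _ in saved_top_k]
    teacher_probs = softmax([logit for _, logit in saved_top_k])
    # Student log-probabilities are taken over the full vocabulary.
    peak = max(student_logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in student_logits))
    return -sum(
        p * (student_logits[token_id] - log_z)
        for token_id, p in zip(token_ids, teacher_probs)
    )

# Toy example with a 5-token vocabulary and k=2.
teacher_logits = [2.0, 0.5, 1.5, -1.0, 0.0]
saved = top_k_logits(teacher_logits, k=2)
loss = soft_target_loss([1.0, 0.0, 1.0, 0.0, 0.0], saved)
```

With a hard label the student only learns which token won; with the saved pairs it also sees how close the runners-up were.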