Comment by schopra909
15 days ago
That all being said, you can just delete the T5 from memory after encoding the text to save on memory.
The 2B parameters will take up 4 GB of memory, but the activations will take a lot more given the size of the context windows for video.
A 720p, 5-second video is roughly 100K tokens of context.
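
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes 16-bit weights (2 bytes per parameter), which is what makes 2B parameters come out to 4 GB. The hidden size of 2048 is a hypothetical value picked for illustration, not a figure for the actual model, and the helper names are made up for this example.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 assumes fp16/bf16 storage (an assumption).
    """
    return num_params * bytes_per_param / 1e9


def activation_tensor_gb(num_tokens: int, hidden_size: int,
                         bytes_per_value: int = 2) -> float:
    """Memory for ONE [num_tokens, hidden_size] activation tensor, in GB."""
    return num_tokens * hidden_size * bytes_per_value / 1e9


# 2B parameters at 16-bit precision.
weights_gb = weight_memory_gb(2e9)

# ~100K tokens for a 720p, 5-second clip; hidden_size=2048 is hypothetical.
one_tensor_gb = activation_tensor_gb(100_000, hidden_size=2048)

print(f"weights: {weights_gb:.1f} GB")
print(f"one activation tensor: {one_tensor_gb:.2f} GB")
```

A single tensor at that sequence length is already a noticeable fraction of the weights, and a forward pass keeps many of them alive at once (per-layer hidden states, attention and MLP intermediates), which is why the activations end up dominating.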