Comment by plastic3169
16 hours ago
Great work! I have been wondering what it would take to train with higher image bit depth (10- or 12-bit) and/or using camera footage only, not already heavily processed images. The usefulness of video generation in most professional use cases is limited because models are too end-to-end and completely contaminated with stock footage. Maybe the quantity of training material needed simply isn't there?
Not blaming you, but asking as I don’t usually have access to professionals working with video training.
It's a great question. In terms of pre-training, even if there were enough data at that quality, storing it and either demuxing it into raw frames or compressing it with a sufficiently powerful encoder would likely cost a lot of $.

But there's a case for using a much smaller subset of that data to dial in aesthetics toward the end of training. The gotcha there is data diversity: you often see models adapt to the new distribution and forget patterns from the old data. It's hard to disentangle a model learning clarity of detail from learning concepts, so you might forget key ideas while picking up these details.

Nevertheless, maybe there is a way to use small amounts of this data in an RL fine-tuning setup? In our experience RL post-training changes very little in the underlying model weights, so it might be a "light" enough touch to elicit the desired details.
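To make the storage point concrete, here's a back-of-envelope sketch of what raw 10-bit frames cost on disk. The resolution, frame rate, and hour counts are illustrative assumptions, not numbers from any actual corpus; the chroma factor assumes planar 4:2:0 subsampling.

```python
def raw_size_tb(width, height, fps, hours, bit_depth=10, chroma_factor=1.5):
    """Terabytes for raw planar 4:2:0 frames: full-res luma plane plus two
    half-res chroma planes gives 1.5x bytes-per-pixel of the luma alone."""
    bytes_per_pixel = bit_depth / 8 * chroma_factor
    frames = fps * hours * 3600
    return width * height * bytes_per_pixel * frames / 1e12

# One hour of 1080p24 10-bit raw video:
print(f"{raw_size_tb(1920, 1080, 24, 1):.2f} TB per hour")  # ~0.34 TB

# A hypothetical 100k-hour high-quality subset, stored raw:
print(f"{raw_size_tb(1920, 1080, 24, 100_000):.0f} TB total")
```

Even at modest 1080p, an uncompressed subset of this size lands in the tens of petabytes, which is why demux-to-raw vs. re-encode is a real cost decision.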
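One way to check the "RL post-training is a light touch" observation yourself is to measure the relative L2 norm of the parameter delta between the base and post-trained checkpoints. This is a toy sketch: the checkpoints here are small dicts of floats with made-up parameter names, standing in for real model state dicts.

```python
import math

def relative_delta(base, tuned):
    """||tuned - base|| / ||base||, pooled over all parameters."""
    diff_sq = base_sq = 0.0
    for name, base_params in base.items():
        tuned_params = tuned[name]
        for b, t in zip(base_params, tuned_params):
            diff_sq += (t - b) ** 2
            base_sq += b ** 2
    return math.sqrt(diff_sq) / math.sqrt(base_sq)

# Toy "checkpoints" with hypothetical parameter names:
base = {"attn.w": [0.5, -0.3, 0.8], "mlp.w": [1.0, 0.2]}
tuned = {"attn.w": [0.501, -0.299, 0.8], "mlp.w": [1.0, 0.201]}
print(f"relative weight change: {relative_delta(base, tuned):.4%}")
```

If the reply's observation holds for your model, the same computation over the real checkpoints would come out at a fraction of a percent, versus several percent for a full fine-tune on a new distribution.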