Comment by andersa

4 days ago

I heard a possibly unsubstantiated rumor that they had a major failed training run with the video model and canceled the project.

Makes no sense, since they should have earlier checkpoints in the run that they could restart from, plus regular monitoring to catch when a model has exploded, etc.
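Those "regular checks" are typically simple loss-based monitors paired with a checkpoint rollback policy. A minimal sketch of the idea (all names and thresholds here are hypothetical illustrations, not any lab's actual tooling):

```python
import math

def detect_explosion(loss_history, window=100, spike_factor=5.0):
    """Flag a run as 'exploded' when the latest loss is NaN/inf or
    spikes well above the recent moving average (thresholds are
    made up for illustration)."""
    latest = loss_history[-1]
    if math.isnan(latest) or math.isinf(latest):
        return True
    recent = loss_history[-window:]
    baseline = sum(recent) / len(recent)
    return latest > spike_factor * baseline

def pick_restart_step(checkpoint_steps, exploded_at, margin=1000):
    """Restart policy sketch: return the newest checkpoint saved
    comfortably before the blow-up, or None if there isn't one."""
    candidates = [s for s in checkpoint_steps if s <= exploded_at - margin]
    return max(candidates) if candidates else None
```

With monitors like this in place, a blow-up costs you the steps since the last good checkpoint, not the whole run, which is why "we lost everything" stories are hard to believe.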

  • I didn't read "major failed training run" as in "the process crashed and we lost all data" but more like "After spending N weeks on training, we still didn't achieve our target(s)", which could be considered "failing" as well.

    • They could have done what Lightricks did with LTX-1 - build almost embarrassingly small models in the open and iteratively improve from what they learned.

      LTX's first model felt two years behind SOTA when it launched, but they viewed it as a success and kept going.

      The initial investment is low and can be scaled up with confidence.

      BFL goes radio silent and then drops stuff; now the stuff they're dropping is clearly middle of the pack.

  • There's always a possibility that something implicit in the early model structure causes it to explode later, even if it's a well-known, otherwise stable architecture and you do everything right. A cosmic bit flip at the start of a training run can cascade into subtle instability and eventual total failure, and part of the hard decision-making they have to do includes knowing when to start over.

    I'd take it with a grain of salt; these people are chainsaw jugglers and know what they're doing, so any sort of major hiccup was probably planned for. They'd have plans B and C at a minimum and be ready to switch - the work isn't deterministic, so you have to be ready for failures. (If you sense an imminent failure, don't grab the spinny part of the chainsaw; let it fall and move on.)

lol, unless I’m wrong, that is not how model development works

a ‘major training run’ only becomes major after you sample from it iteratively every few thousand steps, check it's good, fix your pipeline, then continue

almost by design, major training runs don’t fail

if I had to guess, like most labs, they've probably had to reallocate more time and energy to their image models than expected, since the AI image editing market has exploded in size this year, and they'll do video later
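The "sample every few thousand steps, check, then continue" loop described above can be sketched roughly like this (a toy illustration; `step_fn` and `check_fn` stand in for a real training step and a real sample review, automated or human):

```python
def run_with_checks(step_fn, check_fn, total_steps, eval_every=2000):
    """A long run as a chain of short gated segments: advance training,
    and every `eval_every` steps sample and inspect before committing
    more compute. Halt at the last good step if a check fails, so the
    run never silently burns compute past a broken point."""
    last_good = 0
    for step in range(1, total_steps + 1):
        step_fn(step)                      # one training step
        if step % eval_every == 0:
            if check_fn(step):             # samples look healthy
                last_good = step           # checkpoint this segment
            else:
                return ("halted", last_good)
    return ("completed", last_good)
```

Under this framing a run only "becomes major" by repeatedly passing the gate, which is why a fully committed major run failing outright would be unusual.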

  • It could be that they weren't able to produce stable video -- i.e., a consistent look across frames; video is more complex than images for exactly this reason. If their architecture couldn't handle that properly, then no amount of training would fix it.

    If they found that their architecture worked better on static images, then it is better to pivot to that than to waste the effort - especially if you have a trained model that is good at producing static images and bad at generating video.