Comment by andersa

4 days ago

I heard a possibly unsubstantiated rumor that they had a major failed training run with the video model and canceled the project.

Makes no sense, since they should have earlier checkpoints in the run that they could restart from, plus regular monitoring to catch when a model has exploded, etc.
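Those "regular checks" are typically simple loss-based monitors paired with a checkpoint rollback policy. A minimal sketch of the idea (all names and thresholds here are hypothetical illustrations, not any lab's actual tooling):

```python
import math

def detect_explosion(loss_history, window=100, spike_factor=5.0):
    """Flag a run as 'exploded' when the latest loss is NaN/inf or
    spikes well above the recent moving average (thresholds are
    made up for illustration)."""
    latest = loss_history[-1]
    if math.isnan(latest) or math.isinf(latest):
        return True
    recent = loss_history[-window:]
    baseline = sum(recent) / len(recent)
    return latest > spike_factor * baseline

def pick_restart_step(checkpoint_steps, exploded_at, margin=1000):
    """Restart policy sketch: return the newest checkpoint saved
    comfortably before the blow-up, or None if there isn't one."""
    candidates = [s for s in checkpoint_steps if s <= exploded_at - margin]
    return max(candidates) if candidates else None
```

With monitors like this in place, a blow-up costs you the steps since the last good checkpoint, not the whole run, which is why "we lost everything" stories are hard to believe.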

  • I didn't read "major failed training run" as in "the process crashed and we lost all data" but more like "After spending N weeks on training, we still didn't achieve our target(s)", which could be considered "failing" as well.

    • They could have done what Lightricks did with LTX-1 - build almost embarrassingly small models in the open and iteratively improve from what they learned.

      LTX's first model felt two years behind SOTA when it launched, but they viewed it as a success and kept going.

      The initial investment is low and can be scaled up with confidence.

      BFL goes radio silent and then drops stuff; now the stuff they're dropping is clearly middle of the pack.

  • There's always a possibility that something implicit in the early model structure causes it to explode later, even if it's a well-known, otherwise stable architecture and you do everything right. A cosmic bit flip at the start of a training run can cascade into subtle instability and eventual total failure, and part of the hard decision-making they have to do includes knowing when to start over.

    I'd take it with a grain of salt; these people are chainsaw jugglers and know what they're doing, so any sort of major hiccup was probably planned for. They'd have plans B and C at a minimum and be ready to switch - the work isn't deterministic, so you have to be ready for failures. (If you sense an imminent failure, don't grab the spinny part of the chainsaw; let it fall and move on.)

lol, unless I’m wrong, that is not how model development works

a ‘major training run’ only becomes major after you sample from it iteratively every few thousand steps, check it's good, fix your pipeline, then continue

almost by design, major training runs don’t fail

if I had to guess, like most labs, they've probably had to reallocate more time and energy to their image models than expected, since the AI image editing market has exploded in size this year, and they'll do video later
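The "sample every few thousand steps, check, then continue" loop described above can be sketched roughly like this (a toy illustration; `step_fn` and `check_fn` stand in for a real training step and a real sample review, automated or human):

```python
def run_with_checks(step_fn, check_fn, total_steps, eval_every=2000):
    """A long run as a chain of short gated segments: advance training,
    and every `eval_every` steps sample and inspect before committing
    more compute. Halt at the last good step if a check fails, so the
    run never silently burns compute past a broken point."""
    last_good = 0
    for step in range(1, total_steps + 1):
        step_fn(step)                      # one training step
        if step % eval_every == 0:
            if check_fn(step):             # samples look healthy
                last_good = step           # checkpoint this segment
            else:
                return ("halted", last_good)
    return ("completed", last_good)
```

Under this framing a run only "becomes major" by repeatedly passing the gate, which is why a fully committed major run failing outright would be unusual.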

  • It could be that they weren't able to produce stable video -- i.e., a consistent look across frames; video is more complex than images for exactly this reason. If their architecture couldn't handle that properly, then no amount of training would fix it.

    If they found that their architecture worked better on static images, then it is better to pivot to that than to waste the effort - especially if you have a trained model that is good at producing static images and bad at generating video.