Comment by qoez

4 days ago

This makes no sense: they should have checkpoints from earlier in the run that they could restart from, and regular checks that track whether the model has exploded (diverged).
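To make the "checkpoints plus regular checks" idea concrete, here is a minimal sketch in plain Python. It is not anyone's actual training setup: the function name `run_with_rollback`, the parameters `checkpoint_every`, `window`, and `spike_factor`, and the spike rule itself are all assumptions chosen for illustration. It walks a list of per-step losses, records a checkpoint every few steps, and reports a rollback to the last checkpoint when the loss becomes NaN/inf or jumps well above its recent average.

```python
import math
from collections import deque


def run_with_rollback(losses, checkpoint_every=3, window=5, spike_factor=3.0):
    """Scan per-step losses and return a list of (step, action, checkpoint_step).

    Hypothetical rule: a step counts as diverged if its loss is NaN/inf, or if
    it exceeds `spike_factor` times the mean of the last `window` healthy losses.
    """
    events = []
    recent = deque(maxlen=window)  # rolling window of healthy losses
    last_good = None               # step of the most recent checkpoint

    for step, loss in enumerate(losses):
        diverged = math.isnan(loss) or math.isinf(loss)
        if not diverged and len(recent) == window:
            diverged = loss > spike_factor * (sum(recent) / window)

        if diverged:
            # Stop here and report which checkpoint to restart from.
            events.append((step, "rollback", last_good))
            break

        recent.append(loss)
        if step % checkpoint_every == 0:
            last_good = step
            events.append((step, "checkpoint", step))

    return events


events = run_with_rollback([2.0, 1.8, 1.6, 1.5, 1.4, 1.3, 9.0, 1.2])
print(events)
```

In a real run the checkpoint would be saved model and optimizer state rather than a step number, and the divergence rule would likely also look at gradient norms, but the control flow is the same: detect early, fall back to the last known-good state.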

I didn't read "major failed training run" as in "the process crashed and we lost all data" but more like "After spending N weeks on training, we still didn't achieve our target(s)", which could be considered "failing" as well.

  • They could have done what Lightricks did with LTX-1: build almost embarrassingly small models in the open and iteratively improve on what they learn.

    LTX's first model felt two years behind SOTA when it launched, but they viewed it as a success and kept going.

    The initial investment is low and can scale up as confidence grows.

    BFL goes radio silent and then drops stuff. Now they're dropping stuff that is clearly middle of the pack.

    • Going from launching SOTA models to launching "embarrassingly small models" isn't something investors are generally into, especially when you're deciding which training runs to launch and with what parameters. And since BFL has investors, they have to make choices that try to maximize ROI for those investors rather than for the community at large, so this is hardly surprising.

There's always a possibility that something implicit in the early model structure causes it to explode later, even if it's a well-known, otherwise stable architecture and you do everything right. A cosmic-ray bit flip at the start of a training run can cascade into subtle instability and eventual total failure, and part of the hard decision-making they have to do is knowing when to start over.

I'd take it with a grain of salt; these people are chainsaw jugglers and know what they're doing, so any sort of major hiccup was probably planned for. They'd have a plan B and C, at a minimum, and be ready to switch. The work isn't deterministic, so you have to be ready for failures. (If you sense an imminent failure, don't grab the spinny part of the chainsaw; let it fall and move on.)