
Comment by spyder

4 days ago

Great, especially that they still have an open-weight variant of this new model too. But what happened to their work on their unreleased SOTA video model? Did it stop being SOTA, did others get ahead and they folded the project, or what? YT video about it: https://youtu.be/svIHNnM1Pa0?t=208 They even removed its page: https://bfl.ai/up-next/

Image models are more fundamentally important at this stage than video models.

Almost all of the control in image-to-video comes through an image. And image models still need a lot of work and innovation.

On a real physical movie set, think about all of the work that goes into setting the stage. The set dec, the makeup, the lighting, the framing, the blocking. All the work before calling "action". That's what image models do and must do in the starting frame.

We can get way more influence out of manipulating images than video. There are lots of great video models and it's highly competitive. We still have so much need on the image side.

When you do image-to-video, yes, you control evolution over time. But the temporal direction actually has fewer degrees of freedom: you expect your actors or explosions to do certain reasonable things. Those 1024x1024xRGB pixels (or higher) have way more degrees of freedom.
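For scale, the raw pixel count mentioned above works out to about three million independent values per frame. A quick back-of-the-envelope (this ignores that real images are highly correlated; it's just the raw dimensionality):

```python
# Raw degrees of freedom in a single 1024x1024 RGB frame.
# Real images are highly correlated, so this overstates the *effective*
# dimensionality -- it's only meant to show the raw scale.
width, height, channels = 1024, 1024, 3
pixel_dof = width * height * channels
print(pixel_dof)  # 3145728, i.e. ~3.1 million values
```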

Image models have more control surface area: you exercise control over more parameters. In video, staying on rails or on certain evolutionary paths is fine. Mistakes can be not just okay, they can be welcome.

It also makes sense that most of the work and iteration goes into generating images. It's a faster workflow with more immediate feedback and productivity. Video is expensive and takes much longer. Images are where the designer or director can influence more of the outcomes with rapidity.

Image models still need way more stylistic control, pose control (not just ControlNets for limbs, but facial expressions, eyebrows, hair - everything), sets, props, consistent characters and locations and outfits. Text layout, fonts, kerning, logos, design elements, ...

We still don't have models that look as good as Midjourney. Midjourney is 100x more beautiful than anything else - it's like a magazine photoshoot or dreamy Instagram feed. But it has the most lackluster and awful control of any model. It's a 2021-era model with 2030-level aesthetics. You can't place anything where you want it, you can't reuse elements, you can't have consistent sets... But it looks amazing. Flux looks like plastic, Imagen looks cartoony, and OpenAI GPT Image looks sepia and stuck in the 90's. These models need to compete on aesthetics and control and reproducibility.

That's a lot of work. Video is a distraction from this work.

  • Hot take: text-to-image models should be biased toward photorealism. This is because if I type in "a cat playing piano", I want to see something that looks like a 100% real cat playing a 100% real piano. Because, unless specified otherwise, a "cat" is trivially something that looks like an actual cat. And a real cat looks photorealistic. Not like a painting, or cartoon, or 3D render, or some fake almost-realistic-but-clearly-wrong "AI style".

    • FYI: photorealism is art that imitates photos, and I see the term misused a lot in both comments and prompts (where you'll actually get suboptimal results if you say "photorealism" instead of describing the camera that "shot" it!)


As a startup, they pivoted and focused on image models (they are model providers, image models often have more use cases than video models, and their dataset moat is bigger in images than in video).

  • > bigger image dataset moat

    If they have so much data, then why do Flux model outputs look so God-awful bad?

    They have plastic skin, weird chins, and have that "AI" aura. Not the good AI aura, mind you. The cheap automated YouTube video kind that you immediately skip.

    Flux 2 seems to suffer from the exact same problems.

    Midjourney is ancient. Their CEO is off trying to build a 3D volume and dating companion or some nonsense, leaving the product without guidance or much change. It almost feels abandoned. But even so, Midjourney has 10,000x better aesthetics despite having terrible prompt adherence and control. Midjourney images are dripping with magazine-spread or Pulitzer aesthetics. It's why Zuckerberg went to them to license their model instead of quasi "open source" BFL.

    Even SDXL looks better, and that's a literal dinosaur.

    Most of the amazing things you see on social media either come from Midjourney or SDXL. To this day.

    • >Even SDXL looks better, and that's a literal dinosaur.

      I’m not saying you are wrong in effect, but for reference, SDXL was released just slightly over 2 years ago, and it took about a year to get great fine-tunes.

I heard a possibly unsubstantiated rumor that they had a major failed training run with the video model and canceled the project.

  • Makes no sense, since they should have checkpoints from earlier in the run that they could restart from, and regular checks that track whether the model has exploded, etc.

    • I didn't read "major failed training run" as in "the process crashed and we lost all data" but more like "After spending N weeks on training, we still didn't achieve our target(s)", which could be considered "failing" as well.


    • There's always a possibility that something implicit to the early model structure causes it to explode later, even if it's a well known, otherwise stable architecture, and you do everything right. A cosmic bit flip at the start of a training run can cascade into subtle instability and eventual total failure, and part of the hard decision making they have to do includes knowing when to start over.

      I'd take it with a grain of salt; these people are chainsaw jugglers and know what they're doing, so any sort of major hiccup was probably planned for. They'd have plan b and c, at a minimum, and be ready to switch - the work isn't deterministic, so you have to be ready for failures. (If you sense an imminent failure, don't grab the spinny part of the chainsaw, let it fall and move on.)

  • lol, unless I’m wrong, that is not how model development works

    a ‘major training run’ only becomes major after you sample from it iteratively every few thousand steps, check it’s good, fix your pipeline, then continue

    almost by design, major training runs don’t fail

    if I had to guess, like most labs, they’ve probably had to reallocate more time and energy to their image models than expected, since the AI image-editing market has exploded in size this year, and will do video later

    • It could be that they weren't able to produce stable video -- i.e. getting a consistent look across frames. Video is more complex than image because of this. If their architecture couldn't handle that properly then no amount of training would fix it.

      If they found that their architecture worked better on static images, then it is better to pivot to that than to waste the effort. Especially if you have a trained model that is good at producing static images and bad at generating video.
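The sample-check-continue loop described in this subthread can be sketched in miniature. Everything here is illustrative (the names, the explosion threshold, the rollback policy are my assumptions, not any lab's actual pipeline), but it shows why a checkpointed run rarely "fails" outright: a loss blow-up triggers a rollback to the last good checkpoint rather than a total loss.

```python
# Toy sketch of the "sample every few thousand steps, check, continue" loop.
# All names, thresholds, and the rollback policy are illustrative assumptions,
# not a real lab's training pipeline.

def train_with_checkpoints(step_fn, total_steps, ckpt_every=100, explode_factor=10.0):
    """Run step_fn(state) -> (state, loss) for total_steps steps.

    Save a checkpoint every ckpt_every steps; if the loss goes NaN or blows
    up past explode_factor * best_loss, roll back to the last checkpoint
    instead of declaring the whole run a failure.
    """
    state = {"step": 0}
    checkpoints = [dict(state)]   # step-0 checkpoint
    best_loss = None
    rollbacks = 0
    while state["step"] < total_steps:
        state, loss = step_fn(state)
        state["step"] += 1
        if best_loss is None:
            best_loss = loss
        if loss != loss or loss > explode_factor * best_loss:  # NaN or explosion
            state = dict(checkpoints[-1])                      # restart from last good state
            rollbacks += 1
            continue
        best_loss = min(best_loss, loss)
        if state["step"] % ckpt_every == 0:
            checkpoints.append(dict(state))                    # "check it's good" point
    return state, rollbacks
```

With a step function that hits a transient loss spike at step 500, the loop recovers by restarting from the step-400 checkpoint and still finishes the run, which is roughly the "don't grab the spinny part, fall back and move on" behavior the comments describe.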