Show HN: Text-to-video model from scratch (2 brothers, 2 years, 2B params)

18 days ago (huggingface.co)

Writeup (includes good/bad sample generations): https://www.linum.ai/field-notes/launch-linum-v2

We're Sahil and Manu, two brothers who spent the last 2 years training text-to-video models from scratch. Today we're releasing them under Apache 2.0.

These are 2B param models capable of generating 2-5 seconds of footage at either 360p or 720p. In terms of model size, the closest comparison is Alibaba's Wan 2.1 1.3B. From our testing, we get significantly better motion capture and aesthetics.

We're not claiming to have reached the frontier. For us, this is a stepping stone towards SOTA - proof we can train these models end-to-end ourselves.

Why train a model from scratch?

We shipped our first model in January 2024 (pre-Sora) as a 180p, 1-second GIF bot, bootstrapped off Stable Diffusion XL. Image VAEs don't understand temporal coherence, and without the original training data, you can't smoothly transition between image and video distributions. At some point you're better off starting over.

For v2, we use T5 for text encoding, Wan 2.1 VAE for compression, and a DiT-variant backbone trained with flow matching. We built our own temporal VAE but Wan's was smaller with equivalent performance, so we used it to save on embedding costs. (We'll open-source our VAE shortly.)
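For readers new to flow matching, here is a minimal sketch of the training objective. It assumes the common linear-interpolation (rectified flow) formulation, which may differ from what the released models actually use. All function names are hypothetical, and the plain Python lists stand in for VAE latents.

```python
import random

def flow_matching_example(data, noise, t):
    """Build one training pair (assumed linear-interpolation path).

    x_t = (1 - t) * noise + t * data, and the regression target is the
    constant velocity (data - noise) along that straight path.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target_velocity = [d - n for n, d in zip(noise, data)]
    return x_t, target_velocity

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def flow_matching_loss(model, data, rng):
    """One stochastic loss sample: draw noise and a time, regress velocity.

    `model(x_t, t)` is a hypothetical callable standing in for the DiT;
    it must return a vector the same length as `data`.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()  # uniform time in [0, 1); other samplers are common
    x_t, target_velocity = flow_matching_example(data, noise, t)
    return mse(model(x_t, t), target_velocity)

if __name__ == "__main__":
    rng = random.Random(0)
    zero_model = lambda x_t, t: [0.0 for _ in x_t]  # untrained stand-in
    print(flow_matching_loss(zero_model, [1.0, 2.0, 3.0], rng))
```

In a real setup the model would also be conditioned on the T5 text embeddings, and the loss would be averaged over a batch of latents.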

The bulk of development time went into building curation pipelines that actually work (e.g., hand-labeling aesthetic properties and fine-tuning VLMs to filter at scale).
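As a rough illustration of filtering at scale, the sketch below keeps a clip only if every scored property clears a threshold. The `Clip` type, the property names, and the scorer are hypothetical. In practice the scorer would be a VLM fine-tuned on the hand labels, and the real pipeline is likely more involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

@dataclass
class Clip:
    path: str
    caption: str

# A scorer maps a clip to per-property scores in [0, 1] (hypothetical interface).
Scorer = Callable[[Clip], Dict[str, float]]

def filter_clips(clips: Iterable[Clip], scorer: Scorer,
                 thresholds: Dict[str, float]) -> List[Clip]:
    """Keep clips whose score meets the minimum for every listed property.

    A property missing from the scorer's output counts as 0.0, so the
    clip is dropped rather than silently accepted.
    """
    kept = []
    for clip in clips:
        scores = scorer(clip)
        if all(scores.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items()):
            kept.append(clip)
    return kept

if __name__ == "__main__":
    # Toy scorer standing in for a fine-tuned VLM.
    def toy_scorer(clip: Clip) -> Dict[str, float]:
        good = "sunset" in clip.caption
        return {"aesthetic": 0.9 if good else 0.2, "sharpness": 0.8}

    clips = [Clip("a.mp4", "sunset over hills"), Clip("b.mp4", "blurry hallway")]
    kept = filter_clips(clips, toy_scorer, {"aesthetic": 0.5, "sharpness": 0.5})
    print([c.path for c in kept])
```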

What works: Cartoon/animated styles, food and nature scenes, simple character motion. What doesn't: Complex physics, fast motion (e.g., gymnastics, dancing), consistent text.

Why build this when Veo/Sora exist? Products are extensions of the underlying model's capabilities. If users want a feature the model doesn't support (character consistency, camera controls, editing, style mapping, etc.), you're stuck. To build the product we want, we need to update the model itself. That means owning the development process. It's a bet that will take time (and a lot of GPU compute) to pay off, but we think it's the right one.

What’s next?
- Post-training for physics/deformations
- Distillation for speed
- Audio capabilities
- Model scaling

We kept a “lab notebook” of all our experiments in Notion. Happy to answer questions about building a model from 0 → 1. Comments and feedback welcome!

26 comments

schopra909

tariqshams 17 days ago

Very cool, especially given that it’s a two person team. I will be checking this out on the weekend.

Also, I’m super curious about how you’re approaching more realistic physics with post-training.

WhitneyLand 18 days ago

Great work. How many GPU hours to train?

convivialdingo 17 days ago

That’s amazing effort - I am impressed.

Awesome to see more small teams making impressive leaps.

taherchhabra 17 days ago

I want to build my own video model, just for learning purposes. Is there any course that teaches this end to end?

schopra909 17 days ago

I think YC just released a video on the basics of diffusion, but honestly I don’t have a good end-to-end guide.
We’re going to write up going 0->1 on a video model (all the steps) over the coming months. But it likely won’t be a class or anything like that.
https://www.linum.ai/field-notes
We want to share our learnings with folks who are curious about the space - but don’t have time to make it a full class experience.
Hopefully karpathy does that with his courses in the future!
mandeepj 17 days ago

> I want to build my own video model, just for learning purposes
Sorry, it might sound like a cliche, but try that as a prompt to a deep thinking and learning model, and see what comes out.
An expensive option: Look at Project #5 at https://bytebyteai.com/

popalchemist 17 days ago

Incredibly impressive, dudes. Well done.

whywhywhywhy 17 days ago

> We kept a “lab notebook” of all our experiments in Notion

Couldn't find a link to this, is this public?

schopra909 17 days ago
Not public yet. We’re going to clean it up so it’s readable and release it as blog posts. The first one will be everything you need to know on building a VAE for image and video. It should be out in a few weeks. We’re figuring out the right balance between spending time writing and all the work we have on our plate for the next model.
If you’re interested in this stuff, keep an eye on field notes (our blog).
- schopra909 17 days ago
  
  https://www.linum.ai/field-notes

throwaway314155 17 days ago

How much compute was ultimately required to get this done?

E-Reverance 18 days ago

Post it on r/StableDiffusion

glohbalrob 16 days ago

Nice work. Are you guys on X?

streamer45 18 days ago

Rad! huggingface link gives 404 on my side though.

schopra909 18 days ago
Oh damn! Thanks for catching that. Going to ping the HF folks to see what they can do to fix the collection link.
In the meantime, here are the individual links to the models:
https://huggingface.co/Linum-AI/linum-v2-720p
https://huggingface.co/Linum-AI/linum-v2-360p
- streamer45 18 days ago
  
  Looks like 20GB VRAM isn't enough for the 360p demo :( need to bump my specs :sweat_smile:
- schopra909 18 days ago
  
  Should be fixed now! Thanks again for the heads up
  
