Comment by rubicon33

23 days ago

Is there anything unique here happening for the video aspect or is it just taking snapshots over and over?

I’ve been looking for a good video summarizing / understanding model!

Nothing unique, it's just taking a snapshot while it's processing the input. Even processing a single image increases the TTFT by ~0.5s on my machine, so for now it seems impossible to feed it a live video and expect a real-time response.

As for the video capability, I haven't tested it myself, but here's a benchmark/comparison from Google [0]

[0] https://huggingface.co/blog/gemma4#video-understanding

  • I totally get that these are very hard problems to solve and that we're on the bleeding edge of what's possible, but I can't help but wonder when someone is going to crack real video understanding.

    sure, maybe it's still frame-by-frame, but so fast and so often that the model retains a rolling context of what's going on and can cleanly answer temporal questions.

    "how packages were delivered over the last hour", etc.
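    The rolling-context idea could be sketched roughly like this — a toy Python example, not any real model's API: frames are sampled at a fixed rate, each one is reduced to a caption (here just a hand-written string standing in for a vision-language model's output), and a fixed-size deque keeps only the last hour, so temporal questions like the package one become queries over the buffer:

    ```python
    from collections import deque

    FPS_SAMPLE = 1         # frames sampled per second (assumed rate)
    WINDOW_SECONDS = 3600  # keep roughly one hour of context

    # Rolling buffer of (timestamp, caption) pairs; in a real system the
    # caption would come from running a VLM on each sampled frame.
    context = deque(maxlen=FPS_SAMPLE * WINDOW_SECONDS)

    def ingest_frame(timestamp, caption):
        """Add one captioned frame; the oldest entries fall off automatically."""
        context.append((timestamp, caption))

    def answer_temporal_query(keyword):
        """Naive stand-in for the model: timestamps whose caption matches."""
        return [t for t, c in context if keyword in c]

    # Simulate an hour of sparse events
    ingest_frame(100, "courier places package at door")
    ingest_frame(1900, "empty porch")
    ingest_frame(3500, "courier places package at door")

    print(answer_temporal_query("package"))  # → [100, 3500]
    ```

    Obviously the hard part is everything this sketch hand-waves away: captioning each frame fast enough (the ~0.5s TTFT per image mentioned above), and getting the model to reason over the buffer instead of a keyword match.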