Comment by storystarling
25 days ago
Generative models feel like the wrong abstraction here. I would try extracting keyframes and running them through CLIP or SigLIP to get embeddings. Then you can just do vector search to match the segments. Much lighter on compute.
I was talking to get LLMs to write the code or come up with an approach. I agree that the resulting solution does not need any kind of LLMs or even ML.