Comment by andai
25 days ago
In my experience an LLM could probably handle this. And it's not so novel. They can make an image stitcher which is basically the same problem.
It would probably need to download the whole video first though, so I'm not sure it would work as an extension. And analysing all frames would be expensive upfront if you're using it interactively and waiting for a video to start playing.
You might be able to get away with just looking for repetition in the audio.
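A rough sketch of what I mean, assuming you already have some coarse per-second audio fingerprint for each video (the fingerprinting step itself is not shown, and the integer values below are made up). It just looks for the longest run that two fingerprint sequences share, using the standard library's difflib:

```python
from difflib import SequenceMatcher


def longest_shared_run(fp_a, fp_b, min_len=5):
    """Return (start_a, start_b, length) of the longest identical run
    shared by two fingerprint sequences, or None if it is shorter
    than min_len. Fingerprints are assumed to be hashable values,
    one per fixed-size audio window (hypothetical format)."""
    matcher = SequenceMatcher(None, fp_a, fp_b, autojunk=False)
    m = matcher.find_longest_match(0, len(fp_a), 0, len(fp_b))
    if m.size < min_len:
        return None
    return (m.a, m.b, m.size)


# Toy fingerprints for two episodes; the run 9, 9, 4, 2, 8, 6 appears in both.
ep1 = [3, 7, 1, 9, 9, 4, 2, 8, 6, 5, 0]
ep2 = [5, 5, 9, 9, 4, 2, 8, 6, 1, 3]
print(longest_shared_run(ep1, ep2))  # (3, 2, 6)
```

Real audio would need a fingerprint that tolerates re-encoding noise, so exact matching like this is only a starting point.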
Yeah, my point was to download videos in bulk and scan them, then mark these segments in SponsorBlock.
LLMs failed to produce any kind of performant solution.
Generative models feel like the wrong abstraction here. I would try extracting keyframes and running them through CLIP or SigLIP to get embeddings. Then you can just do vector search to match the segments. Much lighter on compute.
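The matching step could look roughly like this. It assumes the keyframe embeddings were already computed elsewhere (the CLIP/SigLIP part is not shown), and the 3-dimensional vectors, the `match_frames` helper, and the 0.95 threshold are all made up for illustration. A real setup would likely use a vector index instead of a brute-force loop:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def match_frames(frames, references, threshold=0.95):
    """Return indices of frames whose best similarity to any
    reference embedding is at least threshold (brute force)."""
    hits = []
    for i, frame in enumerate(frames):
        best = max((cosine(frame, ref) for ref in references), default=0.0)
        if best >= threshold:
            hits.append(i)
    return hits


# Toy data: two reference embeddings from known segments, four new keyframes.
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.0, 0.0, 1.0], [0.9, 0.1, 0.0], [0.1, 0.95, 0.0], [0.5, 0.5, 0.7]]
print(match_frames(frames, references))  # [1, 2]
```

Consecutive hit indices would then be grouped into candidate segments.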
I was talking about getting LLMs to write the code or come up with an approach. I agree that the resulting solution does not need any kind of LLM or even ML.