Comment by asenna
14 hours ago
Not one image - 5 frames per clip, sent in a single request with a transcript snippet. So the multi-frame + subtitles in one call part is the same as yours.
But yeah, how it picks the frame is the weak-point here. Scene detection would definitely help - this is #1 on the Roadmap.
Could you share how your scene-detection picks the frames?
---
For the vector search, I went for the trade-off of not having it but keeping it simple with plain Markdown files for more portability. The knowledge travels with the files when an SSD moves, no index to keep in sync, and plain text that outlives the tool. But the other path you mentioned is interesting as well to explore.
I originally limited mine to 10 frames spread evenly throughout the video, but it missed a fair bit of context at the analysis step, and didn't scale with length. So now when a video is loaded the app extracts a bunch of frames for the entire video, then calculates an image histogram and compares similarity to the previous one. There's some configuration so it doesn't send too many to the LLM, but still gets a good cross-section of frames to send.
You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.
Yeah, markdown works well if you're going to search through it with Claude Code or something like that. I built ClipScape as an Electron app with a local SQLite database, as I wanted an interface I could search and chat in and see the relevant thumbnails.