Comment by carpo

18 hours ago

This is great. I wish I had enough ram for a local model. I just spent the last few weeks writing something very similar, but I made it a local Electron app with Whisper, ffmpeg and I added semantic search and embeddings for chatting with the videos. It talks to Claude for the vision analysis, tagging and video chat. Do you only send one image for yours? I used a customised scene detection algorithm to find multiple different images per video and then send them all in one request to Claude (along with the subtitles). It's definitely the most expensive part. Using Sonnet 4.6 for the analysis and Haiku for the tagging costs about $1 for an hour of footage, I can imagine it would be slow locally.

7 comments

carpo

nl 17 hours ago

Try some of the models on OpenRouter if you are looking to save money. Gemma 4 31B is $0.12/M input, $0.37/M output vs $1/M input, $5/M output for Haiku.

There are other options that are good too. Gemini 3.1 Flash Lite is great for this kind of thing (NOT Gemini 3.5 Flash though - the pricing for that is bad).

https://openrouter.ai/google/gemma-4-31b-it

carpo 16 hours ago
Cheers, I'll give it a try. How are those models at returning structured results? When I was writing the prompts for the analysis step and testing with older Claude models, it would have trouble structuring the XML consistently. Sonnet 4.6 handles it really well.
- nl 16 hours ago
  
  Use function calling/tool use, not XML output. The models are all trained for that now.
  Ie, instead of telling it to generate
  <name>Name</name> <age>19</name> <address>whatever</name>
  give it a function
  details(name: string, age: int, address: string)
  That is actually a JSON schema, and the models do great at it. Here's the claude docs, but they are all similar: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
  
  2 replies →

asenna 12 hours ago

Not one image - 5 frames per clip, sent in a single request with a transcript snippet. So the multi-frame + subtitles in one call part is the same as yours.

But yeah, how it picks the frame is the weak-point here. Scene detection would definitely help - this is #1 on the Roadmap.

Could you share how your scene-detection picks the frames?

---

For the vector search, I went for the trade-off of not having it but keeping it simple with plain Markdown files for more portability. The knowledge travels with the files when an SSD moves, no index to keep in sync, and plain text that outlives the tool. But the other path you mentioned is interesting as well to explore.

carpo 10 hours ago

I originally limited mine to 10 frames spread evenly throughout the video, but it missed a fair bit of context at the analysis step, and didn't scale with length. So now when a video is loaded the app extracts a bunch of frames for the entire video, then calculates an image histogram and compares similarity to the previous one. There's some configuration so it doesn't send too many to the LLM, but still gets a good cross-section of frames to send.
You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.
Yeah, markdown works well if you're going to search through it with Claude Code or something like that. I built ClipScape as an Electron app with a local SQLite database, as I wanted an interface I could search and chat in and see the relevant thumbnails.