Indexing a year of video locally on a 2021 MacBook with Gemma4-31B (50GB swap)

1 day ago (blog.simbastack.com)

> The skill is open at ~/.claude/skills/video-index/. If you're working on something similar (indexing personal archives, getting a local model to do real archival work, building agents that drive editing tools), I'd be glad to compare notes.

When your Claude wrote this post they might not have selected the right URL to share, unless your home folder is exposed. Care to share the skill files?

  • We just got a modern example of the classic message from a friend who just picked up programming, containing: "I just created my own web app, wanna check it out? It's here: http://localhost:8080"

    • Different context, but I sent a message like that in Signal the other day to a family member with a link to my IP, pointing to `python -m http.server` running in a directory with a file for them to try (1). Easier than having them open my Samba share.

      1: To get an Android app working that has been delisted and requires a 'key' app that you purchase. We did purchase it, but didn't think to make any backups.

    • I've been getting this weekly from colleagues. It's very much an epidemic right now! And the port number is indeed almost always a random number between 8000 and 8100.

  • Oops! My bad. Fixing it now. And yeah, I can share the Skill file. Give me 5 mins.

    • Ok I scrambled to finalize a name for it and create a new repo for it - https://github.com/Simbastack-hq/framedex

      PS - I just put this together in the last few mins and removed my personal files and references. So it's not tested properly; please let me know if there are any issues.

      It's still an early hack, but I have thousands of still images as well from my camera which I've not processed and I need to do the same analysis for those.

      So I'll continue working on it, but happy to receive any PRs if anyone finds any use for it.

      I'm tired of having a backlog of thousands of images and videos, leaving it for later.

UPDATE: Quickly created a repo for this - https://github.com/Simbastack-hq/framedex (MIT License)

It's not tested properly after I genericized it. Will try to go through it properly and add more updates.

Two big things on my TODO: 1) Use this index, with Claude's help, to make video editing faster in DaVinci Resolve (now that I have a good index of all the content)

2) So far I've only done this for videos, but I want to extend it to the thousands of still images from my camera - I need to make sense of them. So I'll be working on this as well.

I'm not quite sure why all that swapping is necessary. It really does age your SSD quite fast considering the enormous memory bandwidth required. Gemma 4 31B at 4-bit quantization should only be around 19 GiB [1], not 28.4 GiB. I'm not feeding it images regularly, so I'm not sure how much memory it needs to get those into context, but I can't imagine it is more than 10 GiB.

The activity monitor does show all kinds of Electron apps active, on top of a presumably model-loaded Handy and a virtual machine for Claude Code, so I guess that's the real root cause for all the swapping. If your laptop starts thrashing I can't imagine you have any use for those apps, which will grind to a halt.

[1] https://huggingface.co/mlx-community/gemma-4-31b-it-4bit
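
As a rough cross-check on numbers like these, the memory needed for the weights alone can be estimated from parameter count and bits per weight. The sketch below is only a back-of-the-envelope floor: the 31e9 parameter count is read off the model name, the 4.5 bits/weight rate is an assumption (4-bit values plus per-group scale overhead), and it ignores layers kept at higher precision, the KV cache, and image inputs.

```python
GIB = 1024 ** 3


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


# 31e9 parameters is taken from the model name; 4.5 bits/weight is an
# assumed effective rate for 4-bit quantization with per-group scales.
for bits in (4.0, 4.5, 16.0):
    print(f"{bits:>4} bits/weight -> {weight_memory_gib(31e9, bits):.1f} GiB")
```

Whatever the process uses beyond this floor is context, runtime overhead, or other applications competing for RAM.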

  • Yeah to be fair, I could've cleaned everything up, but the screenshot was taken while I was doing other work on my laptop.

    Although slightly laggy, I was impressed by the fact that I was still able to work on other things and have a bunch of tabs open on my Brave browser.

This is great. I wish I had enough RAM for a local model. I just spent the last few weeks writing something very similar, but I made it a local Electron app with Whisper and ffmpeg, and I added semantic search and embeddings for chatting with the videos. It talks to Claude for the vision analysis, tagging and video chat. Do you only send one image for yours? I used a customised scene detection algorithm to find multiple different images per video and then send them all in one request to Claude (along with the subtitles). It's definitely the most expensive part. Using Sonnet 4.6 for the analysis and Haiku for the tagging costs about $1 for an hour of footage. I can imagine it would be slow locally.

  • Try some of the models on OpenRouter if you are looking to save money. Gemma 4 31B is $0.12/M input, $0.37/M output vs $1/M input, $5/M output for Haiku.

    There are other options that are good too. Gemini 3.1 Flash Lite is great for this kind of thing (NOT Gemini 3.5 Flash though - the pricing for that is bad).

    https://openrouter.ai/google/gemma-4-31b-it
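
To make the price gap concrete, here is a small cost calculator using the per-million-token prices quoted in the comment above. The model keys and the token counts in the example workload are hypothetical placeholders, not measurements from either project.

```python
# Prices in USD per million tokens, as quoted in the comment above.
PRICES = {
    "gemma-4-31b": {"input": 0.12, "output": 0.37},
    "haiku": {"input": 1.00, "output": 5.00},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost of a workload at the quoted per-million-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 2M input tokens (frames + transcripts), 200k output tokens.
for model in PRICES:
    print(model, round(cost_usd(model, 2_000_000, 200_000), 3))
```

Image inputs are billed differently across providers, so real costs depend on how each one counts image tokens.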

    • Cheers, I'll give it a try. How are those models at returning structured results? When I was writing the prompts for the analysis step and testing with older Claude models, it would have trouble structuring the XML consistently. Sonnet 4.6 handles it really well.

  • Not one image - 5 frames per clip, sent in a single request with a transcript snippet. So the multi-frame + subtitles in one call part is the same as yours.

    But yeah, how it picks the frames is the weak point here. Scene detection would definitely help - this is #1 on the Roadmap.

    Could you share how your scene-detection picks the frames?

    ---

    For the vector search, I went for the trade-off of not having it but keeping it simple with plain Markdown files for more portability. The knowledge travels with the files when an SSD moves, no index to keep in sync, and plain text that outlives the tool. But the other path you mentioned is interesting as well to explore.

    • I originally limited mine to 10 frames spread evenly throughout the video, but it missed a fair bit of context at the analysis step, and didn't scale with length. So now when a video is loaded the app extracts a bunch of frames for the entire video, then calculates an image histogram and compares similarity to the previous one. There's some configuration so it doesn't send too many to the LLM, but still gets a good cross-section of frames to send.

      You could also just use FFmpeg as it can do scene detection too. I tested both but liked the results from the histogram analyzer more.

      Yeah, markdown works well if you're going to search through it with Claude Code or something like that. I built ClipScape as an Electron app with a local SQLite database, as I wanted an interface I could search and chat in and see the relevant thumbnails.
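
The histogram comparison described in the reply above (compute a histogram per extracted frame, compare it with the previous frame's, cap how many frames go to the LLM) can be sketched roughly as follows. This is not ClipScape's code: the function names, the 16-bin grayscale histogram, the 0.7 threshold, and the `max_frames` cap are all assumptions, and frames are represented as flat lists of 0-255 grayscale values to keep the example dependency-free.

```python
from typing import List, Sequence


def histogram(pixels: Sequence[int], bins: int = 16) -> List[float]:
    """Normalized grayscale histogram of 0-255 pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // 256] += 1
    total = len(pixels) or 1
    return [c / total for c in counts]


def similarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Histogram intersection: 1.0 for identical, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def pick_keyframes(frames: Sequence[Sequence[int]],
                   threshold: float = 0.7,
                   max_frames: int = 10) -> List[int]:
    """Indices of frames that differ enough from the frame just before them."""
    kept: List[int] = []
    prev = None
    for i, frame in enumerate(frames):
        h = histogram(frame)
        # The first frame is always kept; later ones only on a visible change.
        if prev is None or similarity(h, prev) < threshold:
            kept.append(i)
            if len(kept) == max_frames:
                break
        prev = h
    return kept
```

In a real pipeline the pixel lists would come from decoded frames, and a per-channel color histogram would likely separate scenes better than grayscale.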

Two questions:

1. What is the search index?

2. The "description.md" example has things like "faces -> cluster_id". Is this from Davinci Resolve's face index? Things like faces+names and locations are really important with photo collections, but general LLMs don't handle them so well.

  • 1) It's just simple plain-text `.description.md` sidecar files, one per clip, sitting next to each video.

    Something which I can query later - like when brainstorming with Claude "I wanna make some videos of the Luxury rooms in the lodge" and it knows which videos could help here (going through the files).

    There's also a folder root-level file that aggregates the text descriptions to make things easier to find.

    I've just attached an image in the blog showing an example - https://blog.simbastack.com/_media/gvcycx2n.png

    2) No - nothing from DaVinci Resolve. Framedex is a standalone pipeline. Resolve isn't involved.

    Faces come from insightface (the open-source buffalo_l pack - RetinaFace for detection), running locally on CPU. For each clip it detects faces in the sampled frames, embeds them, and writes rows to ~/.framedex/faces.db.

    Tbh, I know this part is building up in my local DB, but I haven't tested how good it is. Will check it out properly soon.

    But yeah, on your broader point that's why framedex deliberately does not ask the LLM to handle faces or locations.

    ----

    Faces → insightface / ArcFace embeddings. Deterministic, comparable across clips. The vision model only contributes a rough people_count; it never tries to identify anyone.

    Locations → EXIF GPS via exiftool, reverse-geocoded through Nominatim/OpenStreetMap. Hard metadata, not a guess.

    The LLM only does what it's good at: scene description, mood, shot type, keywords, keep/review/cull rating (this last part is also debatable though).
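
The sidecar layout from answer 1 above (one plain-text description file next to each clip, plus a root-level aggregate) lends itself to very simple tooling. The sketch below is not framedex's implementation: it assumes sidecars are named like `clip.description.md`, and the `INDEX.md` output name and the function names are hypothetical.

```python
from pathlib import Path
from typing import List

# Assumed naming convention: clip.mp4 -> clip.description.md
SIDECAR_SUFFIX = ".description.md"


def find_sidecars(root: Path) -> List[Path]:
    """All sidecar files under root, in a stable order."""
    return sorted(p for p in root.rglob("*" + SIDECAR_SUFFIX) if p.is_file())


def search(root: Path, keyword: str) -> List[Path]:
    """Sidecars whose text mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [p for p in find_sidecars(root)
            if needle in p.read_text(encoding="utf-8").lower()]


def build_index(root: Path, out_name: str = "INDEX.md") -> Path:
    """Concatenate every sidecar into one root-level Markdown file."""
    parts = []
    for p in find_sidecars(root):
        rel = p.relative_to(root).as_posix()
        body = p.read_text(encoding="utf-8").strip()
        parts.append(f"## {rel}\n\n{body}\n")
    out = root / out_name
    out.write_text("\n".join(parts), encoding="utf-8")
    return out
```

Calling `search(Path("/Volumes/Archive"), "luxury")` (a made-up path) would list the clips whose descriptions mention luxury rooms. Because everything is plain text, the same lookup also works with `grep` or an agent reading the files directly.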

I ran Gemma on a 2015 ThinkPad to do something similar. Fortunately, I could upgrade the memory; otherwise it would have been a painful exercise.

Not gonna lie, llama.cpp had the fans spinning at max speed. But it worked and I got the job done.

  • > the fans spinning at max speed

    This always confuses me - don't people want their computations to run as fast as possible and thus inevitably produce more heat that needs to be vented?

    I suppose sometimes it is just an analogy for "it's utilizing 100% of my resources" (which I'm guessing it is here), but I've definitely had people say it as an actual complaint in different contexts

    • > I've definitely had people say it as an actual complaint in different contexts

      I think fan loudness is an outgrowth of conspicuous consumption because a certain OEM decided to make it a marketing bullet-point.

      I was equally disappointed by people - especially device reviewers - banging on the drum that phones made of plastic "didn't feel premium", and we got phones with glass backs that have to be shoved into plastic cases (because plastic is the near-perfect material to protect fragile phone screens and innards)

    • What people complain about is when they visit a blog with two images and the fans are spinning at max speed because the blog has 100 trackers.

    • Fans shouldn't be running at max speed if the model fits in RAM with room to spare for context. Usually fans max out when the model doesn't fit and the CPU is chugging to make up the difference (or the user didn't tune LLM settings)

> generative AI video has no place on a real travel brand

I am pretty sure that the vast majority of Airbnb hosts would not agree with you.

> equals TripAdvisor crucifixion

I have no idea how the Airbnb hosts with fake listings survive, really.

  • Haha. It's honestly something that I've been struggling with myself. I'm running this safari lodge but I don't want to go down that route of slop videos!

    But on the other hand, genuine videos do take time and slow down the process.

Thanks for the article! I have a beefy M5 Pro and I'm eagerly looking around for ways to use local models (specifically Gemma4 & Qwen3.6).

This is an excellent thing to do, especially since LLMs excel at batching, so you can index multiple photos and videos in parallel for no performance penalty.

  • Unsloth Studio [0] is what I recommend these days: an open source alternative to the more widely known LM Studio, and also built by the people who make good quantizations of released models. With MTP support now merged in, you should get 2x token generation speed with no accuracy difference. They also have MLX quants if you scroll down a bit, which is a format specifically for macOS's Metal GPU acceleration, but that's not integrated into Unsloth Studio just yet.

    [0] https://unsloth.ai/docs/models/qwen3.6#mtp-guide

    • I have researched for quite a bit and so far the fastest runtime is the oMLX one. But there's a caveat: ttft on MLX on M4 Pro is enormous. On M5 Pro it has been greatly sped up.

    • I tried Unsloth Studio recently and was disappointed - in particular the downloading functionality is half-baked and didn’t cope with resuming downloads. As it seemed to just be a simple wrapper over llama.cpp, I found that huggingface hub, llama.cpp, and a couple of simple scripts actually offered better functionality once it was set up.

  • Thanks! Video is still kinda new to me. But I have a large collection of amazing photos - tens of thousands of RAW images - just lying there spread across the different trip folders.

    You know what I REALLY want? Just point this beast at the folders and have it tell me which 150 shots are good to process from these 1,500 images. That's the dream!

    Although the technology is getting there, it's still a very difficult problem to solve. Taste and art are subjective. Also, as a photographer, I will always be concerned - "what if my best shot was in one of these rejected shots".

    But yeah, I think I'll try to do some more of these experiments soon.

    • there’s a lot of open models out there… I told Claude to do a weighted score on several models and deduplicate by CLIP similarity for an expedition, should be easy to replicate (see below). Sure doesn’t select the absolute best pics from an emotional impact perspective, but it was pretty damn good at me not having to wade through the bottom 80% of mediocre shots and dupes!

      ---

      “Models scored all 4,487 photos. NIMA rewards technical craft (sharpness, composition), LAION rewards emotional/aesthetic appeal, MUSIQ is more general quality. Combined: 0.4 NIMA + 0.3 LAION + 0.3 MUSIQ, deduped at 0.85 CLIP similarity.

      Interesting: the models wildly disagreed on some shots — one photo ranked NIMA #2 globally but LAION #4313.”
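
The scoring recipe in the quoted summary above (a 0.4/0.3/0.3 weighted sum, then deduplication at 0.85 similarity) can be sketched as below. Only the weights and the threshold come from the quote. The data layout, the function names, and the assumption that each model's score is already normalized to 0..1 are hypothetical, and the short vectors stand in for real CLIP embeddings.

```python
import math
from typing import Dict, List, Optional, Sequence

# Weights from the quoted summary; scores are assumed normalized to 0..1.
WEIGHTS = {"nima": 0.4, "laion": 0.3, "musiq": 0.3}


def combined_score(scores: Dict[str, float]) -> float:
    """Weighted sum of the per-model quality scores."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select(photos: List[dict], sim_threshold: float = 0.85,
           top_n: Optional[int] = None) -> List[dict]:
    """Rank by combined score, then greedily drop near-duplicates of kept photos."""
    ranked = sorted(photos, key=lambda p: combined_score(p["scores"]), reverse=True)
    kept: List[dict] = []
    for photo in ranked:
        if all(cosine(photo["embedding"], k["embedding"]) < sim_threshold for k in kept):
            kept.append(photo)
            if top_n is not None and len(kept) == top_n:
                break
    return kept
```

Ranking before deduplicating means the highest-scoring shot of each near-duplicate burst is the one that survives.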

  • I have been contemplating an M5 Pro MBP, but for the life of me I wasn't able to find benchmarks for real-world models. Do you happen to know roughly how many tokens per second you get with MoE models like Qwen 3.6 35B/A3B or Gemma 4 26B?

    • You need to ask macOS people for their prefill speed as well, there are two numbers you care about here, and current MacBooks have generally terrible numbers when it comes to prefill performance. Surely it'll get better with time, but if you already have a desktop, I'd go the "beefy GPU" route first.

    • Qwen 3.6 35B running on oMLX 0.3.9rc1: on oMLX I get 86 t/s on Q4 and 74 t/s on Q6.

      Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.

      Also bear in mind that those figures are with NO optimizations whatsoever: no MTP, no DFlash. I am waiting for both to be released for the Qwen models.

    • Native MTP:

      For Qwen 35B, enabling native MTP on MLX models slows it down by 10%.

      For Qwen 27B, enabling native MTP on MLX models speeds token generation up almost exactly 1.5x.

      (all tested on M5 pro).

My take is that B2C AI applications are kind of structurally limited by how hard it is to build personalized context.

The idea of capable local models could be a huge unlock here if they are able to do the bottom-up context collection research / tagging / etc. at scale.

  • I made a B2C AI app that's fully local (and free) to do AI based contextual file renaming.

    So if you give it a bunch of screenshots it will try and intelligently name them based upon what is in the screenshot. Same for videos, PDFs, etc.

    But to your point I haven't even tried charging money as it feels like something Apple is just going to bake in as a feature.

    https://finalfinalreallyfinaluntitleddocumentv3.com/

  • Definitely agree with this. Here, Claude and I did that research by brainstorming together, plus some trial and error to get to this.

    But I can tell it's only a matter of time before agents become smart enough that my non-tech friends can just say "Make sense of all these videos in my folder" and it just does it.

  • Is it really local models that unlock this? Surely stateless model APIs would yield the same benefits? I get that local can be “cheaper” depending on usage, but we’ve been renting storage and compute from clouds at a premium for ages..

    • A huge thing here was the massive amount of data that was just processed - I went through about 1TB of files over 24 hours.

      Using an API to analyze even a subset of this would've been painful imo.

Interesting. I've been doing similar stuff with my archive on a weak Celeron laptop with 4GB RAM using vanilla ML tech that I'm learning by prompting LLMs (heh). Extract all info from media as sidecar files and all, exploring low power approaches.

I can sell this as a service to people who can't even run an LLM, or don't want to cook their hardware.

Waitlist open:

"Catalog, search, preview, and generate production-ready prompts & scripts from your entire archive — on your existing hardware. Then render in the cloud."

https://harlanji.pythonanywhere.com/assetforge/

> Every AI video editor on the market assumes your footage is already labeled

Shameless plug: I'm the founder of Chat Octopus, an AI media assistant, and it actually 'looks' at the videos to understand them before creating a cut.

[flagged]

  • Could you please not post generated comments to HN? It's not allowed here. See https://news.ycombinator.com/item?id=47340079.

    We ban accounts that do this and I don't want to ban you, so please write everything that you post to HN by hand.

    Of course, it's impossible to know for sure what was LLM processed or not, but we're getting complaints about some of your posts and, upon inspection, the complaints seem justified.

Awesome. Say, this is very comprehensive.

I was vaguely aware of all these pieces existing (except for running a facial recognition database at home o_o), but it's really neat to put them all together like that.

  • Thanks! I was honestly casually trying it out on the side with Claude's help. And I was actually pleasantly surprised to see how good the result was.

    Still blows my mind I can do all this from my 2021 MBP.

    I'll try to do a post once I have the next steps working (helping with planning and editing videos with DaVinci Resolve).

    • I also have a 64GB M1 Max and am similarly impressed with what that workhorse can do. The M5 tempted me -- a lot -- but then I looked at what I was already getting done on that machine and just couldn't justify it ... yet. Someday, surely, but not yet. Gemma4 gave all my local projects new life, just like what you did here.

      Great job. Long live the M1 Max!

The post is a mix of human and AI writing and the AI-mannerisms get on the nerves. At least it has a clear topic and some actionable insights and code examples.

I’d like to do something like this for the collection of home videos I have piling up, but I’m still on a 16GB M1. Any hope of getting decent results with smaller models? If not, does anyone have tips on GPU rental?

I have a Claude max sub and plenty of OpenRouter credit, but I don’t feel good about uploading my family’s private videos

Reading this text feels strange, sentences seem to be detached

  • I had exactly the same impression, and I recall seeing this style other times recently. First time I thought it was just bad writing skills, now I'm thinking it's AI generated.

    • I'm the author, yes it is AI-assisted.

      You can make AI-generated content without it being slop. Slop, to me at least, is content that's wrong, padded, or generic.

      I see the cadence / short-sentence issues but if there's something else beyond those, I'd actually want to know what made it feel bad.

      I would've put off documenting what I did over the weekend but instead, I did document everything, spent quite some time (several iterations) and effort to make sure it does not hallucinate and writes in my own tone and voice. I'm sure it could be better but the content is not made-up.

      At a time where most of us software engineers have changed our workflows to let AI write 80+% of our code using agents, I feel writing is heading the same way. It then becomes a matter of taste, whether it's done well or not.

      If you're looking for clues and signs of whether a piece of content has used AI, you're going to be disappointed over the next 12 months.

      If it feels jarring right now, I'll work harder on the workflow so it feels more natural next time (someone shared this project with me - https://github.com/blader/humanizer).

      But this clearly allows me to make content which I wouldn't have done earlier.

The content is good, but this LLM writing style gets tiresome. Everything is a revelation:

>“I bought it for Chrome. It's running a model that didn't exist when I bought it.”

Well duh, personal computers run new software. That’s literally the whole point. The Apple II didn’t sell on the strength of the preinstalled apps.

  • Author here. I totally hear you. I wasn't expecting this to do well on HN for exactly this reason.

    But as I've mentioned elsewhere - if it wasn't for all the AI assistance, I would've put off documenting everything that I did and never even gotten to the writing part.

    But yeah, I'll be working on the workflow to make the next write-up better, more humanized.

Now I have another project for this weekend! I also have tons of video and not a lot of time to index them.

The subject matter is interesting but the amount of slop makes it difficult to read through. Yeah, it's great that you can throw your technical problems at Claude without caring much about the generated output but treating your own writing that you actually want to share with the world the same way is a terrible idea.

  • Tbh, I did spend a lot of time trying to ground it and de-slopify it - verified nothing was hallucinated and went through 10 iterations to get to this. It's almost like wrestling with Claude and I knew it would be tough on HN.

    But because of the fear of non-perfection, I used to put off things like creating this article or even posting it anywhere. And I do think the article has real value that HN would appreciate (I am myself an HN-enthusiast).

    I'll try more. Someone else shared this project which would be really helpful - https://github.com/blader/humanizer

    Also a side note, the blog is posted on my self-created Slopit.io platform which is purely meant for your personal agents (working along with you) to post content - I recommend trying it out. https://blog.slopit.io/this-blog-post-is-slop/

    I know, things are getting difficult with all the slop around, but my personal opinion is, as the agents get better at writing, the "annoying-ness" factor reduces and pieces of substance will still be appreciated, even if it was written by agents. This and the fact that agents aren't going away.

    If I've automated a lot of my coding, I feel like engineers like me would naturally progress to also taking agents' help to write useful content.

    PS - this comment was 100% hand-typed.

    • For what it's worth, I really enjoyed this read and almost came here to comment "this is the most enjoyable llm-assisted article I've read in a while"

      The tells were unmistakable but it still had a human touch, so I for one am glad you published anyway.
