Comment by refulgentis

1 day ago

I agree that's the most likely interpretation - does it read as a shell game to you? Like, the encoder can do that, but once you involve the thing that actually uses its output it's 1/100th of that? Do they have anything that does useful work with the outputs of just MobileNet? If they don't, how are they sure I can build the 60 fps realtime audiovisual experiences they say I can?

Classification/similarity/clustering works fine with just an encoder, doesn't it?

I guess there's benefit to running just that step, without further subsampling, keeping the initial 256 tokens per image/frame ( https://ai.google.dev/gemma/docs/gemma-3n/model_card#inputs_... ) to go on from. https://github.com/antimatter15/reverse-engineering-gemma-3n suggests these are 2048-dimensional tokens, which means this 60 Hz frame digestion rate produces just under 31.5 million floats-of-your-chosen-precision per second. At least at the high (768x768) input resolution, that's a bit less than one float per pixel.
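
To make that arithmetic concrete, here's the back-of-napkin version (the 256-tokens-per-frame and 2048-dim figures come from the links above; everything else is just multiplication):

```python
# Token/float throughput at the claimed 60 fps (figures from the linked
# model card and reverse-engineering repo, not measured here).
fps = 60
tokens_per_frame = 256
dims = 2048
side = 768                                     # high input resolution

tokens_per_s = fps * tokens_per_frame          # 15,360 tokens/s
floats_per_s = tokens_per_s * dims             # ~31.5M values/s
pixels_per_s = fps * side * side               # ~35.4M pixels/s

print(f"{tokens_per_s} tokens/s")                               # 15360
print(f"{floats_per_s / 1e6:.1f}M values/s")                    # 31.5M
print(f"{floats_per_s / pixels_per_s:.2f} values per pixel")    # ~0.89
```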

I guess maybe with very heavy quantizing, to like 4 bits, that could beat sufficiently-artifact-free video coding for then streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15360 tokens/s at the (streaming) prefill stage?
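
Roughly, in numbers (assuming 4 bits per value and no further compression on top; what bitrate the competing video codec would actually need is left open, not estimated here):

```python
# Bandwidth of shipping 4-bit-quantized vision tokens to a cloud prefill stage.
values_per_s = 60 * 256 * 2048                  # ~31.5M values/s, from above
mbit_per_s = values_per_s * 4 / 1e6             # 4 bits per value
print(f"~{mbit_per_s:.0f} Mbit/s of token stream")   # ~126 Mbit/s
```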

Or I could imagine purely local, on-device visual semantic search: expand the search query into a bunch of tokens that each carry some signed desire/want-ness score, attend the search tokens against the frame's encoded tokens, run that through an activation function, scale it (to positive/negative) by each search token's desire score, and then just sum over each frame to get a frame score that can be used for ranking and other such search-related tasks (roughly the sketch below).
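
Something like this minimal sketch (shapes, names, and the choice of ReLU are illustrative assumptions on my part, not anything from the Gemma 3n API):

```python
import numpy as np

def frame_score(frame_tokens: np.ndarray,      # (256, 2048) encoder output
                search_tokens: np.ndarray,     # (S, 2048) expanded query
                desire: np.ndarray) -> float:  # (S,) signed want-ness per token
    d = frame_tokens.shape[-1]
    attn = search_tokens @ frame_tokens.T / np.sqrt(d)  # (S, 256) similarities
    act = np.maximum(attn, 0.0)                         # activation (ReLU here)
    per_search_token = act.sum(axis=-1)                 # pool over frame tokens
    return float((desire * per_search_token).sum())     # signed, summed score

# Ranking is then just sorting frames by score:
# scores = [frame_score(f, query_tokens, query_desire) for f in encoded_frames]
```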

(For that last thought, I asked Gemini 2.5 Pro to calculate the FLOPs load, and it came out to 1.05 MFLOPs per frame per search token; Reddit suggests the current Pixel's TPU does around 50 TOPS, so if those reasonably match terminology-wise, and assuming we spend about 20% of its compute on the search/match aspect, it comes out to an unreasonable-seeming ~190k tokens that the search query could get expanded to. I interpret this result to imply that quality/accuracy issues in the searching/filtering mechanism would hit before throughput issues do.)
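
For reference, redoing that estimate by hand (assuming the per-token cost is just the dot products against 256 frame tokens at 2048 dims, 2 FLOPs per multiply-add, plus the 50 TOPS and 20% figures above):

```python
# Search-token budget at 60 fps under the stated assumptions.
tokens_per_frame, dims, fps = 256, 2048, 60
flops_per_frame_per_search_token = tokens_per_frame * dims * 2   # ~1.05 MFLOPs
budget_per_s = 50e12 * 0.20                                      # 20% of 50 TOPS
max_search_tokens = budget_per_s / (flops_per_frame_per_search_token * fps)
print(f"{flops_per_frame_per_search_token / 1e6:.2f} MFLOPs per frame per search token")
print(f"~{max_search_tokens / 1e3:.0f}k search tokens sustainable at 60 fps")
# ~159k here vs. the ~190k quoted above, depending on rounding/assumptions --
# either way, far more query expansion than quality would plausibly survive.
```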

  • There's a lot of Not Even Wrong, in the Pauli sense, going on here, presumably because back-of-napkin-with-LLM is like rocket fuel. I love it. :) But the LLM got ahead of understanding the basics. I could probably write 900 words; let's pull out one thread as an example:

    > I guess maybe with very heavy quantizing, to like 4 bits, that could beat sufficiently-artifact-free video coding for then streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15360 tokens/s at the (streaming) prefill stage?

    The 6-7s I am seeing is what it costs to run the image model, even on GPU on an M4 Max with 64GB of GPU RAM. This repros with my llama.cpp wrapper and with the llama.cpp demo of it.

    It is simply getting tokens that is taking that long.

    Given that reality, we can ignore it, of course. We could assume the image model does run on Pixel at 60 fps and there's just no demo APK available, or just say it's all not noteworthy because, as the Google employee points out, they can do it inside Google and external support hasn't been prioritized.

    The problem is that the blog post announces this runs on device at up to 60 fps today, and announces $150K in prizes if you build on that premise. We have zero evidence of this externally, the most plausible demo of it released externally by Google runs at 1/500th of that speed, and one likely Google employee is saying "yup, it doesn't, we haven't prioritized external users!" The best steelman we can come up with is "well, if eventually the image model runs at 60 fps, we could stream it to an LLM in the cloud with about 4 seconds of initiate + prefill latency!"
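
    To put that gap in numbers (taking the 6-7 s per frame at face value against the 16.7 ms per frame that 60 fps implies):

    ```python
    # Rough scale of the external-vs-claimed gap, from the numbers in this thread.
    for observed_s_per_frame in (6.0, 7.0):
        print(f"{observed_s_per_frame * 60:.0f}x slower than 60 fps")  # 360x-420x
    # i.e. the same order of magnitude as the "1/500th" figure above.
    ```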

    That's bad.