
Comment by refulgentis

1 day ago

Something's really screwy with on-device models from Google. I can't put my finger on what, and I think being ex-Google is screwing with my ability to evaluate.

Cherry-picking something that's quick to evaluate:

"High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences."

You can download an APK from the official Google project for this, linked from the blog post: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...

If I download it and run it on a Pixel Fold with the actual 2B model, which is half the size of the ones the 60 fps claim is made for, it takes 6.2-7.5 seconds to begin responding (3 samples, 3 different photos). Generation speed is shown at 4-5 tokens per second, slightly slower than what llama.cpp does on my phone. (I maintain an AI app that, inter alia, wraps llama.cpp on all platforms.)

So, *0.16* frames a second, not 60 fps.
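
Rough math on that, treating time-to-first-response as the effective per-frame latency (that equivalence is my assumption; the numbers are my three samples above):

    # Back-of-the-envelope, Python
    time_to_first_response_s = 6.2            # best case of my 3 samples (range 6.2-7.5 s)
    frames_per_second = 1 / time_to_first_response_s
    print(round(frames_per_second, 2))        # ~0.16 fps, vs. the claimed 60 fps (~375x gap)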

The blog post is jammed up with so many claims about this being special for on-device use and about performance that just... seemingly aren't true. At all.

- Are they missing a demo APK?

- Was there some massive TPU leap since the Pixel Fold release?

- Is there a lot of BS in there that they're pretty sure won't be called out in a systematic way, given the amount of effort it takes to get this inferencing?

- I used to work on Pixel, and I remember thinking that it seemed like there weren't actually public APIs for the TPU. Is that what's going on?

In any case, either:

A) I'm missing something big, or

B) they are lying, repeatedly, big time, in a way that would show up near-immediately once you actually tried building on it, because it "enables real-time, on-device video analysis and interactive experiences."

Everything I've seen the last year or two indicates they are lying, big time, regularly.

But if that's the case:

- How are they getting away with it, over this length of time?

- How come I never see anyone else mention these gaps?

It looks to me from the marketing copy that the vision encoder can run at 60 FPS.

> MobileNet-V5-300M

Which makes sense, as it's 300M in size and probably far less complex, not a multi-billion-parameter transformer.

  • I agree that's the most likely interpretation - does it read as a shell game to you? Like, it can do that, but once you get the thing that can use the output involved, it's 1/100th of that? Do they have anything that does stuff with the outputs from just MobileNet? If they don't, how are they sure I can build the 60 fps real-time audiovisual experiences they say I can?

    • Classify/similarity/clustering works fine with just an encoder, doesn't it?
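
      Something like this would be all you'd need; a minimal sketch (the 256x2048 per-frame output shape is an assumption taken from the links below, and none of this is a real Google API):

          import numpy as np

          def frame_embedding(frame_tokens: np.ndarray) -> np.ndarray:
              # frame_tokens: (256, 2048) encoder output for one frame (assumed shape).
              # Mean-pool to one vector and L2-normalize so dot product = cosine similarity.
              v = frame_tokens.mean(axis=0)
              return v / np.linalg.norm(v)

          def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
              return float(a @ b)

          # Classification / clustering would then just be nearest-centroid or k-means
          # over these pooled vectors, no language model in the loop at all.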

      I guess there's benefit to running that step without subsampling to the initial 256 tokens per image/frame ( https://ai.google.dev/gemma/docs/gemma-3n/model_card#inputs_... ). To go on from that, https://github.com/antimatter15/reverse-engineering-gemma-3n suggests these are 2048-dimensional tokens, which makes this 60 Hz frame digestion rate produce just under 31.5 million floats-of-your-chosen-precision per second. At least at the high (768x768) input resolution, that's a bit less than one float per pixel (rough arithmetic sketched below).

      I guess maybe, with very heavy quantization to something like 4 bits, that could beat sufficiently-artifact-free video coding for then streaming the tokenized vision to a (potentially cloud) system that can keep up with the 15,360 tokens/s at the (streaming) prefill stage?
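
      Rough arithmetic behind those numbers (256 tokens/frame and 2048 dims from the two links above; 60 Hz and 768x768 from the claim; the 4-bit figure is just my assumption, ignoring any further compression):

          tokens_per_frame = 256                 # Gemma 3n model card, link above
          dims_per_token   = 2048                # reverse-engineering repo, link above
          fps              = 60                  # the claimed rate
          pixels_per_frame = 768 * 768           # high input resolution

          values_per_s = tokens_per_frame * dims_per_token * fps       # 31,457,280 ~= 31.5M floats/s
          values_per_pixel = values_per_s / (pixels_per_frame * fps)   # ~0.89, a bit under one per pixel

          prefill_tokens_per_s = tokens_per_frame * fps                # 15,360 tokens/s to keep up with

          # 4-bit quantized token stream: 31.5M values/s * 0.5 byte ~= 15.7 MB/s ~= 126 Mbit/s
          stream_bytes_per_s = values_per_s * 0.5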

      Or I could imagine purely local, on-device visual semantic search: expand the search query into a bunch of tokens that each have some signed desire/want-ness, attend the search tokens against the frame's encoded tokens, apply an activation function, scale (to positive/negative) by each search token's desire score, and then just sum over each frame to get a frame score that can be used for ranking and other such search-related tasks.
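
      As a sketch of that scoring idea (the names and the choice of ReLU as the activation are mine; nothing here is an actual shipped API):

          import numpy as np

          def frame_score(frame_tokens, query_tokens, desire):
              # frame_tokens: (256, 2048) encoder output for one frame (assumed shape)
              # query_tokens: (Q, 2048)   search query expanded into Q embedding tokens
              # desire:       (Q,)        signed want-ness per query token (+ attract, - repel)
              attn = query_tokens @ frame_tokens.T          # (Q, 256) attention-style scores
              act = np.maximum(attn, 0.0)                   # activation (ReLU, by assumption)
              return float((desire[:, None] * act).sum())   # signed sum over the frame

          # Ranking is then just sorting frames by frame_score for a given query.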

      (For that last thought, I asked Gemini 2.5 Pro to calculate the FLOPs load, and it came out to 1.05 MFLOPs per frame per search token. Reddit suggests the current Pixel's TPU does around 50 TOPS, so if those figures reasonably match terminology-wise, and assuming we spend about 20% of its compute on the search/match aspect, it comes out to an unreasonable-seeming ~190k tokens that the search query could be expanded to. I interpret this result to imply that quality/accuracy issues in the searching/filtering mechanism would hit before throughput issues would.)
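
      Redoing that estimate by hand (counting a multiply-add as 2 FLOPs; that and the 50 TOPS / 20% figures are the assumptions here) lands in the same low-hundreds-of-thousands ballpark:

          tokens_per_frame = 256
          dims = 2048
          fps = 60

          flops_per_query_token_per_frame = tokens_per_frame * dims * 2   # 1,048,576 ~= 1.05 MFLOPs

          tpu_ops_per_s = 50e12                          # ~50 TOPS, per the Reddit figure
          budget_per_frame = 0.20 * tpu_ops_per_s / fps  # compute available per frame

          max_query_tokens = budget_per_frame / flops_per_query_token_per_frame
          print(int(max_query_tokens))                   # ~159k query tokens per frame

      Same ballpark, same conclusion: the ceiling would be match quality, not throughput.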

The APK that you linked runs the inference on the CPU and does not run it on Google Tensor.

  • That sounds fair, but opens up another N questions:

    - Are there APK(s) that run on Tensor?

    - Is it possible to run on Tensor if you're not Google?

    - Is there anything at all from anyone I can download that'll run it on Tensor?

    - If there isn't, why not? (i.e. this isn't the first on-device model release by any stretch, so I can't give the benefit of the doubt at this point)

    • > Are there APK(s) that run on Tensor?

      No. The AICore service internally runs the inference on Tensor (http://go/android-dev/ai/gemini-nano)

      > Is there anything at all from anyone I can download that'll run it on Tensor?

      No.

      > If there isn't, why not? (i.e. this isn't the first on-device model release by any stretch, so I can't give the benefit of the doubt at this point)

      Mostly because 3P support has not been an engineering priority.
