
Comment by ivape

6 months ago

It takes about a $400 graphics card to comfortably run something like a 3B-8B model, where comfortable means fast inference and a good-sized context. 3B-5B models are what devices can somewhat fit today. That means for us to get good local models, we'd have to shrink one of those $400 graphics cards down to a phone.
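
As a rough sketch of why, here is a back-of-the-envelope VRAM estimate; the 4-bit quantization, fp16 KV cache, and ~130 KB/token figures are assumptions that roughly match an 8B-class model with grouped-query attention, not measurements:

```python
# Back-of-the-envelope VRAM estimate: weights + KV cache,
# ignoring activations and runtime overhead. All figures are
# illustrative assumptions, not measurements.

def vram_gb(params_b: float, bits_per_weight: float,
            ctx_tokens: int, kv_bytes_per_token: int) -> float:
    weights = params_b * 1e9 * bits_per_weight / 8  # bytes
    kv_cache = ctx_tokens * kv_bytes_per_token      # bytes
    return (weights + kv_cache) / 1e9

# 8B model at 4-bit quantization with a 16k context; ~130 KB/token
# roughly matches an 8B-class model with GQA and an fp16 KV cache.
print(f"{vram_gb(8, 4, 16_384, 130_000):.1f} GB")  # -> ~6.1 GB
```

Roughly 6 GB just for weights plus cache is why this class of model sits on a dedicated GPU today.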

I don’t see this happening in the next 5 years.

The Mac mini being shrunk down to phone size is probably the better bet, and we'd also have to bring power consumption down by a lot. Edge hardware is a ways off.

Gemma 3n E4B runs at 35 tk/s prompt processing and 7-8 tk/s decode on my three-generations-old flagship Android.
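
For scale, plugging those reported numbers into simple arithmetic shows where long prompts hurt; the 4k-token prompt and 256-token reply below are a hypothetical workload, not figures from the thread:

```python
# Convert the reported throughput into wall-clock latency for a
# hypothetical request: 4k-token prompt, 256-token reply.
pp_tps = 35.0   # prompt processing tokens/s, as reported
tg_tps = 7.5    # decode tokens/s, midpoint of the reported 7-8

prompt_tokens, reply_tokens = 4096, 256
prefill_s = prompt_tokens / pp_tps   # ~117 s before the first token
decode_s = reply_tokens / tg_tps     # ~34 s to stream the reply
print(f"TTFT {prefill_s:.0f}s, total {prefill_s + decode_s:.0f}s")
# -> TTFT 117s, total 151s
```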

  • I doubt this. What kind of t/s are you getting once your context window is reasonably saturated? It probably slows to a crawl, making the hardware not good enough yet; the sketch below shows why.
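
As a sketch of that slowdown: decode on phone-class hardware is largely memory-bandwidth-bound, so throughput falls as the KV cache grows. The 50 GB/s bandwidth, 4 GB of quantized weights, and 130 KB/token KV cache below are illustrative assumptions, not measurements:

```python
# Decode is memory-bandwidth-bound: each new token streams all the
# weights plus the entire KV cache, so roughly
#   tok/s ~= bandwidth / (weight_bytes + kv_cache_bytes).
# The 50 GB/s bandwidth, 4 GB of weights, and 130 KB/token KV cache
# are illustrative assumptions for a phone-class SoC.

def decode_tps(bw_gbps: float, weight_gb: float,
               ctx_tokens: int, kv_kb_per_token: float) -> float:
    kv_gb = ctx_tokens * kv_kb_per_token / 1e6  # KB -> GB
    return bw_gbps / (weight_gb + kv_gb)

for ctx in (0, 8_192, 32_768):
    print(f"{ctx:>6} ctx: {decode_tps(50, 4, ctx, 130):.1f} tok/s")
# ->      0 ctx: 12.5 tok/s
# ->   8192 ctx: 9.9 tok/s
# ->  32768 ctx: 6.1 tok/s
```

Under these assumptions, decode throughput roughly halves by a 32k context, which is exactly the saturation effect being asked about.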