Comment by lelandbatey

20 hours ago

https://chatjimmy.ai is a demo of the "burn the model to an ASIC" approach being sold by Taalas[0], which they use to run Llama 3.1 8B at ~17000 tokens per second.

[0] - https://taalas.com/products/

4 comments

snek_case 13 hours ago

Not to downplay their accomplishment, but Llama 3.1 8B is a terrible model. It's really outdated at this point. It's cool that they were able to accelerate a model with silicon, but it also feels wasteful, since Llama 3.1 8B is such a useless model.

puilp0502 10 hours ago

I guess their point was to demonstrate that it's possible to bake a decently sized model into silicon? As with anything hardware-related, the lead time will be considerably longer than for software, so in a 1-2 year timeframe we might see something like Gemma 4 baked into silicon.
- leoedin 8 hours ago
  
  Yeah, I think the important part is the process to convert the model to silicon, not the actual implementation itself.
  Whether it succeeds now depends a lot on how quickly model architectures keep improving. They're betting on model design and capability improvements slowing down, and then on wiping the floor with everyone else with their inference economics.
imtringued 8 hours ago

I agree, Gemma 3 12B is a very good model for its size and it was only obsoleted by Gemma 4.
Heck, I'm still a fan of Gemma 2 9B.