Comment by lelandbatey
20 hours ago
https://chatjimmy.ai being a demo of the "burn the model to an ASIC" approach being sold by Taalas[0], an approach which they use to run Llama 3.1 8B at ~17000 tokens per second.
20 hours ago
https://chatjimmy.ai being a demo of the "burn the model to an ASIC" approach being sold by Taalas[0], an approach which they use to run Llama 3.1 8B at ~17000 tokens per second.
Not to downplay their accomplishment but Llama 3.1 8B is a terrible model. It's really outdated at this point. It's cool that they were able to accelerate a model with silicon, but it also feels wasteful since llama 8B is such a useless model?
I guess their point was to demonstrate that it's possible to bake a decently-sized model to a silicon? As with anything related to HW, I guess the lead time will be considerably larger than the software counterparts, so I guess in 1-2 years timeframe we might see something like Gemma 4 baked onto a silicon.
Yeah, I think the important part is the process to convert the model to silicon, not the actual implementation itself.
Whether it succeeds now depends a lot on the rate of improvement of model architecture. They're betting on model design and capability improvements slowing down - and then wiping the floor with everyone else with their inference economics.
I agree, Gemma 3 12B is a very good model for its size and it was only obsoleted by Gemma 4.
Heck, I'm still a fan of Gemma 2 9B.