Comment by barnas2
1 day ago
A company called Taalas is working on something like that. It's not Opus 4.6 quality, but I'm sure they're targeting larger models. Currently they're using a Llama 8B model. It runs at ~17k tokens per second, and you can test it at https://chatjimmy.ai/.
I'm rooting for them HARD, but they've been quiet since their last (and only) blog post. X and LinkedIn are empty too. I really hope it wasn't a pipe dream.
It starts to get interesting when the latency is better than the average website's.
I’m not sure if this is what you meant, but at 17k t/s, you start to compete with the speed of network calls. You could approach the point of generating an HTML/JS/CSS page faster than some websites can return one over the network.
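To put rough numbers on that, here's a back-of-the-envelope sketch in Python. The 17k t/s figure is the one quoted above; the 2,000-token page size and the 200 ms latency budget are illustrative assumptions of mine, not measurements, and the function names are made up for this example. It also ignores time to first token and the time to deliver the generated page to the browser.

```python
def generation_time_ms(num_tokens: int, tokens_per_second: int) -> float:
    """Time to generate num_tokens at a steady decode rate, in milliseconds."""
    return num_tokens / tokens_per_second * 1000


def tokens_within_budget(budget_ms: int, tokens_per_second: int) -> int:
    """How many whole tokens fit inside a latency budget at that rate."""
    return tokens_per_second * budget_ms // 1000


TOKENS_PER_SECOND = 17_000  # throughput quoted in the thread
PAGE_TOKENS = 2_000         # assumed size of a small HTML/JS/CSS page
BUDGET_MS = 200             # assumed "feels instant" latency budget

page_ms = generation_time_ms(PAGE_TOKENS, TOKENS_PER_SECOND)
budget_tokens = tokens_within_budget(BUDGET_MS, TOKENS_PER_SECOND)

print(f"{PAGE_TOKENS} tokens at {TOKENS_PER_SECOND} t/s: {page_ms:.1f} ms")
print(f"Tokens that fit in {BUDGET_MS} ms: {budget_tokens}")
```

Under those assumptions a small page takes on the order of 120 ms to generate, which is in the same ballpark as a single slow network round trip. A larger page scales linearly, so the comparison only holds for fairly compact output.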
The immediate load (less than 200ms on my machine through a slow connection) is quite pleasant, tbh.
That's cool. I just tested it out and it is fast, but unfortunately its accuracy is not great.
It's an 8B model. Consider it a proof-of-concept.