Comment by macNchz

1 day ago

This is something I've been thinking about for a while...the current state of things really does feel kind of like the dialup era, and I keep wondering what the "broadband" era could look like. Watching tokens stream in is reminiscent of watching a jpeg load a few rows of pixels at a time, and of the various loading and connecting animations that applications implemented before things got fast enough to make them less relevant.

Some of the work in that direction, like what Cerebras or Taalas have been doing, is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder what might be possible if even current state of the art models were available at, like, a million tokens per second at a very low cost.

4 comments

gavmor 20 hours ago

Take a look at https://chatjimmy.ai/ -- it's running against Taalas' "hardcore" silicon model, i.e. a dedicated, ASIC-like chip.

bikelang 15 hours ago

Wow - actually pretty astonishing how fast their inference is. So fast it feels fake?

qingcharles 13 hours ago

Yeah, when you find fast inference like that it almost feels like the answer arrives before you hit return. Now imagine it running locally with no server round-trip.

adamsmark 12 hours ago

Groq was the preview of the broadband era of LLMs for me. I remember asking a question on the demo site and the answer text showed up near instantly. Far faster than I could read. This was ~1 year ago and pre-acquisition.