Comment by skybrian
20 hours ago
Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
This is something I've been thinking about for a while...the current state of things really does feel kind of like the dialup era, wondering what the "broadband" era could look like. Watching tokens stream in is reminiscent of watching a jpeg load a few rows of pixels at a time, and the various different loading and connecting animations that applications implemented before things got fast enough to make them less relevant.
Some of the work in that direction like Cerebras or Taalas have been doing is an interesting glimpse of what might be possible. In the meantime it's a fun thought experiment to wonder about what might be possible if even current state of the art models were available at like, a million tokens per second at a very low cost.
Take a look at https://chatjimmy.ai/ -- it's running against Taalas' "hardcore" silicon model, i.e. a dedicated, ASIC-like chip.
Wow - actually pretty astonishing how fast their inference is. So fast it feels fake?
1 reply →
Groq was the preview of the broadband era of LLMs for me. I remember asking a question on the demo site and the answer text showed up near instantly. Far faster than I could read. This was ~1 year ago and pre-acquisition.
You're right about it being reminiscent of the dial-up era, but I don't believe it's 300 to 1200; it's more like 4800:
Modem vs Claude according to Claude:
300 @ 2368 characters - 1m 19s
1200 @ 2368 characters - 19.7s
2400 @ 2368 characters - 9.9s
14.4K @ 2368 characters - 1.6s
33.6K @ 2368 characters - 705 ms
56K @ 2368 characters - 447 ms
Claude @ 2368 characters - 7.9s
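The arithmetic above is easy to reproduce. A quick sketch, assuming the usual 10 bits per character of 8N1 serial framing, and treating "56K" as the ~53 kbps practical ceiling those modems actually hit:

```python
# Time to send 2368 characters at classic modem speeds,
# assuming 10 bits per character (8 data bits + start/stop bits).
CHARS = 2368
BITS_PER_CHAR = 10

# "56K" is modeled as ~53 kbps, the practical cap in deployment,
# which is what produces the ~447 ms figure above.
rates_bps = {
    "300": 300,
    "1200": 1200,
    "2400": 2400,
    "14.4K": 14_400,
    "33.6K": 33_600,
    "56K": 53_000,
}

for name, bps in rates_bps.items():
    secs = CHARS * BITS_PER_CHAR / bps
    print(f"{name:>6}: {secs:.2f} s")

# Claude's observed 7.9 s for the same text works out to an
# effective line rate of roughly 3000 bps -- between 2400 and 4800.
print(f"Claude: {CHARS * BITS_PER_CHAR / 7.9:.0f} bps effective")
```

Which is where the "more like 4800" intuition comes from: ~3000 bps effective sits between the 2400 and 4800 tiers.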
Check chatjimmy.ai
https://chatjimmy.ai is a demo of the "burn the model to an ASIC" approach sold by Taalas[0], which they use to run Llama 3.1 8B at ~17,000 tokens per second.
[0] - https://taalas.com/products/
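In the thread's modem framing, 17,000 tokens per second is well past dial-up. A rough conversion, assuming the common ~4 characters per token heuristic for English text (not a measured figure for Llama 3.1 8B) and the same 10-bits-per-character serial framing:

```python
# Back-of-envelope: express 17,000 tokens/second as a modem-style
# line rate. CHARS_PER_TOKEN is a rough English-text heuristic,
# not a measured value for this model.
TOKENS_PER_SEC = 17_000
CHARS_PER_TOKEN = 4   # heuristic assumption
BITS_PER_CHAR = 10    # 8N1 serial framing, as with a modem

effective_bps = TOKENS_PER_SEC * CHARS_PER_TOKEN * BITS_PER_CHAR
print(f"~{effective_bps / 1000:.0f} kbps")  # → ~680 kbps
```

Around 680 kbps: roughly early-broadband territory, an order of magnitude past any dial-up modem.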
Not to downplay their accomplishment, but Llama 3.1 8B is a terrible model; it's really outdated at this point. It's cool that they were able to accelerate a model with silicon, but it also feels wasteful since Llama 8B is such a useless model?
3 replies →
There was a startup posted here which built custom hardware that let the AI respond instantly. Thousands of tokens per second.
Taalas. A sibling comment of yours posted the chat demo URL -
https://chatjimmy.ai/
Woah. How is this working? It's stupid fast.
1 reply →
cerebras
They built a wafer-scale ASIC: the entire wafer is one huge active die. It takes a lot of clever engineering (and serious cooling) to make it work, and it's very cool.
Groq.
No, it was a custom ASIC chip with the weights for a single model baked in. I do envision a future where we return to cartridges: local AI becomes the default, and massively optimised plug-and-play chips each run a single SoTA model.
2 replies →