Comment by nl

5 days ago

Releasing this on the same day as Taalas's 16,000-tokens-per-second acceleration of the roughly comparable Llama 8B model must hurt!
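For a sense of scale, here is a rough sketch of what 16,000 tokens per second means per token. It assumes a constant decode rate and ignores prompt processing and network time. The 10,000-token response length is only an illustrative example, not a figure from Taalas:

```python
# Back-of-envelope latency math for the quoted 16,000 tokens/s figure.
# Assumes a steady decode rate; ignores prompt processing and network time.
TOKENS_PER_SECOND = 16_000

def per_token_latency_us(tps: float) -> float:
    """Average time per generated token, in microseconds."""
    return 1e6 / tps

def generation_time_s(n_tokens: int, tps: float) -> float:
    """Wall-clock seconds to emit n_tokens at a constant rate."""
    return n_tokens / tps

print(per_token_latency_us(TOKENS_PER_SECOND))       # 62.5
print(generation_time_s(10_000, TOKENS_PER_SECOND))  # 0.625
```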

I wonder how far down they can scale a diffusion LM? I've been playing with in-browser models, and the speed is painful.

https://taalas.com/products/

17 comments

aurareturn 5 days ago

Nothing to do with each other. This is a general optimization. Taalas' is an ASIC that runs a tiny 8B model on SRAM.

But I wonder how Taalas' product can scale. Making a custom chip for one single tiny model is different from running arbitrary models with trillions of parameters for a billion users.

Roughly, that's 53B transistors for every 8B params. For a 2T-param model, you'd need about 13 trillion transistors, assuming scale is linear. One chip uses 2.5 kW of power? That's 4x H100 GPUs. How does it draw so much power?
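The linear extrapolation above can be checked quickly. This sketch uses only the 53B-transistor and 8B-parameter figures from the thread and assumes, as the comment does, that transistor count scales 1:1 with parameter count. Real designs would also change process node, weight precision, and memory layout, so treat the output as an order-of-magnitude estimate:

```python
# Linear extrapolation of transistor count from the HC1 figures quoted above.
# Assumption: transistors scale 1:1 with parameter count (the commenter's own
# simplification; real designs also change process node, precision, etc.).
HC1_TRANSISTORS = 53e9  # transistors per HC1 package
HC1_PARAMS = 8e9        # Llama 8B

TRANSISTORS_PER_PARAM = HC1_TRANSISTORS / HC1_PARAMS  # 6.625

def transistors_for(params: float) -> float:
    """Estimated transistor count for a model with `params` parameters."""
    return params * TRANSISTORS_PER_PARAM

for params in (8e9, 1.5e12, 2e12):
    print(f"{params / 1e9:>6.0f}B params -> {transistors_for(params) / 1e12:.2f}T transistors")
```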

If you assume the frontier model is 1.5 trillion parameters, you'd need an entire N5 wafer-scale chip to run it. And if you need to change something in the model, you can't, since it's physically printed on the chip. So this is something you do only if you know you're going to use this exact model, unchanged, for years.

Very interesting tech for edge inference, though. Robots and self-driving cars could make use of these in the distant future if power draw comes down drastically. A 2.4 kW chip running inside a robot is not realistic; maybe a 150 W chip.

spuz 5 days ago
The 2.5kW figure is for a server running 10 HC1 chips:
> The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. ... Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
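Working through the quoted figures: the 2.5 kW is a whole server, and a single card is about 200 W. The sketch below splits the budget. It assumes the remainder goes to the x86 host, cooling, and everything else; the article does not break that part down:

```python
# Power budget implied by the Next Platform quote (figures from the article).
CARD_WATTS = 200       # "The HC1 card burns about 200 watts"
CARDS_PER_SERVER = 10  # "ten HC1 cards"
SERVER_WATTS = 2_500   # "runs 2,500 watts"

cards_watts = CARD_WATTS * CARDS_PER_SERVER  # power drawn by the HC1 cards
host_watts = SERVER_WATTS - cards_watts      # remainder: presumably host CPUs, fans, etc.

print(f"accelerators: {cards_watts} W, rest of server: {host_watts} W")
print(f"server power per card slot: {SERVER_WATTS / CARDS_PER_SERVER:.0f} W")
```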
- aurareturn 5 days ago
  
  I’m confused then. They need 10 of these to run an 8B model?
  
nl 5 days ago

> Our second model, still based on Taalas’ first-generation silicon platform (HC1), will be a mid-sized reasoning LLM. It is expected in our labs this spring and will be integrated into our inference service shortly thereafter.
> Following this, a frontier LLM will be fabricated using our second-generation silicon platform (HC2). HC2 offers considerably higher density and even faster execution. Deployment is planned for winter.
From https://taalas.com/the-path-to-ubiquitous-ai/
Personally, I think anything around the level of Sonnet 4.5 is worth burning into silicon, because agentic workflows work. There are plenty of places where spending $50,000 on that makes sense (I have no idea of the pricing, though).

LASR 5 days ago

Just tried this. Holy fuck.

I'd take an army of high-school graduate LLMs to build my agentic applications over a couple of genius LLMs any day.

This is a whole new paradigm of AI.

stavros 5 days ago
A billion stupid LLMs don't make a smart one, they just make one stupid LLM that's really fast at stupidity.
- abeppu 5 days ago
  
  I think maybe there are subsets of problems where you can have either a human or a smart LLM write a verifier (e.g. a property-based test?) and a performance measurement, and let the dumb models generate and iterate on candidates?
  
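One way to picture that generate-and-verify loop: a minimal sketch where a cheap generator proposes candidate sort functions, a property-based verifier rejects wrong ones, and a timing check picks the fastest survivor. The candidate pool and `propose` are hypothetical stand-ins for calls to a small, fast model, and sorting is only an example task:

```python
import random
import time
from collections import Counter
from typing import Callable, List, Optional

Candidate = Callable[[List[int]], List[int]]

# Hypothetical candidates a small, fast model might emit for "sort this list".
def cand_builtin(xs: List[int]) -> List[int]:
    return sorted(xs)

def cand_identity(xs: List[int]) -> List[int]:
    return xs  # wrong: never sorts

def cand_reverse(xs: List[int]) -> List[int]:
    return sorted(xs, reverse=True)  # wrong: descending order

def cand_bubble(xs: List[int]) -> List[int]:
    xs = list(xs)  # correct but slow O(n^2) bubble sort
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

POOL: List[Candidate] = [cand_builtin, cand_identity, cand_reverse, cand_bubble]

def propose(rng: random.Random) -> Candidate:
    """Placeholder for one call to a cheap model; here it just samples the pool."""
    return rng.choice(POOL)

def verify(cand: Candidate, trials: int = 200, seed: int = 0) -> bool:
    """Property-based check: output is ordered and a permutation of the input."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        try:
            out = cand(list(xs))
        except Exception:
            return False
        ordered = all(a <= b for a, b in zip(out, out[1:]))
        if not ordered or Counter(out) != Counter(xs):
            return False
    return True

def measure(cand: Candidate, size: int = 1000, seed: int = 1) -> float:
    """Wall-clock seconds for one call on a fixed random input."""
    rng = random.Random(seed)
    data = [rng.randint(0, 10**6) for _ in range(size)]
    start = time.perf_counter()
    cand(data)
    return time.perf_counter() - start

def search(budget: int = 50, seed: int = 42) -> Optional[Candidate]:
    """Propose candidates, keep those that pass `verify`, return the fastest."""
    rng = random.Random(seed)
    best: Optional[Candidate] = None
    best_time = float("inf")
    tried = set()
    for _ in range(budget):
        cand = propose(rng)
        if cand in tried:
            continue  # skip duplicates so each candidate is verified once
        tried.add(cand)
        if not verify(cand):
            continue
        elapsed = measure(cand)
        if elapsed < best_time:
            best, best_time = cand, elapsed
    return best

best = search()
print(best.__name__ if best else "no candidate passed")
```

The verifier checks properties (ordering plus multiset equality) instead of comparing against a stored answer, so the "smart" side only has to write the check once while the cheap generator does the volume work.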
turnsout 5 days ago

Man, I'm in the exact opposite camp. 1 smart model beats 1000 chaos monkeys any day of the week.
esafak 5 days ago
What did you try and how?
- root_axis 5 days ago
  
  https://chatjimmy.ai
  
tokenless 5 days ago

When that generates 10k of output slop with less latency than my web server doing some CRUD shit... amazing!

small_model 5 days ago

This is exceptionally fast (almost instant). What's the catch? The answer was there before I lifted the return key!

precompute 5 days ago

This is crazy!