Comment by Alifatisk

5 days ago

What's happening in the comment section? How come so many cannot understand that this is running Llama 3.1 8B? Why are people judging its accuracy? It's an almost two-year-old 8B param model, why are people expecting Opus-level responses!?

The focus here should be on the custom hardware they are producing and its performance, that is what's impressive. Imagine putting GLM-5 on this, that'd be insane.

This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs, they are building what they call a dLLM, a diffusion-based LLM. The speed is still impressive when I play around with it sometimes. But this, this is something else, it's almost unbelievable. As soon as I hit the enter key, the response appears, it feels instant.

I am also curious about Taalas pricing.

> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.

Do we have an idea of how much a unit / inference / api will cost?

Also, considering how fast people switch models to keep up with the pace, is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw out the current hardware and buy another one? Shouldn't there be a more flexible way? Maybe only having to swap the chip on top, like how people upgrade CPUs. I don't know, just thinking out loud.

They don't give cost figures in their blog post but they do here:

https://www.nextplatform.com/wp-content/uploads/2026/02/taal...

Probably they don't know what the market will bear and want to do some exploratory pricing, hence the "contact us" API access form. That's fair enough. But they're claiming orders of magnitude cost reduction.

> Is there really a potential market for hardware designed for one model only?

I'm sure there is. Models are largely interchangeable, especially at the low end. There are lots of use cases where you don't need super smart models but cheapness and speed can matter a lot.

Think about a simple use case: a company has a list of one million customer names but no information about gender or age. They'd like to get a rough understanding of this. Mapping name -> guessed gender and a rough age is a simple problem for even dumb LLMs. I just tried it on ChatJimmy and it worked fine. For this kind of exploratory data problem you really benefit from mass parallelism, low cost and low latency.
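To make that concrete, here's a minimal sketch of the fan-out pattern: one tiny, independent prompt per name, run in parallel. The `classify_name` body is stubbed (Taalas hasn't published an API yet, so the endpoint and response format are assumptions); in practice it would be a single HTTP call per name.

```python
import concurrent.futures

def classify_name(name: str) -> dict:
    """Guess gender and rough age for one first name.

    Stubbed so the sketch is self-contained; in reality this would
    send a short prompt like
      "Given the first name {name}, guess gender (M/F/unknown) and a
       rough age range. Answer as JSON."
    to whatever inference API is available (hypothetical here).
    """
    return {"name": name, "gender": "unknown", "age": "unknown"}

def classify_all(names: list[str], workers: int = 64) -> list[dict]:
    # Each request is small and independent, so throughput scales
    # with whatever concurrency the backend allows -- exactly the
    # workload a cheap, low-latency chip is good for.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify_name, names))

results = classify_all(["Alice", "Bob", "Priya"])
```

The point is just that nothing in the loop needs a smart model; it needs a million cheap, fast calls.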

> Shouldn't there be a more flexible way?

The whole point of their design is to sacrifice flexibility for speed, although they claim to support fine-tunes via LoRAs. LLMs are already supremely flexible, so it probably doesn't matter.

  • Yes, there are all kinds of fuzzy NLP tasks that this would be great for. Jobs where you can chunk the text into small units and add instructions and only need a short response. You could burn through huge data sets very quickly using these chips.

That is my concern too. A chip optimised for a model or specific model architecture will not be useful for long.

  • I just tried the demo and I think this is huge! If they manage to build a chip in 2 or 3 years that can run something like Opus 4.6 or even Sonnet at that speed, the disruption in the world of software development will be bigger than what we saw in the last 3-5 years. LLMs today are somewhat useful, but they are still too slow and expensive for a meaningful ralph loop. Being able to run those loops (or "thinking", if you want to call it that) much faster will enable a lot of stuff that is not feasible today. Writing things like openclaw will not take weeks, but hours. Maybe even rewriting entire tools, kernels or OSes will be feasible, because the LLM can run through almost endless tries.

    Speed and cost win over quality, and this will also be true for LLMs.

I personally don't buy it, Cerebras is way more advanced than this; comparing this tok/s to Cerebras is disingenuous.

  • Cerebras is a totally different product though. They can (theoretically) run any frontier model provided it gets compiled a certain way. Like a wafer scale TPU.

    This is using hardwired weights, with on-die SRAM used for the KV cache, for example. It's WAY more power efficient and faster. The tradeoff is that it's hardwired.

    Still, most frontier models are "good enough" that an obscenely fast version would be a major seller.

If it's so easy to do custom silicon for any model (they say only 2 months), why didn't they demo one of the newer DeepSeek models instead? Using a two-year-old model looks so bad. I'm not buying it.