Comment by dust42
4 days ago
This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.
Tech summary:
- 15k tok/sec on 8B dense 3bit quant (llama 3.1)
- limited KV cache
- 880mm^2 die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website, I am not affiliated. The founders have 25 years of experience across AMD, Nvidia and others, with $200M in VC funding so far.
Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor for Nvidia but probably for 5-10% of the market.
Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
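The napkin math above can be written out as a quick sketch. The $0.20/mm^2 figure and the 880mm^2 die come from the comment; treating one die as holding the whole 8B model, using a simple Poisson yield model, and the defect density value are assumptions for illustration only, not numbers from Taalas or TSMC.

```python
import math

COST_PER_MM2_USD = 0.20   # napkin estimate from the comment
DIE_AREA_MM2 = 880        # die size from the tech summary
PARAMS_BILLIONS = 8       # assumption: one die holds the whole 8B model


def die_cost_usd(area_mm2: float) -> float:
    """Raw silicon cost of one die, ignoring yield, packaging and test."""
    return area_mm2 * COST_PER_MM2_USD


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Simple Poisson yield model: Y = exp(-A * D0), with A in cm^2."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


raw_cost = die_cost_usd(DIE_AREA_MM2)
cost_per_billion = raw_cost / PARAMS_BILLIONS

# Placeholder defect density (defects per cm^2), chosen only to show the trend.
D0 = 0.1
small_yield = poisson_yield(100, D0)
big_yield = poisson_yield(DIE_AREA_MM2, D0)

print(f"raw die cost: ${raw_cost:.0f}, per 1B params: ${cost_per_billion:.0f}")
print(f"yield at 100mm^2: {small_yield:.2f}, at {DIE_AREA_MM2}mm^2: {big_yield:.2f}")
```

With the comment's inputs this gives about $176 of raw silicon per die, or about $22 per 1B parameters, in line with the ~$20 quoted. The yield numbers only illustrate the direction of the effect, since the real defect density is not public.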
Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:
1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.
2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing 12k tokens/second on Llama 2 13B FP8. Knowing these architectures, that’s likely a 100+ ish batched run, meaning time to first token is almost certainly slower than Taalas. Probably much slower, since Taalas is like milliseconds.
3) Jensen has these pareto curve graphs — for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process vs 4nm process is likely 30-40% bigger, draws that much more power, etc; if we look at the numbers they give and extrapolate to an fp8 model (slower), smaller geometry (30% faster and lower power) and compare 16k tokens/second for taalas to 12k tokens/s for an h200, these chips are in the same ballpark curve.
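To make point 2 concrete, here is a rough sketch of aggregate throughput versus per-stream speed. The 12k and 16k tokens/second figures are from the comments above; the batch size of 100 is the commenter's guess, and treating the Taalas figure as a single stream is an assumption.

```python
def per_stream_rate(aggregate_tok_s: float, batch_size: int) -> float:
    """Tokens/second seen by one request when throughput is shared evenly."""
    return aggregate_tok_s / batch_size


def ms_per_token(tok_s: float) -> float:
    """Average time between tokens for one stream, in milliseconds."""
    return 1000.0 / tok_s


h200_stream = per_stream_rate(12_000, batch_size=100)   # batch size is a guess
taalas_stream = per_stream_rate(16_000, batch_size=1)   # assumed single stream

print(f"H200:   {h200_stream:.0f} tok/s per stream, {ms_per_token(h200_stream):.2f} ms/token")
print(f"Taalas: {taalas_stream:.0f} tok/s per stream, {ms_per_token(taalas_stream):.4f} ms/token")
```

This only covers per-token decode speed. Time to first token also depends on prefill and queueing, which this sketch does not model.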
However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.
Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.
I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.
> any factor of 10 being a new science / new product category,
I often remind people two orders of quantitative change is a qualitative change.
> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.
The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.
While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.
I'm concerned with the environmental impact. Chip manufacture is not very clean and these chips will need to be swapped out and replaced at a cadence higher than we currently do with GPUs.
Having dabbled in VLSI in the early-2010s, half the battle is getting a manufacturing slot with TSMC. It’s a dark art with secret handshakes. This demonstrator chip is an enormous accomplishment.
There might be a food chain of lower-order uses when they become "obsolete".
I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.
Agree that routing is becoming the critical layer here. Vllm iris is really promising for this https://blog.vllm.ai/2026/01/05/vllm-sr-iris.html
There's already some good work on router benchmarking which is pretty interesting
At 16k tokens/s why bother routing? We're talking about multiple orders of magnitude faster and cheaper execution.
Abundance supports different strategies. One approach: Set a deadline for a response, send the turn to every AI that could possibly answer, and when the deadline arrives, cancel any request that hasn't yet completed. You know a priori which models have the highest quality in aggregate. Pick that one.
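The deadline approach described above could look roughly like this. It is a minimal sketch using Python's asyncio; `call_model`, the model names, the quality ranks, and the latencies are all hypothetical stand-ins, not a real provider API.

```python
import asyncio
from typing import List, Optional, Tuple


async def call_model(name: str, latency_s: float) -> str:
    """Hypothetical stand-in for a real model API call."""
    await asyncio.sleep(latency_s)
    return f"answer from {name}"


async def best_before_deadline(
    models: List[Tuple[str, int, float]], deadline_s: float
) -> Optional[Tuple[str, str]]:
    """Fan out to every model, then keep the best answer that finished in time.

    models holds (name, quality_rank, simulated_latency_s); higher rank is better.
    """
    tasks = {
        asyncio.create_task(call_model(name, latency)): (name, quality)
        for name, quality, latency in models
    }
    done, pending = await asyncio.wait(list(tasks), timeout=deadline_s)
    for task in pending:
        task.cancel()  # deadline reached: drop anything still running
    if not done:
        return None
    best = max(done, key=lambda t: tasks[t][1])  # known a priori quality ranking
    return tasks[best][0], best.result()


models = [("big", 3, 0.5), ("mid", 2, 0.05), ("small", 1, 0.01)]
result = asyncio.run(best_before_deadline(models, deadline_s=0.2))
print(result)
```

Here the highest-ranked model misses the deadline and is cancelled, so the answer from the next best model that finished is returned.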
I came across this yesterday. Haven't tried it, but it looks interesting:
https://agent-relay.com/
> speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious
Can we use older (previous generation, smaller) models as a speculative decoder for the current model? I don't know whether the randomness in training (weight init, data ordering, etc) will affect this kind of use. To the extent that these models are learning the "true underlying token distribution" this should be possible, in principle. If that's the case, speculative decoding is an elegant vector to introduce this kind of tech, and the turnaround time is even less of a problem.
> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.
They may be using Rapidus, which is a Japanese government backed foundry built around all single wafer processing vs traditional batching. They advertise ~2 month turnaround time as standard, and as short as 2 weeks for priority.
For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?
My understanding as well is that speculative decoding only works with a smaller quant of the same model. You're using the faster sampling of the smaller model's representation of the larger model's weights in order to attempt to accurately predict its token output. This wouldn't work cross-model as the token probabilities are completely different.
I think they’d commission a quant directly. Benefits go down a lot when you leave model families.
Think about this for solving questions in math where you need to explore a search space. You can run 100 of these for the same cost and time as one API call to OpenAI.
The guts of an LLM aren't something I'm well versed in, but
> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model
suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?
Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence of tokens from scratch, because validation can take more advantage of parallel processing. So the process is generate with small model -> validate with big model -> then generate with big model only if validation fails
More info:
* https://research.google/blog/looking-back-at-speculative-dec...
* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...
Verification is faster than generation, one forward pass for verification of multiple tokens vs a pass for every new token in generation
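The generate, validate, fall back loop described above can be sketched with toy models. Everything here is hypothetical: `target_model` and `draft_model` are made-up deterministic functions standing in for the big and small LLMs, decoding is greedy, and the single batched verification pass of a real system is simulated with a plain loop and just counted.

```python
from typing import List, Tuple


def target_model(prefix: List[int]) -> int:
    """Toy 'big' model: greedy next token as a function of the prefix."""
    return (sum(prefix) * 3 + len(prefix)) % 7


def draft_model(prefix: List[int]) -> int:
    """Toy 'small' model: agrees with the target except at some positions."""
    token = target_model(prefix)
    return token if len(prefix) % 5 else (token + 1) % 7


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_model(seq))
    return seq


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_model(seq + draft))
        # 2) One verification pass: the target scores all k+1 positions at once.
        target_passes += 1
        verified = [target_model(seq + draft[:i]) for i in range(k + 1)]
        # 3) Accept the longest prefix where draft and target agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        seq.extend(draft[:n_ok])
        # The target's own token fixes the first mismatch (or is a free extra).
        seq.append(verified[n_ok])
    return seq[: len(prompt) + n_new], target_passes


out, passes = speculative_decode([1, 2, 3], n_new=20)
print(out == greedy_decode([1, 2, 3], 20), passes)
```

The output matches plain greedy decoding with the big model, but the big model is invoked far fewer times than once per token. That is where the speedup would come from if a verification pass costs about as much as a single generation step.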
I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...
When you predict with the small model, the big model can verify as more of a batch and be more similar in speed to processing input tokens, if the predictions are good and it doesn't have to be redone.
They are referring to a thing called "speculative decoding" I think.
Most importantly this opens up an amazing future where we get the real version of the classic science fiction MacGuffin of a physical AI chip. Pair this with several TB of flash storage and you have persistent artificial consciousness that can be carried around with you. Bonus points if it's quirky, custom-trained and the chip is one of a kind that you stole from an evil corporation. Additional bonus points if the packaging is such that it's small enough to plug into the USB-C port on your smart glasses and has an eBPF module it can leverage to see what you're doing and talk to you in real time about your actions.
I enjoy envisioning futures more whimsical than "the bargain-basement LLM provider that my insurance company uses denied my claim because I chose badly-vectored words".
At $20 a die, they could sell Game Boy style cartridges for different models.
Okay, now _this_ is the cyberpunk future I asked for.
That would be very cool, get an upgraded model every couple of months. Maybe PCIe form factor.
Yes, and even holding a couple of cartridges for different scenarios, e.g. image generation, coding, TTS/STT, etc.
Make them shaped like floppy disks to confuse the younger generations.
Microsoft
dude that would be so incredibly cool
> Certainly interesting for very low latency applications which need < 10k tokens context.
I’m really curious if context will really matter if using methods like Recursive Language Models[0]. That method is suited to break down a huge amount of context into smaller subagents recursively, each working on a symbolic subset of the prompt.
The challenge with RLM seemed like it burned through a ton of tokens to trade for more accuracy. If tokens are cheap, RLM seems like it could be beneficial here to provide much more accuracy over large contexts despite what the underlying model can handle
0. https://arxiv.org/abs/2512.24601
Don’t forget that the 8B model requires 10 of said chips to run.
And it’s a 3bit quant. So 3GB ram requirement.
If they run 8B using native 16bit quant, it will use 60 H100 sized chips.
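The footprint arithmetic behind these numbers, as a rough sketch. The 8B parameter count and 3-bit quant are from the thread; the 10-chip figure is the claim made above (questioned in the replies), and scaling the chip count linearly with bits per weight is an assumption.

```python
def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Storage needed for the weights alone (no KV cache, no activations)."""
    return n_params * bits_per_weight / 8


N_PARAMS = 8e9
CHIPS_AT_3BIT = 10  # claimed in this thread, not confirmed

gb_3bit = weight_bytes(N_PARAMS, 3) / 1e9
gb_16bit = weight_bytes(N_PARAMS, 16) / 1e9

# Assumption: chip count grows linearly with bits per weight.
chips_at_16bit = CHIPS_AT_3BIT * 16 / 3

print(f"3-bit: {gb_3bit:.0f} GB, 16-bit: {gb_16bit:.0f} GB, ~{chips_at_16bit:.0f} chips at 16-bit")
```

Under that linear assumption a 16-bit version lands at roughly 53 chips, the same ballpark as the 60 quoted above.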
> Don’t forget that the 8B model requires 10 of said chips to run.
Are you sure about that? If true it would definitely make it look a lot less interesting.
Their 2.4 kW is for 10 chips it seems based on the next platform article.
I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
Here we go towards really smart robots. It will be interesting to see what kinds of different model chips they can produce.
There is nothing smart about current LLMs. They just regurgitate text compressed in their memory based on probability. None of the LLMs currently have actual understanding of what you ask them to do and what they respond with.
If LLMs just regurgitate compressed text, they'd fail on any novel problem not in their training data. Yet, they routinely solve them, which means whatever's happening between input and output is more than retrieval, and calling it "not understanding" requires you to define understanding in a way that conveniently excludes everything except biological brains.
We know that, but that does not make them useless. The opposite in fact: they are extremely useful in the hands of non-idiots. We just happen to have an oversupply of idiots at the moment, which AI is here to eradicate. /Sort of satire.
So you are saying LLMs just copy some training data back to you? Why do we spend so much money training and running them if they "just regurgitate text compressed in their memory based on probability"? Billions of dollars to build a lossy grep.
I think you are confused about LLMs. They take in context, and that context makes them generate new things. For existing things we have cp. By your logic pianos can't be creative instruments because they just produce the same 88 notes.
I have a gut feeling that a huge portion of the deficiencies we note with AI is just a reflection of the training data. For instance, the wiki/reddit/etc. internet is just a soup of human descriptions of the world model, not the actual world model itself. There are gaps or holes in the knowledge because the codified summary of the world is what is remarkable to us humans, not a 100% faithful, comprehensive description of the world. What is obvious to us humans with lived real-world experience often does not make it into the training data. A simple, demonstrable example is whether one should walk or drive to the car wash.
That's not how they work. Pro tip: maybe don't comment until you have a good understanding?
Just HI slop. Ask any decent model, it can explain what's wrong with this description.
> 880mm^2 die
That's a lot of surface, isn't it? As big as an M1 Ultra (2x M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (820mm² on TSMC N7) or H100 (814mm² on TSMC N5).
> The larger the die size, the lower the yield.
I wonder if that applies? What's the big deal if a few parameters have a few bit flips?
> I wonder if that applies? What's the big deal if a few parameters have a few bit flips?
We get into the sci-fi territory where a machine achieves sentience because it has all the right manufacturing defects.
Reminds me of this https://en.wikipedia.org/wiki/A_Logic_Named_Joe
Also see Adrian Thompson's Xilinx 6200 FPGA, programmed by a genetic algorithm that worked but exploited nuances unique to that specific physical chip, meaning the software couldn't be copied to another chip. https://news.ycombinator.com/item?id=43152877
2000s movie line territory:
> There have always been ghosts in the machine. Random segments of code, that have grouped together to form unexpected protocols.
An on-device reasoning model with that kind of speed and cost would completely change the way people use their computers. It would be closer to Star Trek than anything else we've ever had. You'd never have to type anything or use a mouse again.
Hardware decoders make sense for fixed codecs like MPEG, but I can't see it making sense for small models that improve every 6 months.
K-V caches are large, but hidden states aren't necessarily that large. And if you can run a model once ridiculously fast, then you can loop it repeatedly and still be fast. So I wonder about the 'modern RNNs' like RWKV here...
There’s a bit of a hidden cost here… the longevity of GPU hardware is going to be longer, it’s extended every time there’s an algorithmic improvement. Whereas any efficiency gains in software that are not compatible with this hardware will tend to accelerate their depreciation.
There is nothing new here. This has been demonstrated several times by previous researchers:
https://arxiv.org/abs/2511.06174
https://arxiv.org/abs/2401.03868
For a real world use case, you would need an FPGA with terabytes of RAM. Perhaps it'll be off-chip HBM. But for large models, even that won't be enough. Then you would need to figure out an NVLink-like interconnect for these FPGAs. And we are back to square one.
This is new. You are citing FPGA prototypes. Those papers do not demonstrate the same class of scaling or hardware integration that Taalas is advocating. For one, the FPGA solutions typically use fixed multipliers (or lookup tables), while the ASIC solution has more freedom to optimize routing for 4 bit multiplication.
Maybe they can stack LLM parameters in 200 layers like 3D NAND flash and make the chip very small ...
Do not overlook traditional irrational investor exuberance, we've got an abundance of that right now. With the right PR maneuvers these guys could be a tulip craze.
This is insane if true - could be super useful for data extraction tasks. Sounds like we could be talking in the cents per millions of tokens range.
Yeah, it's fast af but it very quickly loses context/hallucinates in my own tests with large chunks of text
Doesn't the blog state that it's now 4bit (the first gen was 3bit + 6bit)?
Sounds perfect for use in consumer devices.
It's weird to me to train such huge models and then destroy them by using 3-bit quantization on what are presumably 16-bit (bfloat16) weights. Why not just train smaller models then?
Low-latency inference is a huge waste of power; if you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.
Low latency inference is very useful in voice-to-voice applications. You say it is a waste of power but at least their claim is that it is 10x more efficient. We'll see but if it works out it will definitely find its applications.
This is not voice-to-voice though, end-to-end voice chat models (the Her UX) are completely different.
I think it's really useful for agent to agent communication, as long as context loading doesn't become a bottleneck. Right now there can be noticeable delays under the hood, but at these speeds we'll never have to worry about latency when chain calling hundreds or thousands of agents in a network (I'm presuming this is going to take off in the future). Correct me if I'm wrong though.