Comment by ydj

1 day ago

80 tps with a 5080 + 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps utilizing all three for qwen3.6 27b q8. Guess I’ve got more optimization to do.

Would like to see the perf of their setup with and without MTP and n-gram speculative decoding though, as well as parallel decode performance (once llama.cpp MTP plays well with multiple slots).

Being in California, though, the electricity cost alone makes this uncompetitive with just paying a cloud provider.
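For anyone who wants to check the electricity argument against their own numbers, here is a rough sketch of the arithmetic. The 30 tps figure is the one quoted above; the 600 W draw and $0.40/kWh rate are placeholder assumptions, not measurements, and the function name is made up for illustration.

```python
def electricity_cost_per_million_tokens(watts: float,
                                        usd_per_kwh: float,
                                        tokens_per_second: float) -> float:
    """USD of electricity needed to generate one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second   # time to produce 1M tokens
    kwh = watts * seconds / 3_600_000         # watt-seconds -> kilowatt-hours
    return kwh * usd_per_kwh


# Placeholder inputs: 600 W wall draw and $0.40/kWh are assumptions.
cost = electricity_cost_per_million_tokens(watts=600,
                                           usd_per_kwh=0.40,
                                           tokens_per_second=30)
print(f"${cost:.2f} per million tokens")
```

This only covers single-stream decode. Batching several requests raises total tokens per second for roughly the same power draw, so the per-token cost drops accordingly.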

That’s the cost of using a new hardware provider. A single RTX Pro 6000 Blackwell Max-Q will do better than that and be much more usable. I have 2 running DS4 Flash at 160 tok/s with max num seqs 4.

Very interesting though, these Tenstorrent chips. Might get one to experiment with.

  • Yeah that’s definitely the smarter buy if you want to just have models running quickly. But the cost of 2 p150 and a 4090 was <$5000 for me.

    The main issue is the immature software, and somewhat baroque way of writing kernels. Please, buy one and join us.

    • Were you able to connect the two P150s using the QSFP-DD cable? They only sell 4x and 8x topologies so I’m curious if that worked for you. Are you able to run them in tensor parallel?

I get 28tps for Qwen3.6 27B on a Ryzen AI Max 395+, with enough spare memory to run another two small models on the side. 60tps for 35B. Am surprised this is not more common.

Do you get anything useful out of your 4090 (I have one too)? Local cloud sounds like a fun idea but I just don’t see how it competes against OpenAI/Anthropic.

  • I think it’s not really worth it compared to just buying tokens or a coding plan.

    My setup has the 4090 handling attention while the TT accelerators handle the MLP. With just a 4090 you can have the CPU handle the MLP layers and use a MoE model, assuming a sufficiently powerful CPU. I tried that setup with minimax 2.5 before, and was able to eke out around 10 to 15 tps (albeit with a 7965wx CPU).
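The attention/MLP split described above can be pictured as a placement table. This is only an illustrative sketch: `SubLayer`, `build_layers`, `place`, and `count_handoffs` are hypothetical names, not part of any real runtime, and the device labels are just strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubLayer:
    block: int   # index of the transformer block
    kind: str    # "attention" or "mlp"


def build_layers(n_blocks: int) -> list[SubLayer]:
    """Execution order: each block runs attention, then its MLP."""
    return [SubLayer(b, k) for b in range(n_blocks) for k in ("attention", "mlp")]


def place(layers: list[SubLayer], placement: dict[str, str]) -> dict[SubLayer, str]:
    """Map every sublayer to a device based only on its kind."""
    return {layer: placement[layer.kind] for layer in layers}


def count_handoffs(layers: list[SubLayer], plan: dict[SubLayer, str]) -> int:
    """Count how often consecutive sublayers sit on different devices."""
    devices = [plan[layer] for layer in layers]
    return sum(1 for a, b in zip(devices, devices[1:]) if a != b)


layers = build_layers(4)
split = place(layers, {"attention": "gpu", "mlp": "tt"})
print(count_handoffs(layers, split))
```

Every handoff means moving activations between devices, so a split like this trades extra transfers for the ability to fit a bigger model.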

How is the software compatibility with the Tenstorrent cards? Are you stuck using vendor-supplied runtimes/models?

It's surprising how little these things come up given the price they go for

  • The software stack is pretty immature, definitely very DIY. Their officially supported models are pretty old at this point, though there’s community support for gemma4, and support for models with GDN, like qwen3.6, is supposedly very close.

    The entire stack (minus some binary blobs in firmware) is open source, so if you have the time and persistence you can get whatever you want done.

    A few community members have been working on llama.cpp support, where supported operations are offloaded to the TT cards while unsupported ops run on the GPU or CPU. llama.cpp is pretty good at that. The existing kernels could definitely be better, and I’ll try my hand at writing some kernels some time.
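The offload-with-fallback idea above boils down to "give each op to the first backend that supports it." A toy sketch of that rule follows. It is not llama.cpp's actual scheduler, and `assign_backends`, the op names, and the support sets are all made up for illustration.

```python
def assign_backends(ops: list[str],
                    backends: list[tuple[str, set[str]]]) -> list[tuple[str, str]]:
    """Assign each op to the first backend, in priority order, that supports it."""
    plan = []
    for op in ops:
        for name, supported in backends:
            if op in supported:
                plan.append((op, name))
                break
        else:
            # No backend in the list can run this op.
            raise ValueError(f"no backend supports {op!r}")
    return plan


# Hypothetical support sets: accelerator first, then GPU, then CPU as catch-all.
backends = [
    ("tt", {"matmul"}),
    ("gpu", {"matmul", "softmax"}),
    ("cpu", {"matmul", "softmax", "rope"}),
]
print(assign_backends(["matmul", "softmax", "rope"], backends))
```

In this toy version, adding an op to the `tt` set is what "writing a new kernel" amounts to: the op stops falling through to the GPU or CPU.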