
Comment by SamDc73

3 days ago

This older HN thread shows R1 running on a ~$2k box using ~512 GB of system RAM, no GPU, at ~3.5-4.25 TPS: https://news.ycombinator.com/item?id=42897205

If you scale that setup and add a couple of used RTX 3090s with heavy memory offloading, you can technically run something in the K2 class.

Is 4 TPS actually useful for anything?

That's around 350,000 tokens in a day. I don't track my Claude/Codex usage, but Kilocode with the free Grok model does, and I'm using between 3.3M and 50M tokens a day (plus additional usage in Claude, Codex, Mistral Vibe, and Amp Coder).
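The ~350k/day figure is just the decode speed extended over 24 hours; a quick sketch (assuming nonstop generation with no idle time):

```python
# Back-of-the-envelope: tokens per day at a given decode speed,
# assuming the box generates continuously with zero idle time.
def tokens_per_day(tokens_per_second: float) -> int:
    return int(tokens_per_second * 60 * 60 * 24)

print(tokens_per_day(4.0))  # 345600 -- roughly the ~350k cited
print(tokens_per_day(3.5))  # 302400 at the low end of the range
```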

I'm trying to imagine a use case where I'd want this. Maybe running some small coding task overnight? But it just doesn't seem very useful.

  • 3.5-50M tokens a day? What are you doing with all those tokens?

    Yesterday I asked Claude to write one function. I didn't ask it to do anything else because it wouldn't have been helpful.

    • Here’s my own stats, for comparison: https://news.ycombinator.com/item?id=46216192

      Essentially migrating codebases, implementing features, as well as all of the referencing of existing code and writing tests and various automation scripts that are needed to ensure that the code changes are okay. Over 95% of those tokens are reads, since often there’s a need for a lot of consistency and iteration.

      It works pretty well if you’re not limited by a tight budget.

  • I only run small models (a 70B on my hardware gets me around 10-20 TPS), just for random things (personal-assistant kind of stuff), not for coding tasks.

    For coding-related tasks I consume 30-80M tokens per day, and I want something as fast as it gets.

Stop recommending 3090s; they're all but obsolete now. Not having native BF16 is a showstopper.

  • Hard disagree. The difference in performance is not something you'll notice if you actually use these cards. In AI benchmarks, the RTX 3090 beats the RTX 4080 SUPER despite the latter's native BF16 support; memory bandwidth (936 GB/s on the 3090 vs 736 GB/s on the 4080 SUPER) plays a major role. Additionally, the 3090 is the last NVIDIA consumer card to support NVLink/SLI.

    It's also unbeatable in price-to-performance, as the next-best 24 GiB card would be the 4090, which even used is almost triple the price these days while only offering about 25-30% more performance in real-world AI workloads.

    You can basically get an SLI-linked dual 3090 setup for less money than a single used 4090 and get about the same or even more performance and double the available VRAM.
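    As a rough sketch of that price-to-performance claim (the used prices below are placeholder assumptions, not quotes; VRAM sizes are spec):

    ```python
    # Cost per GB of VRAM, dual used 3090 vs single used 4090.
    # Prices are hypothetical placeholders for illustration only.
    RTX_3090 = {"vram_gb": 24, "price_usd": 700}   # assumed used price
    RTX_4090 = {"vram_gb": 24, "price_usd": 2000}  # assumed used price

    dual_3090_vram = 2 * RTX_3090["vram_gb"]   # 48 GB total
    dual_3090_cost = 2 * RTX_3090["price_usd"]

    print(dual_3090_cost / dual_3090_vram)              # ~29 USD per GB of VRAM
    print(RTX_4090["price_usd"] / RTX_4090["vram_gb"])  # ~83 USD per GB of VRAM
    ```

    Under those assumed prices, the dual-3090 box costs less in total and roughly a third as much per GB of VRAM.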

    • If you run FP32, maybe, but no sane person does that. The tensor performance of the 3090 is also abysmal. If you run BF16 or FP8, stay away from obsolete cards. It's barely usable for LLMs and borderline garbage-tier for video and image gen.

      4 replies →

  • Even with something like a 5090, I’d still run Q4_K_S/Q4_K_M because they’re far more resource-efficient for inference.
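    For a sense of why Q4 quants are attractive, a rough weights-only memory estimate (bits-per-weight averages are approximate for llama.cpp-style quants; KV cache and activations are ignored):

    ```python
    # Rough weight-memory footprint by format, weights only.
    def weight_gb(params_b: float, bits_per_weight: float) -> float:
        """Gigabytes needed to hold params_b billion weights."""
        return params_b * 1e9 * bits_per_weight / 8 / 1e9

    print(round(weight_gb(70, 16.0), 1))  # BF16:   140.0 GB
    print(round(weight_gb(70, 4.85), 1))  # Q4_K_M: ~42.4 GB (approx. avg bits/weight)
    ```

    A 70B model that won't fit even across two 24 GB cards in BF16 comes close to fitting at Q4.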

    Also, the 3090 supports NVLink, which is actually more useful for inference speed than native BF16 support.

    Maybe BF16 matters if you're training?