Comment by yieldcrv

7 months ago

On r/localllama someone got the 120B OSS model running in 8 GB of RAM at 35 tokens/sec on CPU (!!) after noticing that 120B uses a sparse mixture-of-experts architecture with only ~5B "active" parameters per token

This makes it incredibly cheap to run on existing, off-the-shelf consumer hardware
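A rough sketch of why the active-parameter count matters (the FLOPs rule of thumb and the exact parameter counts are illustrative assumptions, not official specs):

```python
# Back-of-the-envelope: per-token compute scales with ACTIVE params,
# not total params, in a mixture-of-experts model.
# Numbers below are illustrative assumptions.

total_params = 120e9   # full model weight count
active_params = 5e9    # params actually touched per generated token (assumed)

# Rough rule of thumb: ~2 FLOPs per parameter per generated token
flops_dense = 2 * total_params
flops_moe = 2 * active_params

speedup = flops_dense / flops_moe
print(f"per-token compute reduction vs. a dense 120B model: {speedup:.0f}x")
```

So per-token generation costs roughly what a 5B dense model would, even though the weights total 120B (memory/disk footprint is a separate story).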

It's equally likely that GPT-5 leverages a similar architectural advance, which would give them an order of magnitude more use out of their existing hardware without being bottlenecked by GPU orders and TSMC capacity