Comment by yieldcrv

7 months ago

On r/localllama someone got the 120B OSS model running in 8 GB of RAM at 35 tokens/sec on CPU (!!) after noticing that 120B uses a sparse mixture-of-experts architecture with only ~5B "active" parameters per token

This makes it incredibly cheap to run on existing, off-the-shelf consumer hardware
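A rough sketch of why the active-parameter count matters (the FLOPs rule of thumb and the exact parameter counts are illustrative assumptions, not official specs):

```python
# Back-of-the-envelope: per-token compute scales with ACTIVE params,
# not total params, in a mixture-of-experts model.
# Numbers below are illustrative assumptions.

total_params = 120e9   # full model weight count
active_params = 5e9    # params actually touched per generated token (assumed)

# Rough rule of thumb: ~2 FLOPs per parameter per generated token
flops_dense = 2 * total_params
flops_moe = 2 * active_params

speedup = flops_dense / flops_moe
print(f"per-token compute reduction vs. a dense 120B model: {speedup:.0f}x")
```

So per-token generation costs roughly what a 5B dense model would, even though the weights total 120B (memory/disk footprint is a separate story).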

It's equally likely that GPT-5 leverages a similar architectural advance, which would give them an order of magnitude more use out of their existing hardware without being bottlenecked by GPU orders and TSMC capacity