Comment by RandomGerm4n

6 days ago

9b at 4-bit runs at around 60 tok/s on my RTX 4070 with 12GB VRAM, and 35b-A3B runs with around 14 tok/s and partial offloading. For roleplaying I prefer the faster 9b version, but for coding tasks neither is really usable. Claude is still way better, especially if you manage to persuade your employer to give you unlimited access.

> 35b-A3B runs with around 14 tok/s and partial offloading

FYI, that is about what I am seeing with pure CPU inference, so something is likely off with your setup.

Test setup is an Intel 13500 w/ 6 threads and 64GB DDR4 RAM; a newer system should be much faster.
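
For a rough sanity check on numbers like these, decode speed on CPU is usually limited by how fast the active weights can be read from RAM. The sketch below is only a back-of-envelope upper bound, not a benchmark. Everything in it is an assumption: about 3B active parameters per token (going by the "A3B" in the model name), 4-bit weights, and 51.2 GB/s of memory bandwidth (the theoretical peak of dual-channel DDR4-3200, which may not match the machine above). The function name is made up for illustration.

```python
def decode_tok_per_s_upper_bound(active_params_billion: float,
                                 bits_per_weight: float,
                                 bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on decode speed.

    Assumes every active weight is read from memory once per generated
    token and ignores KV cache reads, compute time, and other overhead,
    so real throughput will be lower.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed inputs, not measurements: 3B active params, 4-bit weights,
# dual-channel DDR4-3200 theoretical peak bandwidth.
est = decode_tok_per_s_upper_bound(3.0, 4.0, 51.2)
print(f"upper bound: {est:.1f} tok/s")
```

Under those assumptions the ceiling comes out in the low 30s tok/s, so around 14 tok/s on CPU alone is plausible, and a setup with part of the model on a GPU would normally be expected to do at least that well.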