Comment by fy20
6 days ago
I guess you are doing offloading to system RAM? What tokens per second do you get? I've got an old gaming laptop with a RTX 3060, sounds like it could work well as a local inference server.
> I guess you are doing offloading to system RAM? What tokens per second do you get?
I'm getting about 15-20 tok/s with a 128k context window using the Q3_K_S version.
For running the server:
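Something like this (a minimal llama.cpp llama-server sketch; the GGUF filename and the -ngl layer count are placeholders you'd tune for your own VRAM):

```bash
# Partial GPU offload with llama.cpp: the -ngl layers go to the GPU,
# the rest of the model stays in system RAM.
# Model path and -ngl value are placeholders, not an exact setup.
llama-server \
  -m ./model-Q3_K_S.gguf \
  -c 131072 \
  -ngl 24 \
  --host 127.0.0.1 \
  --port 8080
```

It exposes an OpenAI-compatible API, so you can point any client at http://127.0.0.1:8080/v1.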
In the article, they claim up to 25 tok/s for the LARGEST model with a 24GB VRAM card. You need a lot of system RAM, obviously.