Comment by littlestymaar

10 months ago

You can still offload most of the model to RAM and use the GPU for compute, but it's obviously much slower than it would be if everything were in GPU memory.

See ktransformers for an example of this approach: https://www.reddit.com/r/LocalLLaMA/comments/1jpi0n9/ktransf...
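
As a rough illustration, llama-cpp-python exposes this kind of partial offloading via n_gpu_layers (a minimal sketch; the model path and layer count below are placeholders, and it assumes a GPU-enabled build of llama.cpp):

    from llama_cpp import Llama

    # Load a GGUF model, keeping only the first 20 transformer
    # layers in VRAM; the remaining layers stay in system RAM
    # and run on the CPU.
    llm = Llama(
        model_path="./models/example-70b.Q4_K_M.gguf",  # hypothetical file
        n_gpu_layers=20,
    )

    out = llm("Q: What is partial offloading? A:", max_tokens=64)
    print(out["choices"][0]["text"])

Setting n_gpu_layers higher moves more of the model into VRAM (faster) until you run out; setting it to 0 runs everything on the CPU.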