Comment by taneq
15 days ago
Unless something’s changed, you will need the whole model on the HPU anyway, no? So way beyond a 4090 regardless.
You can still offload most of the model to RAM and use the GPU for compute, but it's obviously much slower than it would be if everything fit in GPU memory.
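A minimal sketch of that kind of offload using Hugging Face transformers/accelerate, assuming a single ~24 GB card; the model id and memory limits are placeholders, not a recommendation. `max_memory` caps what lands on the GPU and the rest spills to system RAM:

```python
# Sketch only: model id and memory caps are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-big-model"  # hypothetical; substitute the model you mean

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",                        # let accelerate split layers across devices
    max_memory={0: "22GiB", "cpu": "200GiB"}, # cap GPU usage; overflow goes to CPU RAM
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```

Generation still works, but every token has to touch the CPU-resident layers, which is where the slowdown comes from.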
see ktransformers: https://www.reddit.com/r/LocalLLaMA/comments/1jpi0n9/ktransf...
I'm certainly not the brightest person in this thread, but has there been any effort to bucket the model by computational cost, so that the more expensive parts run on the GPU and the less expensive parts run on the CPU?
Take a look at https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...
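To illustrate the idea (not the ktransformers API itself, which uses its own YAML inject rules plus optimized CPU kernels): with a hand-written `device_map` you can pin the compute-heavy attention blocks to the GPU and push the bulky MLP/expert weights to CPU RAM. The submodule names below are typical of Llama-style models and are assumptions; check the actual model's module tree first.

```python
# Illustrative sketch: module names and layer count are model-dependent assumptions.
from transformers import AutoModelForCausalLM

model_id = "some-org/some-moe-model"  # hypothetical placeholder
num_layers = 32                        # depends on the actual model config

device_map = {"model.embed_tokens": 0, "model.norm": 0, "lm_head": 0}
for i in range(num_layers):
    # Keep the relatively small, compute-heavy attention blocks on the GPU...
    device_map[f"model.layers.{i}.self_attn"] = 0
    device_map[f"model.layers.{i}.input_layernorm"] = 0
    device_map[f"model.layers.{i}.post_attention_layernorm"] = 0
    # ...and park the bulky MLP / expert weights in CPU RAM.
    device_map[f"model.layers.{i}.mlp"] = "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device_map)
```

ktransformers automates this kind of split for MoE models and adds faster CPU-side kernels for the expert layers, which is why it gets better numbers than a naive offload.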
A Habana just for inference? Are you sure?
Also, I see the 4-bit quants put it at an H100, which is fine ... I've got those at work. Maybe there will be distilled versions for running at home.
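For reference, loading a 4-bit quant is just a config flag with bitsandbytes; this is a sketch with a placeholder model id, and the actual VRAM needed depends on the model (NF4 roughly quarters the weight footprint versus fp16):

```python
# Sketch, not a recipe: the model id is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-big-model",      # hypothetical placeholder
    quantization_config=bnb_config,
    device_map="auto",              # anything that doesn't fit spills to CPU RAM
)
```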