Comment by TOMDM
14 days ago
Yes, "loaded from RAM" versus "loaded into RAM" is the big distinction here.
It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.
It doesn't take too expensive a MacBook to fit 109B 4-bit parameters in RAM.
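For reference, a rough back-of-the-envelope sketch of just the weight footprint (assuming 4-bit weights and ignoring KV cache and runtime overhead, which add a few more GiB):

    # Assumed numbers for illustration: 109B params at 4 bits each.
    params = 109e9
    bytes_per_param = 4 / 8                     # 4-bit quantization = 0.5 bytes/param
    weights_gib = params * bytes_per_param / 2**30
    print(f"~{weights_gib:.1f} GiB of weights")  # ~50.8 GiB, fits in 64 GiB unified memory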
Is a 64 GiB RAM MacBook really that expensive, especially compared with Nvidia GPUs?
That's why I said it's not too expensive.