Comment by TOMDM
14 days ago
Yes, "loaded from RAM" versus "loaded into RAM" is the big distinction here.
It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.
It doesn't take too expensive a MacBook to fit 109B 4-bit parameters in RAM.
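For reference, a rough back-of-the-envelope sketch of just the weight footprint (assuming 4-bit weights and ignoring KV cache and runtime overhead, which add a few more GiB):

    # Assumed numbers for illustration: 109B params at 4 bits each.
    params = 109e9
    bytes_per_param = 4 / 8                     # 4-bit quantization = 0.5 bytes/param
    weights_gib = params * bytes_per_param / 2**30
    print(f"~{weights_gib:.1f} GiB of weights")  # ~50.8 GiB, fits in 64 GiB unified memory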
Is a 64 GiB RAM MacBook really that expensive, especially compared with Nvidia GPUs?
That's why I said it's not too expensive.