Comment by p12tic

10 months ago

For all intents and purposes, the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token. 17B parameters run ~6x faster than 109B parameters simply because less data needs to be loaded from RAM.
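
To put rough numbers on the ~6x figure, here is a back-of-the-envelope sketch. It assumes decoding is purely memory-bandwidth bound, so time per token scales with the bytes of weights read. The 4-bit width and the 400 GB/s bandwidth are illustrative assumptions, not measurements, and the function names are made up for this example.

```python
def bytes_per_token(active_params: float, bits_per_param: float) -> float:
    """Bytes of weights that must be read from RAM to produce one token."""
    return active_params * bits_per_param / 8


def tokens_per_second(active_params: float, bits_per_param: float,
                      bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / bytes_per_token(active_params, bits_per_param)


ACTIVE_PARAMS = 17e9   # parameters activated per token (from the comment)
TOTAL_PARAMS = 109e9   # all parameters (from the comment)
BITS = 4               # assumed quantization width
BANDWIDTH = 400e9      # assumed memory bandwidth in bytes/s (hypothetical)

sparse = tokens_per_second(ACTIVE_PARAMS, BITS, BANDWIDTH)
dense = tokens_per_second(TOTAL_PARAMS, BITS, BANDWIDTH)
speedup = sparse / dense

print(f"17B active:  {sparse:.1f} tokens/s (upper bound)")
print(f"109B active: {dense:.1f} tokens/s (upper bound)")
print(f"speedup:     {speedup:.2f}x")
```

The bandwidth and bit width cancel out of the ratio, so under this model the speedup is just 109/17, which is about 6.4x whatever hardware you plug in.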

TOMDM 10 months ago

Yes, loaded from RAM versus loaded into RAM is the big distinction here.

It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.

mlyle 10 months ago
It doesn't take too expensive a MacBook to fit 109B 4-bit parameters in RAM.
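
A quick sanity check on the footprint, as a sketch only. It counts raw weight bytes and ignores quantization scales, the KV cache, activations, and whatever the OS needs, so real usage will be higher. The 64 GiB capacity is an assumed machine size, and the variable names are made up for this example.

```python
TOTAL_PARAMS = 109e9      # parameter count (from the thread)
BITS_PER_PARAM = 4        # 4-bit quantization (from the thread)
RAM_GIB = 64              # assumed machine RAM in GiB

GIB = 2 ** 30             # bytes per GiB

weight_bytes = TOTAL_PARAMS * BITS_PER_PARAM / 8
weights_gib = weight_bytes / GIB
headroom_gib = RAM_GIB - weights_gib

print(f"weights:  {weight_bytes / 1e9:.1f} GB ({weights_gib:.1f} GiB)")
print(f"headroom: {headroom_gib:.1f} GiB left of {RAM_GIB} GiB")
print("fits" if headroom_gib > 0 else "does not fit")
```

So the raw weights come to roughly 54.5 GB, which leaves some room on a 64 GiB machine, though not a lot once the context and the rest of the system are accounted for.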
- utopcell 10 months ago
  
  Is a MacBook with 64 GiB of RAM really that expensive, especially compared to NVIDIA GPUs?
  