Comment by computably

1 month ago

Storage is multiple orders of magnitude slower than RAM. Pretty sure it'd be more like 10s/tok than anything reasonable.

1 comment

computably

zozbot234 1 month ago

Active params for this model is 13B which takes about 6.5GB at full native quantization, or perhaps 3.25GB at the 2bit quant that's being provided here, that should take significantly less than 10s to fetch on Mac storage, especially given that some fraction of the model weights would be cached in RAM. Sounds like something worth testing out if it can be made to work out of the box with DS4.