← Back to context

Comment by TOMDM

14 days ago

Yes loaded from RAM and loaded to RAM are the big distinction here.

It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.