Comment by highfrequency
6 months ago
Interesting that for these small models, it is optimal for the embedding parameters to be a huge fraction of the total (170e6/250e6) = 68%!
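For anyone who wants to check the arithmetic, here is a minimal sketch using only the two figures quoted above. The helper name `embedding_fraction` is made up for illustration and is not from any library; the split into embedding and total parameter counts is taken as given from the comment.

```python
def embedding_fraction(embedding_params: float, total_params: float) -> float:
    """Return the share of total parameters held by the embedding table."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    return embedding_params / total_params


# Figures quoted in the comment: 170e6 embedding parameters out of 250e6 total.
fraction = embedding_fraction(170e6, 250e6)
print(f"{fraction:.0%}")
```

The remaining share (1 minus this fraction) is what is left for the non-embedding weights.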