Comment by famouswaffles
12 hours ago
Squeezing a model like this, complete with 'big model smell', into 16GB... honestly, it just isn't feasible today.
It'll require some kind of:
- breakthrough in architecture or
- breakthrough in hardware or
- some breakthrough quantization technique
The problem is that all the parameters need to be in memory, even the ones that aren't active (as in Mixture-of-Experts models), because swapping parameters in and out of RAM is far too slow.
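Rough numbers make the constraint concrete. A minimal sketch (weights only, ignoring KV cache, activations, and runtime overhead; the model sizes and bit-widths are purely illustrative):

```python
# Back-of-the-envelope weight-footprint arithmetic (illustrative, not benchmarks).
# With a standard MoE router, any expert can fire on any token, so the TOTAL
# parameter count has to sit in memory, not just the active slice.

def weights_gb(n_params: float, bits: int) -> float:
    """Raw weight footprint in GB for n_params at the given bit-width."""
    return n_params * bits / 8 / 1e9

BUDGET_GB = 16

for n_params in (14e9, 70e9, 400e9):   # hypothetical total parameter counts
    for bits in (16, 8, 4):
        gb = weights_gb(n_params, bits)
        verdict = "fits" if gb <= BUDGET_GB else "does not fit"
        print(f"{n_params/1e9:>5.0f}B total @ {bits:>2}-bit: "
              f"{gb:7.1f} GB -> {verdict} in {BUDGET_GB} GB")
```

Even at 4-bit, 16GB tops out around ~32B total parameters if every expert has to stay resident, which is why "big model smell" and a 16GB budget don't go together without one of the breakthroughs above.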
"That’s where EMO comes in.
We show that EMO – a 1B-active, 14B-total-parameter (8-expert active, 128-expert total) MoE trained on 1 trillion tokens – supports selective expert use: for a given task or domain, we can use only a small subset of experts (just 12.5% of total experts) while retaining near full-model performance."
https://allenai.org/blog/emo
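If the quoted result holds up, the arithmetic changes. A rough sketch of what keeping only 12.5% of experts resident would mean; the assumptions here are mine, not the blog's: equal-sized experts, and the shared-vs-expert parameter split back-solved from the 1B-active / 14B-total figures:

```python
# Sketch of the memory implication of selective expert use (assumptions mine).

total_params   = 14e9   # total MoE parameters (from the quoted post)
active_params  = 1e9    # parameters used per token (from the quoted post)
n_experts      = 128
active_experts = 8      # experts routed to per token
kept_experts   = 16     # 12.5% of experts kept resident for one domain

# active = shared + (active_experts / n_experts) * expert_params
# total  = shared + expert_params
frac = active_experts / n_experts
expert_params = (total_params - active_params) / (1 - frac)
shared_params = total_params - expert_params

resident = shared_params + (kept_experts / n_experts) * expert_params
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit resident weights: {resident * bits / 8 / 1e9:4.1f} GB")
```

Under those assumptions the resident weights land in the low single-digit-GB range even at 16-bit, so the interesting angle isn't swapping experts per token (which really is too slow) but picking a domain-specific subset up front and never loading the rest.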