Comment by zozbot234
6 months ago
Given that the 671B model is reportedly MoE-based, it could well be powering the web interface and API. MoE slashes the per-inference compute cost, since only a small subset of experts is activated per token - and when serving the model for many users you only have to keep a single copy of the model params in memory, so the sheer size doesn't hurt you as much.
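Rough sketch of why that works, as toy code: every expert stays resident in memory (one shared copy), but each token only runs through its top-k experts. All the sizes here (d_model, n_experts, top_k) are made-up illustration values, not DeepSeek's actual configuration.

```python
# Minimal top-k MoE routing sketch (hypothetical sizes, not a real model config).
import numpy as np

rng = np.random.default_rng(0)

d_model = 64     # hidden size (made up)
n_experts = 8    # experts held in memory (made up)
top_k = 2        # experts actually run per token (made up)

# One weight matrix per expert, plus a small router.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                             # (n_tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]   # indices of chosen experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        chosen = top[i]
        # Softmax over only the chosen experts' scores.
        w = np.exp(logits[i, chosen] - logits[i, chosen].max())
        w /= w.sum()
        for weight, e in zip(w, chosen):
            out[i] += weight * (token @ experts[e])  # only top_k of n_experts matmuls run
    return out

tokens = rng.standard_normal((4, d_model))
_ = moe_layer(tokens)

# Memory holds all experts once; compute per token touches only top_k of them.
total_params = n_experts * d_model * d_model
active_params = top_k * d_model * d_model
print(f"params resident: {total_params}, params used per token: {active_params}")
```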
They can still serve a lot more users on the same number of GPUs (and they reportedly don't have many) by using distilled models.