Comment by zozbot234
6 months ago
Given that the 671B model is reportedly MoE-based, it could well be powering the web interface and API. MoE slashes the per-inference compute cost, since only a small subset of experts is activated per token - and when serving the model for many users you only have to keep a single copy of the model params in memory, so the sheer size doesn't hurt you as much.
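Rough sketch of why that works, as toy code: every expert stays resident in memory (one shared copy), but each token only runs through its top-k experts. All the sizes here (d_model, n_experts, top_k) are made-up illustration values, not DeepSeek's actual configuration.

```python
# Minimal top-k MoE routing sketch (hypothetical sizes, not a real model config).
import numpy as np

rng = np.random.default_rng(0)

d_model = 64     # hidden size (made up)
n_experts = 8    # experts held in memory (made up)
top_k = 2        # experts actually run per token (made up)

# One weight matrix per expert, plus a small router.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router                             # (n_tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]   # indices of chosen experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        chosen = top[i]
        # Softmax over only the chosen experts' scores.
        w = np.exp(logits[i, chosen] - logits[i, chosen].max())
        w /= w.sum()
        for weight, e in zip(w, chosen):
            out[i] += weight * (token @ experts[e])  # only top_k of n_experts matmuls run
    return out

tokens = rng.standard_normal((4, d_model))
_ = moe_layer(tokens)

# Memory holds all experts once; compute per token touches only top_k of them.
total_params = n_experts * d_model * d_model
active_params = top_k * d_model * d_model
print(f"params resident: {total_params}, params used per token: {active_params}")
```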
They can still serve a lot more users on the same number of GPUs (and they reportedly don't have many) by using distilled models.