Comment by Alifatisk

6 months ago

I think it's because of a combination of the MoE model architecture and inference being run in large batches, in parallel.
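
A rough sketch of those two ideas, as a toy model only: it assumes top-k expert routing (the comment doesn't say which routing scheme is used), and the sizes `NUM_EXPERTS`, `TOP_K`, and `BATCH` and the helper `route_top_k` are made up for illustration. It shows that each token only touches a fraction of the experts, and that a large batch lets each expert handle all of its tokens in one go.

```python
import random


def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical sizes, not taken from any real model.
NUM_EXPERTS, TOP_K, BATCH = 8, 2, 16

# Random router scores stand in for a learned gating network (assumption).
rng = random.Random(0)
batch_scores = [[rng.random() for _ in range(NUM_EXPERTS)] for _ in range(BATCH)]
assignments = [route_top_k(scores, TOP_K) for scores in batch_scores]

# MoE part: each token runs only TOP_K of the NUM_EXPERTS experts.
active_fraction = TOP_K / NUM_EXPERTS

# Batching part: group tokens by expert so that each expert can process
# all of its assigned tokens together instead of one token at a time.
tokens_per_expert = {e: [] for e in range(NUM_EXPERTS)}
for token_index, experts in enumerate(assignments):
    for e in experts:
        tokens_per_expert[e].append(token_index)

print(f"experts active per token: {active_fraction:.0%}")
for e, tokens in tokens_per_expert.items():
    print(f"expert {e}: {len(tokens)} tokens in one batched call")
```

With a bigger batch, each expert gets more tokens per call, so the cost of loading that expert's weights is spread over more work.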
