Comment by fantispug
1 day ago
Yes, this seems to be a common capability - Anthropic and Mistral have something very similar as do resellers like AWS Bedrock.
I guess it lets them better utilise their hardware in quiet times throughout the day. It's interesting they all picked 50% discount.
Inference throughout scales really well with larger batch sizes (at the cost of latency) due to rising arithmetic intensity and the fact that it's almost always memory BW limited.
50% is my personal threshold for a discount going from not worth it to worth it.
Bedrock has a batch mode but only for claude 3.5 which is like one year old, which isn't very useful.