Comment by DavidSJ

2 months ago

You're presumably using a very small batch size compared to what I described, thus getting very low model FLOP utilization (MFU) and high dollar cost per token.

Yes, a very tiny batch size on average. I haven't optimized for MFU. This setup is optimized for a varying number (~1-60) of active requests while minimizing latency (time to first token, and time to last token measured from the final prompt token), given short- to medium-length known "prompts" and short structured responses, with very little in the way of shared prefixes across concurrent prompts.
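
For concreteness, here's a rough sketch of why tiny decode batches imply low MFU: in the memory-bound regime, each decode step streams the full weights regardless of batch size, so the step's cost is roughly fixed while useful FLOPs scale with the batch. All hardware and model numbers below are assumptions for illustration, not figures from this thread:

```python
# Back-of-the-envelope MFU estimate for batched decoding.
# Assumed numbers (not from the thread): A100-class accelerator, 7B model in BF16.

PEAK_FLOPS = 312e12   # assumed peak dense BF16 throughput, FLOP/s
HBM_BW = 1.6e12       # assumed HBM bandwidth, bytes/s
N_PARAMS = 7e9        # assumed model size, parameters
BYTES_PER_PARAM = 2   # BF16 weights


def decode_mfu(batch_size: int) -> float:
    """MFU for one decode step, ignoring attention/KV-cache and kernel overheads."""
    weight_time = N_PARAMS * BYTES_PER_PARAM / HBM_BW  # time to stream weights once
    flops = 2 * N_PARAMS * batch_size                  # ~2 FLOPs per weight per token
    compute_time = flops / PEAK_FLOPS
    step_time = max(weight_time, compute_time)         # whichever bound dominates
    return flops / (step_time * PEAK_FLOPS)


for b in (1, 8, 60, 512):
    print(f"batch {b:>3}: MFU ~ {decode_mfu(b):.1%}")
```

Under these assumed numbers, batch 1 lands around 0.5% MFU and batch 60 around 30%, which is roughly the gap being discussed: at small batch you pay the full weight-streaming cost per step but amortize it over few tokens.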