Comment by DavidSJ

2 months ago

You're presumably using a very small batch size compared to what I described, thus getting very low model FLOP utilization (MFU) and high dollar cost per token.

Yes, a very tiny batch size on average. I haven't optimized for MFU. This setup is optimized for a varying number (~1-60) of active requests while minimizing latency (time to first token, and time to last token measured from the final prompt token), given short- to medium-length known "prompts" and short structured responses, with very little in the way of shared prefixes across concurrent prompts.
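
For concreteness, here's a rough sketch of why tiny decode batches imply low MFU: in the memory-bound regime, each decode step streams the full weights regardless of batch size, so the step's cost is roughly fixed while useful FLOPs scale with the batch. All hardware and model numbers below are assumptions for illustration, not figures from this thread:

```python
# Back-of-the-envelope MFU estimate for batched decoding.
# Assumed numbers (not from the thread): A100-class accelerator, 7B model in BF16.

PEAK_FLOPS = 312e12   # assumed peak dense BF16 throughput, FLOP/s
HBM_BW = 1.6e12       # assumed HBM bandwidth, bytes/s
N_PARAMS = 7e9        # assumed model size, parameters
BYTES_PER_PARAM = 2   # BF16 weights


def decode_mfu(batch_size: int) -> float:
    """MFU for one decode step, ignoring attention/KV-cache and kernel overheads."""
    weight_time = N_PARAMS * BYTES_PER_PARAM / HBM_BW  # time to stream weights once
    flops = 2 * N_PARAMS * batch_size                  # ~2 FLOPs per weight per token
    compute_time = flops / PEAK_FLOPS
    step_time = max(weight_time, compute_time)         # whichever bound dominates
    return flops / (step_time * PEAK_FLOPS)


for b in (1, 8, 60, 512):
    print(f"batch {b:>3}: MFU ~ {decode_mfu(b):.1%}")
```

Under these assumed numbers, batch 1 lands around 0.5% MFU and batch 60 around 30%, which is roughly the gap being discussed: at small batch you pay the full weight-streaming cost per step but amortize it over few tokens.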