Comment by simonw

18 hours ago

The key difference here is that training costs are fixed. If you train a model for $100m, how much of that training cost should you allocate to each token that the model serves?

It's impossible to know, because you don't know how many tokens total will be served by that model until you retire it at some point in the future.

So you can't say "1,000,000 tokens cost $X in inference and $Y in training" because $Y can't be calculated correctly while the model is still serving traffic.
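To put rough numbers on that, here's a quick sketch of how $Y moves with the one input nobody knows in advance. The $100m is the figure from above; the lifetime token volumes are made up purely for illustration, and the function name is mine.

```python
TRAINING_COST_USD = 100_000_000  # the $100m example figure from this comment


def training_cost_per_million_tokens(total_tokens_served: float) -> float:
    """Share of the fixed training cost allocated to each 1,000,000 tokens.

    total_tokens_served is the lifetime total for the model, which is only
    known once the model has been retired.
    """
    if total_tokens_served <= 0:
        raise ValueError("total_tokens_served must be positive")
    return TRAINING_COST_USD / total_tokens_served * 1_000_000


if __name__ == "__main__":
    # Hypothetical lifetime volumes (assumptions, not real data).
    for total in (1e12, 1e13, 1e14):
        y = training_cost_per_million_tokens(total)
        print(f"{total:.0e} tokens served -> ${y:,.2f} of training per 1M tokens")
```

A 100x difference in the guessed lifetime volume gives a 100x difference in $Y, so any per-token training number is really just a guess about future usage.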

So, if you want to have a productive conversation about "margin on inference", it's sensible to look at the cost of serving the tokens independently of the cost of training the underlying model.