Comment by lostmsu
10 days ago
> To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.
This is wrong. Current models still use some full attention layers AFAIK, and their per-token computational cost grows linearly with position in the sequence.
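To make that concrete, here is a toy sketch (illustrative constants only, not any real model's parameters): under full attention, the token at position t attends to all t earlier positions, so per-token work is linear in position and total work over a sequence is quadratic.

```python
# Toy model of full-attention cost: constants and head/dim factors omitted.

def per_token_attention_ops(t: int) -> int:
    # generating the token at position t touches all t prior positions
    return t

def cumulative_ops(n_tokens: int) -> int:
    # total work over the whole sequence: sum of 1..n = n(n+1)/2, i.e. O(n^2)
    return sum(per_token_attention_ops(t) for t in range(1, n_tokens + 1))

print(per_token_attention_ops(1_000_000))  # -> 1000000: the millionth token is ~1,000,000x the first
print(cumulative_ops(10_000))              # -> 50005000
```

So per-token cost is linear, and it is only the *aggregate* over the sequence that is quadratic, which is the distinction the quoted claim misses.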
I have seen exactly one model that charges more for longer contexts:
https://ai.google.dev/gemini-api/docs/pricing
Gemini 1M context window
That said, the cost increase isn't very significant: approximately 2x at the longer end of the context window.
This is in stark contrast with the quadratic phenomenon claimed by the article.
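The contrast can be shown with marginal (per-token) rates. The tier threshold and base rate below are made-up placeholders, loosely modeled on Gemini-style long-context tiers, not actual prices:

```python
def tiered_rate(pos: int, base: float = 1.0, threshold: int = 128_000) -> float:
    # hypothetical tiered pricing: the rate simply doubles past the threshold
    return base if pos <= threshold else 2 * base

def quadratic_rate(pos: int, base: float = 1.0) -> float:
    # marginal price proportional to position (total cost is then quadratic)
    return base * pos

# marginal price of the millionth token relative to the first:
print(tiered_rate(1_000_000) / tiered_rate(1))     # -> 2.0
print(quadratic_rate(1_000_000) / quadratic_rate(1))  # -> 1000000.0
```

A 2x step at a tier boundary is nothing like a marginal rate that grows a million-fold, which is why tiered pricing is evidence against a quadratic cost structure being passed through to customers.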
They just do averaging. Imagine a quadratic pricing structure. Who'd want to deal with it?
I guess 1.0001^2 is quadratic too, but note how it really only charges you 1.5x for more output tokens. Even if cost were quadratic with output length here, we are talking about a very small difference, nothing like the quadratic cost structure proposed by OP:
>Pop quiz: at what point in the context length of a coding agent are cached reads costing you half of the next API call? By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.
These are two different cost components, and the one you bring up is minor. OP is talking about a cost that, at 1M output tokens, would make each token 20x more expensive; you are talking about a cost that at 1M output tokens would be 1.5x. Different things.
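OP's cache-read point can be sketched with a toy agent turn. The prices are made up for illustration (cached reads at ~10% of the fresh input rate, a common discount, though exact ratios vary by vendor):

```python
# Hypothetical prices, per token (i.e. $/1M tokens divided by 1M):
FRESH = 1.0 / 1_000_000   # assumed fresh-input rate
CACHED = 0.1 / 1_000_000  # assumed cache-read rate, ~10x cheaper

def turn_cost(context_len: int, new_tokens: int) -> tuple[float, float]:
    # each agent turn re-reads the entire prior context from cache
    # and pays the fresh rate only for the newly appended tokens
    return context_len * CACHED, new_tokens * FRESH

cache_part, fresh_part = turn_cost(context_len=50_000, new_tokens=1_000)
print(f"cache: {cache_part:.4f}, fresh: {fresh_part:.4f}")
# by 50k tokens of context, cached reads already dominate the turn's cost
```

Even discounted 10x, the cache-read term scales with the whole conversation while the fresh term scales only with the new turn, so it eventually dominates; that is a billing-accumulation effect, distinct from the per-token compute growth above.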
The first is an imperfection of the API encapsulation; the latter may be a natural cost phenomenon related to the internals of state-of-the-art algorithms.