Comment by danroblew

1 year ago

Cost as in, cost to you? Or cost to serve?

If the cost-to-serve is subsidized by VC money, they aren't getting cheaper, they're just leading you on.

11 comments

danroblew

I've heard from insiders that AWS Nova and Google Gemini - both incredibly cheap - are still charging more for inference than they spend on the server costs to run a query. Since those are among the cheapest models I expect this is true of OpenAI and Anthropic as well.

The subsidies are going to the training costs. I don't know if any model is running at a profit once training/research costs are included.

athrowaway3z 1 year ago

As a society we choose to let the excess wealth pile up into the hands of people that are investing to bring about their own utopia.

If we're stretching, we can talk about opportunity cost. But the people spending and creating the "bubble" don't have better opportunities. They're not nations that see a ROI on things like transportation infrastructure or literacy.

So unless the discussion is taken more broadly and higher taxes are on the table, there really isn't a cost or subsidy imo.

jsnell 1 year ago

The cost to serve.

jcgrillo 1 year ago

> Cost as in, cost to you? Or cost to serve?

This. IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input. These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user and perform hundreds of TB worth of computations per query.

How much would I have to charge for this? Are there any products where the users would actually get enough value out of it to pay what it costs?

Compare to the cost of a user session in a normal database backed web app. Even if that session fans out thousands of backend RPCs across a hundred services, each of those calls executes in milliseconds and requires only a fraction of the LLM's RAM. So I can support thousands of concurrent users per node instead of one.

jsnell 1 year ago
> IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input.
The computations are not O(n^2) in terms of model weights (parameters), but linear. If it were quadratic, the number would be ludicrously large. Like, "it'll take thousands of years to process a single token" large.
(The classic transformers are quadratic on the context length, but that's a much smaller number. And it seems pretty obvious from the increases in context lengths that this is no longer the case in frontier models.)
> These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user
The parameters are static, not mutated during the query. That memory can be shared between the concurrent users. The non-shared per-query memory usage is vastly smaller.
> How much would I have to charge for this?
Empirically, as little as 0.00001 cents per token.
For context, the Bing search API costs 2.5 cents per query.
- jcgrillo 1 year ago
  
  Ah got it, that's more sensible. So is anyone making money with these things yet?
simonw 1 year ago
The efficiency gains over the past 18 months have been incredible. Turns out there was a lot of low hanging fruit to make these things faster, cheaper and more resource efficient. https://simonwillison.net/2024/Dec/31/llms-in-2024/#llm-pric...
- jcgrillo 1 year ago
  
  Interesting. There's obviously been a precipitous drop in the sticker price, but has there really been a concomitant efficiency increase? It's hard to believe the sticker price these companies are charging has anything to do with reality given how massively they're subsidized (free Azure compute, billions upon billions in cash, etc). Is this efficiency trend real? Do you know of any data demonstrating it?
  
  3 replies →