Comment by martinald

7 months ago

Yes, but prompt evaluation is far faster than token generation because it can be done (mostly) in parallel, so I don't think that's true.
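
To make the "mostly in parallel" point concrete, here is a rough sketch that counts sequential forward passes only. It assumes the prompt is processed in fixed-size chunks; the `prefill_chunk` parameter and its default are hypothetical, not taken from any particular serving stack. It says nothing about total compute, since one prefill pass over thousands of tokens does far more work than one decode pass.

```python
import math

def forward_passes(prompt_tokens, output_tokens, prefill_chunk=8192):
    """Count sequential model passes for one request (toy model).

    prefill_chunk is a hypothetical chunk size, chosen only for illustration.
    """
    # Prompt tokens are evaluated together, one chunk per pass.
    prefill = math.ceil(prompt_tokens / prefill_chunk)
    # Generation is sequential: roughly one pass per output token.
    # (Simplification: the first output token really comes from the last prefill pass.)
    decode = output_tokens
    return prefill, decode

prefill, decode = forward_passes(prompt_tokens=100_000, output_tokens=500)
print(f"prefill passes: {prefill}, decode passes: {decode}")
```

Under these assumptions, a long prompt needs only a handful of sequential steps, while each generated token needs its own step.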

danenania 7 months ago

The problem is that input token cost dominates output token cost for the majority of tasks.

Once you've given the model your prompt and are reading the first output token for classification, you've already paid most of the cost of just prompting it directly.

That said, there could definitely be exceptions for short prompts where output costs dominate input costs. But these aren't usually the interesting use cases.
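
As a rough illustration of the point above, here is a sketch of what the user is billed under per-token pricing. The prices are made-up placeholders (output billed at 4x input), not any provider's real rates, and the function names are hypothetical.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost) in dollars for one request."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost

def input_share(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Fraction of the billed cost that comes from input tokens."""
    input_cost, output_cost = request_cost(
        input_tokens, output_tokens, input_price_per_m, output_price_per_m
    )
    total = input_cost + output_cost
    return input_cost / total if total else 0.0

# Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
long_prompt = input_share(50_000, 1, 1.0, 4.0)    # big prompt, one classification token
short_prompt = input_share(100, 2_000, 1.0, 4.0)  # short prompt, long answer
print(f"long prompt: {long_prompt:.1%} input, short prompt: {short_prompt:.1%} input")
```

With a large prompt and a single output token, nearly all of the bill is input; with a short prompt and a long answer, the balance flips.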

energy123 7 months ago
No, you're talking about costs to the user, which oversimplify the costs that providers bear. One output token with a million input tokens is incredibly cheap for providers.
- danenania 7 months ago
  
  > One output token with a million input tokens is incredibly cheap for providers
  Source? AFAIK this is incorrect.
  
redox99 7 months ago
That's usually not the case for thinking models. Also, hard problems usually have very short prompts.
- danenania 7 months ago
  
  For me personally (I use it mostly for coding and project planning), it's nearly always the case, including with thinking models. I'm usually pasting in a bunch of files, screenshots, etc., and having long conversations. Input nearly always heavily dominates output.
  I don't disagree that there are hard problems which use short prompts, like math homework problems etc., but they mostly aren't what I would categorize as "real work". But of course I can only speak to my own experience /shrug.
  