← Back to context

Comment by alexwebb2

18 hours ago

I think your intuition on this might be lagging a fair bit behind the current state of LLMs.

System message: answer with just "service" or "product"

User message (variable): 20 bottles of ferric chloride

Response: product

Model: OpenAI GPT-4o-mini

$0.075/1Mt batch input * 27 input tokens * 10M jobs = $20.25

$0.300/1Mt batch output * 1 output token * 10M jobs = $3.00

It's a sub-$25 job.

You'd need to be doing 20 times that volume every single day to even start to justify hiring an NLP engineer instead.

You might be able to use an even cheaper model. Google Gemini 1.5 Flash 8B is Input: $0.04 / Output: $0.15 per 1M tokens.

17 input tokens and 2 output tokens * 10 million jobs = 170,000,000 input tokens, 20,000,000 output tokens... which costs a total of $6.38 https://tools.simonwillison.net/llm-prices

As for rate limits, https://ai.google.dev/pricing#1_5flash-8B says 4,000 requests per minute and 4 million tokens per minute - so you could run those 10 million jobs in about 2500 minutes or 42 hours. I imagine you could pull a trick like sending 10 items in a single prompt to help speed that up, but you'd have to test carefully to check the accuracy effects of doing that.

The question is not average cost but marginal cost of quality - same as voice recognition, which had relatively low uptake even at ~2-4% error rates due to context switching costs for error correction.

So you'd have to account for the work of catching the residue of 2-8%+ error from LLMs. I believe the premise is for NLP, that's just incremental work, but for LLM's that could be impossible to correct (i.e., cost per next-percentage-correction explodes), for lack of easily controllable (or even understandable) models.

But it's most rational in business to focus on the easy majority with lower costs, and ignore hard parts that don't lead to dramatically larger TAM.

  • I am absolutely not an expert in NLP, but I wouldn't be surprised if for many kinds of problems LLMs would have far less error rate, than any NLP software.

    Like, lemmation is pretty damn dumb in NLP, while a better LLM model will be orders of magnitude more correct.

This assumes you don’t care about our rapidly depleting carbon budget.

No matter how much energy you save personally, running your jobs on Sam A’s earth killer ten thousand cluster of GPUs is literally against your own self interest of delaying climate disasters.

LLM have huge negative externalities, there is a moral argument to only use them when other tools won’t work.

How do you validate these classifications?

  • Isn't it easier and cheaper to validate than to classify (requires expensive engineers)? I mean the skill is not as expensive - many companies do this at scale.

  • The same way you check performance for any problem like this: by creating one or more manually-labeled test datasets, randomly sampled from the target data and looking at the resulting precision, recall, f-scores etc. LLMs change pretty much nothing about evaluation for most NLP tasks.

Yeah... Let's talk time needed for 10M prompts and how that fits into a daily pipeline. Enlighten us, please.

  • Run them all in parallel with a cloud function in less than a minute?

    • Yes, how did I not think of throwing more money at cloud providers on top of feeding open ai, when I could have just code a simple binary classifier and run everything on something as insignificant as an 8-th geh, quad core i5....

      2 replies →

    • Obviously all the LLM API providers have a rate limit. Not a fan of GP's sarcastic tone, but I suppose many of us would like to know roughly what that limit would be for a small business using such APIs.

      2 replies →

>You'd need to be doing 20 times that volume every single day to even start to justify hiring an NLP engineer instead.

How much for the “prompt engineer”? Who is going to be doing the work and validating the output?

  • You do not need a prompt engineer to create: “answer with just "service" or "product"”

    Most classification prompts can be extremely easy and intuitive. The idea you have to hire a completely different prompt engineer is kind of funny. In fact you might be able to get the llm itself to help revise the prompt.

  • All software engineers are (or can be) prompt engineers, at least to the level of trivial jobs like this. It's just an API call and a one-liner instruction. Odds are very good at most companies that they have someone on staff who can knock this out in short order. No specialized hiring required.

  • Prompt engineering is less and less of an issue the simpler the job is and the more powerful the model is. You also don't need someone with deep nlp knowledge to measure and understand the output.