Comment by axegon_

1 year ago

No, it has not and will not in the foreseeable future. This is one of my responsibilities at work. LLMs are not feasible when you have a dataset of 10 million items that you need to classify relatively fast and at a reasonable cost. LLMs are great at mid-level complexity tasks given a reasonable volume of data - they can take away the tedious job of figuring out what you are looking at or even come up with some basic mapping. But for anything at large volumes? Nah. Real life example: "is '20 bottles of ferric chloride' a service or a product?"

One prompt? Fair. 10? Still ok. 100? You're pushing it. 10M - get help.


segmondy 1 year ago

You are not pushing it at 100. I can classify "is '20 bottles of ferric chloride' a service or a product?" in probably 2 seconds with a 4090. Something that most people don't realize is that you can run multiple inferences at once. So with something like a 4090 and some solid few-shot examples, instead of having it classify one example at a time, you can do 5 per prompt. We can probably run 100 parallel inference at 5 at a time, for a rate of about 250 a second on a 4090. So in 11 hours I'll be done with the 10M items. I'm going with a 7-8B model too. Some of the 1.5-3B models are great and will run even faster. A competent developer who knows Python and how to use an OpenAI-compatible API can put this together in 10-15 minutes, with no data science, scikit-learn, or other NLP toolchain experience.

So for personal, medium or even large workloads, I think it has killed it. The workload needs to be extremely large for that not to hold. If you are classifying or segmenting comments on a social media platform where you need to deal with billions a day, then an LLM would be a very inefficient approach, but for 90+% of use cases, I think it wins.

I'm assuming you are going to run it locally because everyone is paranoid about their data. It's even cheaper if you use a cloud API.
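A minimal sketch of the setup described above, assuming a hypothetical `call_model` function that sends one prompt to an OpenAI-compatible server and returns one label per item. The names `chunked`, `classify_all`, and `call_model` are illustrative and not from any library; the group size of 5 and the 100 concurrent requests are the figures from the comment, not measured values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[list]:
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def classify_all(
    items: Sequence[str],
    call_model: Callable[[List[str]], List[str]],
    group_size: int = 5,
    max_workers: int = 100,
) -> List[str]:
    """Classify items in small groups, with many groups in flight at once.

    `call_model` is a hypothetical stand-in for one request to an
    OpenAI-compatible endpoint; it must return one label per input item.
    """
    groups = list(chunked(items, group_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so labels line up with items.
        results = list(pool.map(call_model, groups))
    return [label for group in results for label in group]
```

Threads are enough here because each worker would spend its time waiting on the server; the actual batching of requests on the GPU happens inside the inference server, not in this client code.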

griomnib 1 year ago

Or you can build a DistilBERT model and get your egregiously inefficient 2 seconds down to tens of milliseconds.
mikeocool 1 year ago
If you have to classify user input as they’re inputting it to provide a response — so it can’t be batched - 2 seconds could potentially be really slow.
Though LLMs sure have made creating training data to train old school models for those cases a lot easier.
- griomnib 1 year ago
  
  Yeah, that’s what I do: use an LLM to help make training data for small models. It’s so much more efficient, fast, and ergo, scalable.
magic_hamster 1 year ago

Yes and no. Having used these tools extensively I think it will be some time before LLMs are truly performant. Even smaller models can't be compared to running optimized code with efficient data structures. And smaller models (in general) do reduce the quality of your results in most cases. Maybe LLMs will kill off NLP and other pursuits pretty soon, but at the moment, each has its tradeoffs.
WildGreenLeave 1 year ago
Correct me if I'm wrong, but if you run multiple inferences at the same time on the same GPU, you will need to load multiple models in the VRAM and the models will fight for resources, right? So running 10 parallel inferences will slow everything down 5 times, right? Or am I missing something?
- Palmik 1 year ago
  
  Inference for a single example is memory bound. By doing batch inference, you can interleave computation with memory loads, without losing much speed (up until you cross the compute-bound threshold).
- bavell 1 year ago
  
  You will most likely be using the same model so just 1 to load into vram.
- aeternum 1 year ago
  
  No, the key is to use the full context window so you structure the prompt as something like: For each line below, repeat the line, add a comma then output whether it most closely represents a product or service:
  20 bottles of ferric chloride
  salesforce
  ...
  
rldjbpin 1 year ago

even more naive way - just club several classification requests into one prompt. in practice, this is not production-ready, as the llm output does not always contain the same number of results as inputs (sometimes even more than were inputted!)
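One way to guard against the mismatched-count problem is to number the input lines and reject any response whose rows do not line up. This is a rough sketch under assumptions: the prompt wording and the helper names `build_batch_prompt` and `parse_batch_response` are made up for illustration, and a real pipeline would retry or fall back to single-item prompts when parsing fails.

```python
from typing import Dict, List

ALLOWED = {"product", "service"}


def build_batch_prompt(items: List[str]) -> str:
    """Put several items into one prompt, one per numbered line."""
    header = (
        "For each numbered line below, answer with the line number, a comma, "
        "then 'product' or 'service'. Output exactly one line per input line.\n"
    )
    body = "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))
    return header + body


def parse_batch_response(response: str, expected: int) -> List[str]:
    """Parse 'N, label' lines; raise if rows are missing, extra, or malformed."""
    labels: Dict[int, str] = {}
    for line in response.strip().splitlines():
        if not line.strip():
            continue  # tolerate blank lines between rows
        number, _, label = line.partition(",")
        number = number.strip().rstrip(".")
        label = label.strip().lower()
        if not number.isdigit() or label not in ALLOWED:
            raise ValueError(f"unparseable line: {line!r}")
        labels[int(number)] = label
    if sorted(labels) != list(range(1, expected + 1)):
        raise ValueError(f"expected rows 1..{expected}, got {sorted(labels)}")
    return [labels[i] for i in range(1, expected + 1)]
```

Keying results by line number rather than by position means a dropped or invented row is detected instead of silently shifting every label after it.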
vrighter 1 year ago

two seconds is a VERY VERY VERY long time. That is mind-bogglingly, insanely slow.
mystified5016 1 year ago

At 2s per query for 10M entries, that's about 231 days to run through the database.
axegon_ 1 year ago
FFS... "Lots of writers, few readers". Read again and do the math: 2 seconds, multiply that by 10 million records which contain this, as well as "alarm installation in two locations" and a whole bunch of other crap with little to no repetition (<2%) and where does that get you? 2 * 10,000,000 = 20,000,000 SECONDS!!!! A day has 86,400 seconds (24 * 3600 = 86,400). The data pipeline needs to finish in <24 hours. Everyone needs to get this into their heads somehow: LLMs are not a silver bullet. They will not cure cancer anytime soon, nor will they be effective or cheap enough to run at massive scale. And I don't mean cheap as in "oh, just get an OpenAI subscription hurr durr". Throwing money mindlessly at something is never an effective way to solve a problem.
- why_only_15 1 year ago
  
  Assuming the 10M records is ~2000M input tokens + 200M output tokens, this would cost $300 to classify using llama-3.3-70b[1]. If using llama lets you do this in say one day instead of two days for a traditional NLP pipeline, it's worthwhile.
  [1]: https://openrouter.ai/meta-llama/llama-3.3-70b-instruct
  
- gbnwl 1 year ago
  
  Why are you using 2 seconds? The commenter you are responding to hypothesized being able to do 250/s based on "100 parallel inference at 5 at a time". Not speaking to the validity of that, but find it strange that you ran with the 2 seconds number after seemingly having stopped reading after that line, while yourself lamenting people don't read and telling them to "read again".
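The two numbers being argued over in this subthread come from the same arithmetic applied to different throughput assumptions. A small sketch, using only figures quoted in the thread (10M records, 2 seconds per record, a hypothesized 250 records per second); the function names are illustrative:

```python
RECORDS = 10_000_000
SECONDS_PER_DAY = 24 * 3600  # 86,400


def days_sequential(seconds_per_item: float, records: int = RECORDS) -> float:
    """Wall-clock days if records are processed strictly one at a time."""
    return seconds_per_item * records / SECONDS_PER_DAY


def hours_at_rate(items_per_second: float, records: int = RECORDS) -> float:
    """Wall-clock hours at a sustained aggregate throughput."""
    return records / items_per_second / 3600


print(f"{days_sequential(2):.1f} days")   # 2 s per record, no parallelism
print(f"{hours_at_rate(250):.1f} hours")  # hypothesized 250 records/s
```

This prints 231.5 days and 11.1 hours. Whether 250 records per second is actually achievable on a given GPU is the open question; the arithmetic only shows that the sub-24-hour requirement hinges on that throughput, not on the 2-second latency of a single request.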
  

alexwebb2 1 year ago

I think your intuition on this might be lagging a fair bit behind the current state of LLMs.

System message: answer with just "service" or "product"

User message (variable): 20 bottles of ferric chloride

Response: product

Model: OpenAI GPT-4o-mini

$0.075/1Mt batch input * 27 input tokens * 10M jobs = $20.25

$0.300/1Mt batch output * 1 output token * 10M jobs = $3.00

It's a sub-$25 job.

You'd need to be doing 20 times that volume every single day to even start to justify hiring an NLP engineer instead.
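The estimate above is easy to re-run with other token counts or prices. A minimal sketch using the figures quoted in the comment (27 input tokens, 1 output token, $0.075 and $0.300 per 1M tokens for batch input and output); `job_cost_usd` is a made-up helper name, and the prices are whatever the provider charged at the time, so treat them as inputs rather than constants:

```python
def job_cost_usd(jobs, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Return (input_cost, output_cost) in USD for `jobs` identical requests."""
    input_cost = jobs * input_tokens / 1_000_000 * input_price_per_m
    output_cost = jobs * output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost


# Figures quoted in the comment above; prices are per 1M tokens.
input_cost, output_cost = job_cost_usd(
    jobs=10_000_000, input_tokens=27, output_tokens=1,
    input_price_per_m=0.075, output_price_per_m=0.300,
)
print(f"input ${input_cost:.2f}, output ${output_cost:.2f}, "
      f"total ${input_cost + output_cost:.2f}")
```

With those inputs this prints input $20.25, output $3.00, total $23.25, which matches the "sub-$25" figure.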

simonw 1 year ago

You might be able to use an even cheaper model. Google Gemini 1.5 Flash 8B is Input: $0.04 / Output: $0.15 per 1M tokens.
17 input tokens and 2 output tokens * 10 million jobs = 170,000,000 input tokens, 20,000,000 output tokens... which costs a total of $6.38 https://tools.simonwillison.net/llm-prices
As for rate limits, https://ai.google.dev/pricing#1_5flash-8B says 4,000 requests per minute and 4 million tokens per minute - so you could run those 10 million jobs in about 2500 minutes or 42 hours. I imagine you could pull a trick like sending 10 items in a single prompt to help speed that up, but you'd have to test carefully to check the accuracy effects of doing that.
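The rate-limit estimate can be checked the same way: the job takes as long as the tighter of the two limits allows. A short sketch using the numbers quoted above (4,000 requests per minute, 4 million tokens per minute, 17 input plus 2 output tokens per job). It assumes both input and output tokens count toward the token limit, which is an assumption, not something the pricing page quoted here confirms; `minutes_under_limits` is an illustrative name.

```python
def minutes_under_limits(jobs, tokens_per_job,
                         requests_per_minute, tokens_per_minute):
    """Minutes needed when both a request limit and a token limit apply."""
    by_requests = jobs / requests_per_minute
    by_tokens = jobs * tokens_per_job / tokens_per_minute
    # The slower of the two constraints determines the total time.
    return max(by_requests, by_tokens)


minutes = minutes_under_limits(
    jobs=10_000_000, tokens_per_job=17 + 2,
    requests_per_minute=4_000, tokens_per_minute=4_000_000,
)
print(f"{minutes:.0f} minutes, about {minutes / 60:.0f} hours")
```

Here the request limit (2,500 minutes) dominates the token limit (47.5 minutes) by a wide margin, which is why packing several items into one request is the obvious lever, subject to the accuracy caveat mentioned above.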
w10-1 1 year ago
The question is not average cost but marginal cost of quality - same as voice recognition, which had relatively low uptake even at ~2-4% error rates due to context switching costs for error correction.
So you'd have to account for the work of catching the residue of 2-8%+ error from LLMs. I believe the premise is that for NLP that's just incremental work, but for LLMs it could be impossible to correct (i.e., cost per next-percentage-correction explodes), for lack of easily controllable (or even understandable) models.
But it's most rational in business to focus on the easy majority with lower costs, and ignore hard parts that don't lead to dramatically larger TAM.
- gf000 1 year ago
  
  I am absolutely not an expert in NLP, but I wouldn't be surprised if for many kinds of problems LLMs had a far lower error rate than any NLP software.
  Like, lemmatization is pretty damn dumb in NLP, while a better LLM will be orders of magnitude more correct.
griomnib 1 year ago
This assumes you don’t care about our rapidly depleting carbon budget.
No matter how much energy you save personally, running your jobs on Sam A’s earth-killing cluster of ten thousand GPUs is literally against your own self-interest in delaying climate disasters.
LLMs have huge negative externalities; there is a moral argument to only use them when other tools won’t work.
- amanaplanacanal 1 year ago
  
  It's digging fossil carbon out of the ground that's the problem, not using electricity. Switch to electricity not from fossil carbon and you're golden.
  
- renewiltord 1 year ago
  
  Haha, this is pretty good. I’m going to take a plane to SF while I laugh at this.
elicksaur 1 year ago
How do you validate these classifications?
- bugglebeetle 1 year ago
  
  The same way you check performance for any problem like this: by creating one or more manually-labeled test datasets, randomly sampled from the target data and looking at the resulting precision, recall, f-scores etc. LLMs change pretty much nothing about evaluation for most NLP tasks.
- segmondy 1 year ago
  
  The same way you validate it if you didn't use an LLM.
- jeswin 1 year ago
  
  Isn't it easier and cheaper to validate than to classify (requires expensive engineers)? I mean the skill is not as expensive - many companies do this at scale.
- scarface_74 1 year ago
  
  You need a domain expert either way. I mentioned in another reply that one of my niches is implementing call centers with Amazon Connect and Amazon Lex (the NLP engine).
  https://news.ycombinator.com/item?id=42748189
  I don’t know beforehand the domain they are working in, so I do validation testing with them.
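The evaluation bugglebeetle describes above is the same regardless of what produced the predictions. A minimal sketch of computing precision, recall, and F1 for one class against a manually labeled sample; the function name and the tiny example labels are made up for illustration, and in practice the sample would be randomly drawn from the target data and much larger.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for the `positive` class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical hand-labeled sample vs. classifier output.
gold = ["product", "product", "service", "service"]
predicted = ["product", "service", "service", "service"]
precision, recall, f1 = precision_recall_f1(gold, predicted, positive="service")
```

The same function works whether `predicted` came from an LLM, a DistilBERT model, or a naive Bayes classifier, which is the point being made in the replies.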
axegon_ 1 year ago
Yeah... Let's talk time needed for 10M prompts and how that fits into a daily pipeline. Enlighten us, please.
- FloorEgg 1 year ago
  
  Run them all in parallel with a cloud function in less than a minute?
  
LeafItAlone 1 year ago
>You'd need to be doing 20 times that volume every single day to even start to justify hiring an NLP engineer instead.
How much for the “prompt engineer”? Who is going to be doing the work and validating the output?
- blindriver 1 year ago
  
  You do not need a prompt engineer to create: “answer with just "service" or "product"”
  Most classification prompts can be extremely easy and intuitive. The idea you have to hire a completely different prompt engineer is kind of funny. In fact you might be able to get the llm itself to help revise the prompt.
- alexwebb2 1 year ago
  
  All software engineers are (or can be) prompt engineers, at least to the level of trivial jobs like this. It's just an API call and a one-liner instruction. Odds are very good at most companies that they have someone on staff who can knock this out in short order. No specialized hiring required.
  
- IanCal 1 year ago
  
  Prompt engineering is less and less of an issue the simpler the job is and the more powerful the model is. You also don't need someone with deep nlp knowledge to measure and understand the output.
  

vlovich123 1 year ago

That’s the argument the article makes but the reasoning is a little questionable on a few fronts:

- It uses f16 for the data format whereas quantization can reduce the memory burden without a meaningful drop in accuracy, especially as compared with traditional NLP techniques.

- The quality of LLMs typically outperforms OpenCV + NER.

- You can choose to replace just part of the pipeline instead of using the LLM for everything (e.g. using text-only 3B or 1B models to replace the NER model while keeping OpenCV)

- The (LLM compute / quality) / watt is constantly decreasing. Meaning even if it’s too expensive today, the system you’ve spent time building, tuning and maintaining today is quickly becoming obsolete.

- Talking with new grads in NLP programs, all the focus is basically on LLMs.

- The capability + quality out of models / size of model keeps increasing. That means your existing RAM & performance budget keeps absorbing problems that seemed previously out of reach

Now of course traditional techniques are valuable because they can be an important tool in bringing down costs (fixed function accelerator vs general purpose compute), but it’s going to become more niche and specialized with most tasks transitioning to LLMs I think.

The “bitter lesson” paper is really relevant to these kinds of discussions.

vlovich123 1 year ago

Not an independent player so obviously important to be critical of papers like this [1], but it’s claiming a ~10x cost reduction in LLM inference every year. This lines up with the technical papers I’m seeing that are continually improving performance + the related HW improvements.
That’s obviously not sustainable indefinitely, but these kinds of exponentials are precisely why people often draw incorrect conclusions about how long change will take to happen. Just a reminder: CPUs got 2x more performant every 18 months, and for 20 years that kept upending software companies that weren’t in tune with the cycle (i.e. focusing on performance instead of features). For example, even if you’re spending $10k/month on an LLM vs $100/month on NLP to process the 10M items, it can still be more beneficial to go the LLM route, as the expertise to put together an LLM pipeline is cheaper to buy than for the NLP route, which can make up the ~$100k/year difference (assuming the performance otherwise works and the improved quality and robustness of the LLM solution isn’t providing extra revenue to offset).
[1] https://a16z.com/llmflation-llm-inference-cost/

blindriver 1 year ago

That’s sort of like asking a horse and buggy driver whether automobiles are going to put them out of business.

I think for the most part, casual NLP is dead because of LLMs. And LLM costs are going to plummet soon, so the large-scale NLP that you’re talking about is probably dead within 5 years or less. The fact that you can replace programmers with prompts is huge in my opinion, so no one needs to learn an NLP API anymore, just stuff it into a prompt. Once costs to power LLMs decrease to meet the cost of programmers it’s game over.

dartos 1 year ago
> LLM costs
Inference costs, not training costs.
> The fact that you can replace programmers
You can’t… not for any real project. For quick mockups they’re serviceable
> That’s sort of like asking a horse and buggy driver whether automobiles
Kind of an insult to OP, no? Horse and buggy drivers were not highly educated experts in their field.
Maybe take the word of domain experts rather than AI company marketing teams.
- blindriver 1 year ago
  
  > Maybe take the word of domain experts rather than AI company marketing teams.
  Appeal to authority is a well known logical fallacy.
  I know how dead NLP is personally because I’ve never been able to get NLP working but once ChatGPT came around, I was able to classify texts extremely easily. It’s transformational.
  I was able to get ChatGPT to classify posts based on how political they were on a scale of 1 to 10 and which political leaning they had, and then classify the person’s likely political affiliations.
  All of this without needing to learn any APIs or anything about NLP. Sorry, but given my experience, NLP is dead in the water right now, except in terms of cost. And costs will go down exponentially as they always do. Right now I’m waiting for the RTX 5090 so I can just do it myself with an open source LLM.
  
- chaos_emergent 1 year ago
  
  > Inference costs, not training costs.
  Why does training cost matter if you have a general intelligence that can do the task for you, that’s getting cheaper to run the task on?
  > for quick mockups they’re serviceable
  I know multiple startups that use LLMs as their core bread-and-butter intelligence platform instead of tuned but traditional NLP models
  > take the word of domain experts
  I guess? I wouldn’t call myself an expert by any means but I’ve been working on NLP problems for about 5 years. Most people I know in NLP-adjacent fields have converged around LLMs being good for most (but obviously not all) problems.
  > kind of an insult
  Depends on whether you think OP intended to offend, ig
  
- elwebmaster 1 year ago
  
  Reply didn’t say that the expert is uneducated, just that their tool is obsolete. Better look at facts the way they are, sugar coating doesn’t serve anyone.
otabdeveloper4 1 year ago
> The fact that you can replace programmers with prompts
No, you can't. The only thing LLMs replace is internet commentators.
- blindriver 1 year ago
  
  As I explained below, I avoided having to learn anything about ML, PyTorch or any other APIs when trying to classify posts based on how political they were and which affiliation they were. That was holding me back and it was easily replaced by an llm and a prompt. Literally took me minutes what would have taken days or weeks and the results are more than good enough.
  
- portaouflop 1 year ago
  
  No you can’t; LLMs are dog shit at internet banter, too neutered
arandomhuman 1 year ago

>The fact that you can replace programmers with prompts
this is how you end up with 1000s of lines of slop that you have no idea how it functions.

simonw 1 year ago

What NLP approaches are you using to solve the "is '20 bottles of ferric chloride' a service or a product?" problem?

pona-a 1 year ago

How about a naive Bayesian bag of words? Just find/scrape/generate with an LLM a large enough corpus of products/services, build the term frequency matrix, calculate class priors and P(term|class), and run inference with a straightforward application of Bayes' theorem.
This particular problem, at least to me, seems trivial, and to use an LLM for anything like this for more than a hundred cases seems incredibly wasteful.
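The approach pona-a outlines fits in a few dozen lines of standard-library Python. This is a rough sketch, not a tuned implementation: the whitespace tokenizer, the add-one (Laplace) smoothing, and the choice to skip terms never seen in training are all assumptions added here, and the six training examples are invented purely to make the sketch runnable. A real corpus would need thousands of labeled lines.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Naive bag-of-words tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        for text, label in zip(texts, labels):
            self.term_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        terms = [t for t in tokenize(text) if t in self.vocab]  # skip unseen
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log class prior
            for term in terms:
                # log P(term | class) with add-one smoothing
                score += math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy corpus, for illustration only.
texts = [
    "bottles of ferric chloride", "box of steel screws", "laptop computer",
    "alarm installation in two locations", "annual maintenance contract",
    "installation and repair of boilers",
]
labels = ["product"] * 3 + ["service"] * 3
model = NaiveBayes().fit(texts, labels)
```

Prediction is a handful of dictionary lookups and additions per item, which is where the throughput advantage over an LLM call comes from; the cost moves into building and maintaining the labeled corpus.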

devjab 1 year ago

While I agree with both you and the article I also think it'll depend on more than just the volume of your data. We have quite a lot of documents that we classify. It's around 10-100k a month, some rather large others simple invoices. We used to have a couple of AI specialists who handled the classification with local NLP models, but when they left we had to find alternatives. For us this was the AI services in the cloud we use and the result has been a document warehouse which is both easier for the business to manage and a "pipeline" which is much cheaper than having those AI specialists on the payroll.

I imagine this wouldn't be the case if we were to do more classification projects, but we aren't. We did try to find replacements first, but it was impossible for us to attract any talent, which isn't too much of a surprise considering it's mainly maintenance. Using external consultants for that maintenance proved to be almost more expensive than having two full time employees.

bloomingkales 1 year ago

I suspect any solution like that will be wholesale thrown away in a year or two. Unless the damn thing is going to make money in the next 2-3 years, we are all mostly going to write throwaway code.

Things are such an opportunity cost nowadays. It’s like trying to capture value out of a transient amorphous cloud, you can’t hold any of it in your hand but the phenomenon is clearly occurring.

MasterScrat 1 year ago

Can you talk about the main non-LLM NLP tools you use? e.g. BERT models?

> One prompt? Fair. 10? Still ok. 100? You're pushing it. 10M - get help.

Assuming you could do 10M+ LLM calls for this task at trivial cost and time, would you do it? i.e. is the only thing keeping you away from LLM the fact they're currently too cumbersome to use?

gf000 1 year ago

Why not just run a local LLM for practically free? You can even trivially parallelize it with multiple instances.

I would believe that many NLP problems can be easily solved even by smaller LLM models.

scarface_74 1 year ago

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

specproc 1 year ago

I see LLMs best used as part of a more traditional NLP pipeline.

For example, an approach that does me well is clustering then using LLMs on representative docs. Tools like bertopic are great for this.

I also don't see a clear cut difference between the two in certain areas. Embeddings are critical in LLM pipelines, but for me anyway, also "old school" tools.

I think NLP as described in the article is certainly under threat, but the tools and approaches complement LLM use well, are far more efficient, and distinguish the pros from the neophytes.

If you're using LLMs for NLP-type tasks, but don't know the NLP tools, you're missing out.

sireat 1 year ago

So what would you use to classify whether a document is a critique or something else in 1M documents in a non-English language?

This is a real problem I am dealing with at a library project.

Each document is between 100 to 10k tokens.

Most top (read most expensive) LLMs available in OpenRouter work great, it is the cost (and speed) that is the issue.

If I could come up with something locally runnable that would be fantastic.

Presumably BERT based classifiers would work if I had one properly trained for the language.

rahimnathwani 1 year ago

I guess you've already seen https://huggingface.co/collections/answerdotai/modernbert-67... ?

HarHarVeryFunny 1 year ago

10M items @ 10 tokens each ("20 bottles of ferric chloride" etc) plus 10M tokens out (category) is 100M tokens in and 10M tokens out.

Claude Haiku is $0.25 per 1M tokens in, $1.05 per 1M out, so cost would be ~$35.

GPT-4o mini is even cheaper at $0.15 per 1M in.

Of course if your volume justifies the hardware cost you could always run Llama locally, for the cost of the electricity used.

llmsolutions 1 year ago

You can use embeddings to build classification models using various methods. Not sure what qualifies as "get help" level of cost/throughput, but certainly most providers offer large embedding APIs at much lower cost/higher throughput than their completion APIs.

WhitneyLand 1 year ago

For context, 10M would cost ~$27.

Say Gemini Flash 8B, allowing ~28 tokens for prompt input at $0.075/1M tokens, plus 2 output tokens at $0.30/1M. Works out to $0.0000027 per classification, or $0.0027 per thousand. Or in other words, for 1 penny you could do this classification about 3,700 times.

hulitu 1 year ago

That was also my impression. LLMs can "describe" but not classify. Hallucinate but nothing precise.

Kuinox 1 year ago

Prompt caching would lower the cost, and later similar tech would lower the inference cost too. You have less than 25 tokens per item, so that's between $1 and $5.

There may be some use case but I'm not convinced with the one you gave.

minimaxir 1 year ago

So there's a bit of an issue with prompt caching implementations: for both OpenAI API and Claude's API, you need a minimum of 1024 tokens to build the cache for whatever reason. For simple problems, that can be hard to hit and may require padding the system prompt a bit.

crystal_revenge 1 year ago

> LLMs are not feasible when you have a dataset of 10 million items that you need to classify relatively fast and at a reasonable cost.

What? That's simply not true.

Current embedding models are incredibly fast and cheap and will, in the vast majority of NLP tasks, get you far better results than any local set of features you can develop yourself.

I've also done this at work numerous times, and have been working on various NLP tasks for over a decade now. For all future traditional NLP tasks the first pass is going to be to fetch LLM embeddings and stick on a fairly simple classification model.

> One prompt? Fair. 10? Still ok. 100? You're pushing it. 10M - get help.

"Prompting" is not how you use LLMs for classification tasks. Sure you can build 0-shot classifiers for some tricky tasks, but if you're doing classification for documents today and you're not starting with an embedding model you're missing some easy gains.
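As one possible reading of "embeddings plus a fairly simple classification model", here is a sketch of a nearest-centroid classifier over precomputed embedding vectors. The comment does not say which classifier is meant, so nearest-centroid is a stand-in chosen for brevity; logistic regression on the same vectors would be a common alternative. The vectors are assumed to come from some embedding model or API, which is not shown, and the two-dimensional toy vectors below are invented.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(vectors, labels):
    """Average the embedding vectors of each class into one centroid."""
    grouped = defaultdict(list)
    for vector, label in zip(vectors, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is most similar to `vector`."""
    return max(centroids, key=lambda label: cosine(centroids[label], vector))


# Toy stand-ins for real embeddings (which would have hundreds of dimensions).
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["product", "product", "service", "service"]
centroids = fit_centroids(train_vectors, train_labels)
```

In this setup the embedding call is the only part that touches a large model, and it runs once per item with no generation step, which is the source of the speed and cost advantage claimed above.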

anon373839 1 year ago

Embedding models are not LLMs in the sense that the term is being used in the title of this post. They are “traditional NLP.”

fud101 1 year ago

Can you recommend a way to classify a small number of objects? Local only and Python preferably.

diggan 1 year ago

So TLDR: You agree with the author, but not for the same reasons?