Comment by potatolicious

1 day ago

> "I'm particularly annoyed by using LLMs to evaluate the output of LLMs."

+1, and IMO part of a general trend where we're just not serious about making sure this shit works. Higher scores make stonks go up, who cares if it actually leads to reliably working products.

But more importantly, it's starting to expose the fact that we haven't solved one of ML's core challenges: data collection and curation. On the training side we've sidestepped this somewhat (by ingesting the whole internet, for example), but on the eval side it feels like we're increasingly just going "actually constructing rigorous evaluation data, especially at this scale, would be very expensive... so let's not".
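To make the contrast concrete, here's a rough sketch of the two approaches (all the names here are made up, `model_answer` and `judge_llm` are stand-in callables, not any real API):

    # Approach 1: score against a curated, human-checked gold set.
    labeled_evals = [
        {"question": "What year did X happen?", "gold": "1969"},
        # ... hundreds more, curated and spot-checked by humans
    ]

    def accuracy_against_gold(model_answer):
        hits = sum(
            model_answer(ex["question"]).strip() == ex["gold"]
            for ex in labeled_evals
        )
        return hits / len(labeled_evals)

    # Approach 2: ask another LLM to grade. The failure mode is that the
    # judge can share blind spots with the model under test, so its scores
    # may be systematically miscalibrated, and with no ground truth in the
    # loop there is nothing to catch that.
    def llm_judge_score(model_answer, judge_llm):
        votes = [
            judge_llm(f"Is this answer correct? {model_answer(ex['question'])}")
            for ex in labeled_evals
        ]
        return sum(v == "yes" for v in votes) / len(votes)

The first approach costs real money and labeling effort; the second is nearly free, which is exactly why it's winning.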

I was at a local tech meetup recently where a recruiting firm was proudly showing off the LLM-based system they're using to screen candidates. They... did not evaluate the end-to-end efficacy of their system. At all. This seems like a theme within our industry - we're deploying these systems based purely on vibes without any real quantification of efficacy.

Or in this case, we're quantifying efficacy... poorly.
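And the missing evaluation isn't even hard to describe. A rough sketch, with made-up names, of the end-to-end check you'd want before shipping a screener like that:

    # Hypothetical: compare the screener's pass/reject calls to human
    # decisions on a held-out labeled sample. `screener` is a stand-in
    # for whatever the LLM pipeline does end to end.
    def screening_metrics(screener, labeled_candidates):
        """labeled_candidates: list of (resume_text, human_said_pass) pairs."""
        tp = fp = fn = tn = 0
        for resume, human_pass in labeled_candidates:
            model_pass = screener(resume)
            if model_pass and human_pass:
                tp += 1
            elif model_pass and not human_pass:
                fp += 1
            elif not model_pass and human_pass:
                fn += 1  # qualified candidates silently dropped
            else:
                tn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return {"precision": precision, "recall": recall}

A weekend of labeling and twenty lines of code, and they'd at least know their false-negative rate. They didn't do it.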

> +1, and IMO part of a general trend where we're just not serious about making sure this shit works.

I suspect quite a lot of the industry is actively _opposed_ to that, because it could be damaging to the "this changes everything" narrative.