Comment by chrisdbanks
2 years ago
How much has changed since 2015. In NLP and ML it used to be so hard to create high-quality datasets: rubbish in, rubbish out. Now LLMs have largely solved that problem. It seems that if you put enough data into a big enough model, one emergent ability is discerning the wheat from the chaff. Certainly in the NLP space, the days of crowdsourced datasets seem to be over, replaced by few-shot learning. So much value has been unlocked.
There's an interesting dark side to this as well: in 2023, when you think you are crowdsourcing data, you may actually just be outsourcing it to ChatGPT. A lot of turkers simply turn around and use an LLM!
Which is of course absolutely terrible for the quality of the dataset you're trying to produce.
I think the author's point broadly still holds -- you can get further with more engineering resources and data, whether you're using 2015-era models or 2023 retrieval-augmented LLMs and fine-tuning. It's just that now you can accomplish a lot more, much faster, with a ChatGPT prompt.
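To make the few-shot point concrete, here's a minimal sketch of labeling data with an LLM instead of crowdsourcing it. Everything here is illustrative: `call_llm` is a hypothetical stand-in for whatever chat-completion client you use, and the examples are made up; only the prompt construction is shown concretely.

```python
# Hypothetical few-shot labeling sketch. `call_llm(prompt) -> str` is a
# placeholder for a real provider's API; swap in your client of choice.

FEW_SHOT_EXAMPLES = [
    ("The battery died after two days.", "negative"),
    ("Setup was quick and painless.", "positive"),
]

def build_prompt(text: str) -> str:
    """Assemble a few-shot classification prompt from labeled examples."""
    lines = ["Classify each review as positive or negative.", ""]
    for review, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {review}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Review: {text}")
    lines.append("Label:")
    return "\n".join(lines)

def label_with_llm(text: str, call_llm) -> str:
    """Label one example; normalize whatever the model returns."""
    return call_llm(build_prompt(text)).strip().lower()
```

In practice you'd batch these calls and hand-check a sample of the outputs, since (per the comment above) model-generated labels can be wrong in systematic ways just like turker-generated ones.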