Comment by anotherpaulg

2 years ago

I think there's a new approach for “How do you get the data?” that wasn't available when this article was written in 2015. The new text and image generative models can now be used to synthesize training datasets.

I was working on a typing autocorrect project and needed a corpus of "text messages". Most of the traditional NLP corpora like those available through NLTK [0] aren't suitable. But it was easy to script ChatGPT to generate thousands of believable text messages by throwing random topics at it.
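The scripting part is simple in practice. Here's a minimal sketch of the topic-randomization idea; the topic list and helper names are made up for illustration, and the actual API call is left as a comment since it needs a key:

```python
import random

# Hypothetical topic pool; in practice you'd want a much larger list.
TOPICS = ["weekend plans", "running late", "grocery run", "movie night"]

def build_prompt(topic: str, n: int = 20) -> str:
    """Build a prompt asking the model for n casual text messages about a topic."""
    return (
        f"Write {n} short, casual text messages between friends about "
        f"{topic}. One message per line, no numbering."
    )

def sample_prompts(k: int, seed: int = 0) -> list[str]:
    """Draw k randomized prompts so the corpus doesn't collapse onto one topic."""
    rng = random.Random(seed)
    return [build_prompt(rng.choice(TOPICS)) for _ in range(k)]

# Each prompt would then be sent to the chat API, e.g. with the openai
# client: client.chat.completions.create(model=..., messages=[{"role":
# "user", "content": prompt}]) -- omitted here since it needs an API key.
prompts = sample_prompts(3)
```

Randomizing the topic per request is the cheap trick that keeps the synthetic corpus from being thousands of near-duplicates.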

Similarly, you can synthesize a training dataset by giving GPT the outputs/labels and asking it to generate a variety of inputs. For sentiment analysis... "Give me 1000 negative movie reviews" and "Now give me 1000 positive movie reviews".
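In code, that label-conditioned synthesis amounts to one prompt template per label plus a step that turns the completions into (text, label) pairs. A sketch, with illustrative function names and fake completions standing in for real API output:

```python
LABELS = ["positive", "negative"]

def synthesis_prompt(label: str, n: int = 50) -> str:
    """Fix the label, ask the model for matching inputs."""
    return (
        f"Give me {n} {label} movie reviews, one per line. "
        f"Vary length, style, and subject matter."
    )

def label_dataset(raw_completions: dict[str, str]) -> list[tuple[str, str]]:
    """Turn {label: newline-separated completions} into (text, label) pairs."""
    pairs = []
    for label, text in raw_completions.items():
        for line in text.splitlines():
            line = line.strip()
            if line:
                pairs.append((line, label))
    return pairs

# With real API output this yields a ready-to-train sentiment dataset:
fake = {"positive": "Loved it!\nA triumph.", "negative": "Dull."}
dataset = label_dataset(fake)
```

Because the label is fixed up front, you get labeled training data without any annotation pass.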

The Alpaca folks used GPT-3 to generate high-quality instruction-following datasets [1] based on a small set of human samples.

Etc.

[0] https://www.nltk.org/nltk_data/

[1] https://crfm.stanford.edu/2023/03/13/alpaca.html

An interesting question is, if you can get ChatGPT to generate high quality data for you, should you just cut out the middle-model and be using ChatGPT as your classifier?

The answer probably depends a lot on your specific problem domain and constraints, but a non-trivial amount of the time the answer will be that your task could be solved by a wrapper around the ChatGPT API.
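Such a wrapper can be a few dozen lines. A hedged sketch: the prompt and `parse_label` helper are illustrative, and the one real API call is shown only as a comment:

```python
def classify_prompt(text: str) -> str:
    """Prompt that constrains the model to a single-word label."""
    return (
        "Classify the sentiment of this movie review as exactly one word, "
        f"'positive' or 'negative':\n\n{text}"
    )

def parse_label(reply: str) -> str:
    """Normalize a free-form model reply into one of the two labels."""
    reply = reply.strip().lower()
    if "positive" in reply:
        return "positive"
    if "negative" in reply:
        return "negative"
    return "unknown"

# The middle step is a single chat-completion call, e.g. with the openai
# client:
#   reply = client.chat.completions.create(
#       model="gpt-3.5-turbo",
#       messages=[{"role": "user", "content": classify_prompt(review)}],
#   ).choices[0].message.content
# parse_label(reply) is then your classification.
```

The parsing step matters more than it looks: models often reply "Positive." or add a sentence of justification, so you want to normalize rather than compare strings exactly.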

  • You definitely can use LLMs to do your modeling. But sometimes you need very fast, cheap, and smaller models instead. Also there's research out there showing that using LLM to generate training data for targeted & specific models may result in better performance.

  • >should you just cut out the middle-model and be using ChatGPT as your classifier?

    Oh you certainly could.

    See here: GPT-3.5 outperforming elite crowdworkers on MTurk for text annotation: https://arxiv.org/abs/2303.15056

    GPT-4 going toe to toe with experts (and significantly outperforming crowdworkers) on NLP tasks:

    https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...

    I guess it will take some time before the reality really sinks in, but the days of the artificial SOTA being obviously behind human efforts for NLP have come and gone.

  • > should you just cut out the middle-model and be using ChatGPT as your classifier?

    And hope OpenAI forever provides the service, and at a reasonable price, latency, and volume?

    • > And hope OpenAI forever provides the service, and at a reasonable price, latency, and volume?

      They are enjoying being the market leader for now, but OpenAI will soon face real competition, and LLM services will become a commodity product. That must be partly why they sought Microsoft's backing: to be part of "big tech".

This is a very bad idea for image models. They pick up and amplify imperceptible distortions in images no human reviewer would catch... Not to speak of big ones when the output is straight up erroneous.

This may apply to text too.

Partial or fully synthetic data is OK when finetuning existing LLMs. I personally discovered it's not OK for finetuning ESRGAN. Not sure about diffusion models.

  • > Not sure about diffusion models.

    Diffusion models are still approximate density estimators, not explicit ones. They lose information because you don't have a unique mapping to the subsequent step. You've got to think about the relationship between your image and its preimage.

    So while they match the data distribution better than GANs do, they still aren't reliable for dataset synthesis. They're just less bad at it than GANs (GANs are very mean-focused, which is why we got such high-quality images from them but also saw huge diversity issues and amplification of biases).

  • > Not sure about diffusion models.

    Human-curated synthetic data is commonly used in finetuning (or LoRa-training) for SD. I doubt that uncurated synthetic data would be very usable. There might be use cases where curating synthetic data with some kind of vision model would be valuable, but my intuition would be that it would be largely hit-or-miss and hard to predict.

> The new text and image generative models can now be used to synthesize training datasets.

No. Just no. Dear god, no.

This isn't too different from GPT-4 grading itself (looking at you, MIT math problems)!

Current models don't accurately estimate the probability distribution of the data, so they can't be relied on for dataset synthesis. Yes, synthesis can help, but typically it doesn't, because these models generate the highest-likelihood data, which is already abundant. Getting non-mean data is the difficult part, and without good density estimation you can't do that reliably. Explicit density estimation networks are rather unpopular and haven't received nearly as much funding or research. I'd highly recommend the area, though I'm biased, because this is what I work on (explicit density estimation and generative modeling).
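You can see the tail problem even in a toy setting: fit a single Gaussian (a crude stand-in for a mean-focused model) to heavy-tailed data, and the tail mass it assigns collapses. A stdlib-only sketch, with an arbitrary mixture and threshold:

```python
import math
import random
import statistics

rng = random.Random(0)

# Heavy-tailed "true" data: mostly N(0,1), occasionally N(0,5).
data = [rng.gauss(0, 5 if rng.random() < 0.05 else 1) for _ in range(50_000)]

# Fit a single Gaussian to the data by matching mean and variance.
mu, sigma = statistics.fmean(data), statistics.pstdev(data)

def gauss_tail(t: float, sigma: float) -> float:
    """P(|X - mu| > t) under the fitted N(mu, sigma^2)."""
    return math.erfc(t / (sigma * math.sqrt(2)))

threshold = 4.0
empirical_tail = sum(abs(x - mu) > threshold for x in data) / len(data)
fitted_tail = gauss_tail(threshold, sigma)

# The fitted model assigns several times less mass to the tails than the
# data actually has, so sampling from it would underproduce exactly the
# rare, non-mean cases you need most.
print(empirical_tail, fitted_tail)
```

Anything trained on samples from the fitted model inherits that shrunken tail, which is the degradation the parent comment is warning about.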

Sampling an AI output when the distribution you want is human data is incredibly stupid.

  • I don't think it is. The distribution of an AI model that was trained on such a huge amount of movie reviews is very close to the human distribution.

    At least that's true around the mean. If your application needs to handle long-tail cases, an LLM won't easily give you that. But depending on the application, that may not be necessary. So yeah, sometimes this is a bad idea, but for many applications it may be just fine.

It's funny, for my lil startup, "How do you get the data" is now _less_ tech than ever. I pay an hourly wage to a human to generate/transcribe it. This method is both much more cost effective and scalable than tech-enabled alternatives.

> The new text and image generative models can now be used to synthesize training datasets.

Only with heavy curation [0], otherwise your new models will be trained on progressively worse data than earlier models.