Comment by jsnell

3 years ago

There is basically no chance the captchas are actually being used for generating training data at this point. The puzzles have not changed for ages. Like, five years? How many billions (trillions?) of labels do you think they have for buses and traffic lights at this point?

If there were economic value in using captcha solutions for labeling, somebody would be rotating novel tasks into the mix. But they don't seem to be.

(And if the goal of running the service was to generate labels, they would not have built solutions to make it possible to pass the captchas without a puzzle, like recaptcha v3.)

So rest assured, your work in solving the captchas is totally useless, just like you wanted!

Lately I've seen captchas that ask me to identify things in images that are clearly generated with AI. Like frogs without backs.

I think at this point it is clear we are not training image recognition so much as providing them with free scoring for their image generation algorithms.

  • Just to be clear, are you saying that you saw the "frogs without backs" puzzle on Recaptcha? Because I definitely have not seen anything but the streetview images there for ages.

    Now, if it is a captcha provider whose advertised business model is to sell access to the users for labeling and split the profits with the website that integrates their captcha (e.g. hCaptcha), then I can believe somebody would submit an image generation eval dataset. But that seems irrelevant to the discussion of whether solving a Recaptcha is free work.

    • I mostly see streetview stuff, but twice I've seen one that had stuff like "which one is a ladybug" or even "which one looks like a blah without a blah".

      This prompt of course could still be for classification of images.

      But then the "ladybugs" were often heavily distorted, to the point where they morphed into other animals or into the background. It did not seem possible that they were photos, but I could be wrong. The prompts were very odd.

I would rather guess that the emperor has no clothes, i.e. AI is still so bad that it needs insane amounts of training data and hasn't got enough yet.

  • That's a fun guess? Do you flip a coin every time you make a decision or...

    I can assure you that if CLIP and ALIGN exist, there is objectively no reason for them to collect what would amount to a dataset for...solving Google CAPTCHAs? Which I'm pretty sure is a solved problem even without the data.

It is just continuous QA for Waymo. Measures how well existing ML is working in the real world.

  • That's a great example of something that would require them to mix in novel tasks, rather than recycling the stale traffic light detection puzzles. Because let's be honest, detecting traffic lights is nowhere close to the hard part of self-driving cars. Knowing how well you can do it tells you nothing about how well you can solve the actual difficult things.

    What would a task that uses humans to solve that problem actually look like? I'm guessing it would need to be short videos, not images. And look for things with some ambiguity. "Select any videos where a pedestrian looks like they intend to cross the street".

    • I’m guessing Waymo has very little influence over Google.

      Waymo: “I see you have that hammer, we have a use case for it”

      Google: “Ok, but we aren’t changing the hammer or how hard it is to use it”