Comment by agnosticmantis

6 months ago

“… we have a verbal agreement that these materials will not be used in model training”

Ha ha ha. Even written agreements are routinely violated as long as the potential upside > downside, and all you have is a verbal agreement? And you didn’t disclose this?

At the time o3 was released I wrote “this is so impressive that it brings out the pessimist in me”[0], thinking perhaps they were routing API calls to human workers.

Now we see in reality I should’ve been more cynical, as they had access to the benchmark data but verbally agreed (wink wink) not to train on it.

[0: https://news.ycombinator.com/threads?id=agnosticmantis#42476... ]

You can still game a test set without training on it; that’s why you usually have a validation set and a test set that you ideally seldom use. Routinely running evaluations on the test set can get the humans in the loop to overfit the data.
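
To make that concrete, here is a minimal hypothetical sketch (scikit-learn on synthetic data, not anything from the thread) of how the humans in the loop can overfit a test set even though no training step ever sees it:

    # Illustrative only: "gaming" a test set without training on it.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Leaky protocol: pick the regularisation strength that looks best on the TEST set.
    best_c, best_score = None, -1.0
    for c in [0.001, 0.01, 0.1, 1, 10, 100]:
        model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
        score = model.score(X_test, y_test)  # peeking at the test set every round
        if score > best_score:
            best_c, best_score = c, score

    # best_score is now an optimistic estimate: the test set has quietly become a
    # validation set. The honest protocol tunes on a separate validation split and
    # touches the test set exactly once, at the very end.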

OpenAI doesn't respect copyright, so why would they let a verbal agreement get in the way of billion$?

  • Can someone explain to me how they can simply not respect copyright and get away with it? Also, is this a uniquely OpenAI problem, or is it also true of the other LLM makers?

    • Their argument is that using copyrighted data for training is transformative, and therefore a form of fair use. There are a number of ongoing lawsuits related to this issue, but so far the AI companies seem to be mostly winning. Eg. https://www.reuters.com/legal/litigation/openai-gets-partial...

      Some artists also tried to sue Stable Diffusion in Andersen v. Stability AI, and so far it looks like it's not going anywhere.

      In the long run I bet we will see licensing deals between the big AI players and the large copyright holders to throw a bit of money their way, in order to make it difficult for new entrants to get training data. Eg. Reddit locking down API access and selling their data to Google.

    • The FSF funded some white papers a while ago on Copilot: https://www.fsf.org/news/publication-of-the-fsf-funded-white.... Take a look at the analysis by two academics versed in law at https://www.fsf.org/licensing/copilot/copyright-implications..., starting with §II.B, which explains why it might be legal.

      Bradley Kuhn also has a differing opinion in another whitepaper there (https://www.fsf.org/licensing/copilot/if-software-is-my-copi...), but then again he studied CS, not law. Nor, AFAIK, has the FSF attempted to file any suits, even though they likely would have if it were an open-and-shut case.

    • A lot of people want AI training to be in breach of copyright somehow, to the point of ignoring the likely outcomes if that were made law. Copyright law is their big cudgel for removing the thing they hate.

      However, while it isn't fully settled yet, at the moment it does not appear to be the case.

    • "When I was a kid, I was praying to a god for bicycle. But then I realized that god doesn't work this way, so I stole a bicycle and prayed to a god for forgiveness." (c)

      Basically a heist too big and too fast to react to. Now every impotent lawmaker in the world is afraid to call them what they are, because it would inflict on them the wrath of both the other IT corpos and of regular users, who will refuse to part with a toy they now feel entitled to.

    • Simply put, if the model isn’t producing an actual copy, they aren’t violating copyright (in the US) under any current definition.

      As much as people bandy the term around, copyright has never applied to input, and the output of a tool is the responsibility of the end user.

      If I use a copy machine to reproduce your copyrighted work, I am responsible for that infringement not Xerox.

      If I coax your copyrighted work out of my phone’s keyboard suggestion engine letter by letter, and publish it, it’s still me infringing on your copyright, not Apple.

      If I make a copy of your clip art in Illustrator, is Adobe responsible? Etc.

      Even if (as I’ve seen argued ad nauseam) a model was trained on copyrighted works from a piracy website, the copyright holder’s tort would be with the source of the infringing distribution, not the people who read the material.

      Not to mention, I can walk into any public library and learn something from any book there, would I then owe the authors of the books I learned from a fee to apply that knowledge?

    • > Can someone explain to me how they can simply not respect copyright and get away with it? Also, is this a uniquely OpenAI problem, or is it also true of the other LLM makers?

      Uber showed the way. They initially operated illegally in many cities but moved so quickly as to capture the market and then they would tell the city that they need to be worked with because people love their service.

      https://www.theguardian.com/news/2022/jul/10/uber-files-leak...

    • The short answer is that there are actually a number of active lawsuits alleging copyright violation, but they take time (years) to resolve. And since it's only been about two years since we've had the big generative AI blow-up, fueled by entities with deep pockets (i.e., you can actually profit off of the lawsuit), there quite literally hasn't been enough time for a lawsuit to find them in violation of copyright.

      And quite frankly, between the announcement of several licensing deals in the past year for new copyrighted content for training, and the recent decision in Warhol "clarifying" the definition of "transformative" for the purposes of fair use, the likelihood of AI training being found to be fair use is actually quite slim.

    • You'll find people on this forum especially using the false analogy with a human. Like these things are like or analogous to human minds, and human minds have fair use access, so why shouldn't these?

      Magical thinking that just so happens to make lots of $$. And after all why would you want to get in the way of profit^H^H^Hgress?

    • I wonder if Google can sue them for downloading the YouTube videos plus automatically generated transcripts in order to train their models.

      And if Google could enforce removal of this content from their training set and enforce a "rebuild" of a model which does not contain this data.

      Billion-dollar lawsuits.

    • “There must be in-groups whom the law protects but does not bind, alongside out-groups whom the law binds but does not protect.”

    • It's because copyright is fake, and the only thing supporting it was million-dollar businesses. It naturally crumbles when facing billion-dollar businesses.

  • Why do HN commenters want OpenAI to be considered in violation of copyright here? Ok, so imagine you get your wish. Now all the big tech companies enter into billion dollar contracts with each other along with more traditional companies to get access to training data. So we close off the possibility of open development of AI even further. Every tech company with user-generated content over the last 20 years or so is sitting on a treasure trove now.

    I’d prefer we go the other direction where something like archive.org archives all publicly accessible content and the government manages this, keeps it up-to-date, and gives cheap access to all of the data to anyone on request. That’s much more “democratizing” than further locking down training data to big companies.

OpenAI's benchmark results looking like Musk's Path of Exile character...

This has me curious about ARC-AGI.

Would it have been possible for OpenAI to have gamed ARC-AGI by seeing the first few examples, quickly mechanical-turking a training set, fine-tuning their model, and then proceeding with the rest of the evaluation?

Are there other tricks they could have pulled?

It feels like unless a model is being deployed to an impartial evaluator's completely air gapped machine, there's a ton of room for shenanigans, dishonesty, and outright cheating.

  • > This has me curious about ARC-AGI

    In the o3 announcement video, the president of ARC Prize said they'd be partnering with OpenAI to develop the next benchmark.

    > mechanical turking a training set, fine tuning their model

    You don't need mechanical turking here. You can use an LLM to generate a lot more data that's similar to the official training data, and then you can train on that. It sounds like "pulling yourself up by your bootstraps", but it isn't. An approach to do this has been published, and it seems to scale very well with the amount of such generated training data (the authors won the first paper award).
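
    As a rough sketch of what that bootstrapping loop could look like (hypothetical, with placeholder function names rather than the published method):

      # Hypothetical sketch: expand a few official training tasks into many
      # synthetic variants with an LLM, then fine-tune only on the synthetic set.
      # generate_variants and fine_tune are placeholders, not a real API.
      def augment(official_tasks, generate_variants, n_per_task=100):
          synthetic = []
          for task in official_tasks:
              # generate_variants is assumed to prompt an LLM to rewrite the task
              # (new grids, colours, sizes) while preserving the underlying rule.
              synthetic.extend(generate_variants(task, n=n_per_task))
          return synthetic

      def bootstrap_finetune(model, official_tasks, generate_variants, fine_tune):
          synthetic = augment(official_tasks, generate_variants)
          # Gradient updates only ever see the synthetic data; the official
          # evaluation tasks themselves are never trained on.
          return fine_tune(model, synthetic)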

    • I know nothing about LLM training, but do you mean there is a solution to the issue of LLMs gaslighting each other? Sure, this is a proven way of getting training data, but you cannot get theorems and axioms right by generating different versions of them.

  • > OpenAI to have gamed ARC-AGI by seeing the first few examples

    Not just a few examples. o3 was evaluated on the "semi-private" test set, which was previously used for evaluating OAI models, so OAI already had access to it for a long time.

  • In their benchmark, they have a "tuned" tag attached to their o3 result. I guess we need them to tell us exactly what it means before we can gauge anything.

Why would they use the materials in model training? It would defeat the purpose of having a benchmarking set.

  • Compare:

    "O3 performs spectacularly on a very hard dataset that was independently developed and that OpenAI does not have access to."

    "O3 performs spectacularly on a very hard dataset that was developed for OpenAI and that only OpenAI has access to."

    Or let's put it another way: If what they care about is benchmark integrity, what reason would they have for demanding access to the benchmark dataset and hiding the fact that they finance it? The obvious thing to do if integrity is your goal is to fund it, declare that you will not touch it, and be transparent about it.

  • If you’re a research lab then yes.

    If you’re a for-profit company trying to raise funding and fend off skepticism that your models really aren’t that much better than anyone else’s, then…

    It would be dishonest, but as long as no one found out until after you closed your funding round, there’s plenty of reason you might do this.

    It comes down to caring about benchmarks and integrity or caring about piles of money.

    Judge for yourself which one they chose.

    Perhaps they didn’t train on it.

    Who knows?

    It’s fair to be skeptical though, under the circumstances.

    • Six months ago it would have been unimaginable for them to do anything that might harm the quality of the product, but I’m trusting OpenAI less and less.

> perhaps they were routing API calls to human workers

Honest question, did they?

  • How would that even work? Aren’t the API responses just as fast as the web interface? Can any human write a response at the speed of an LLM?

    • No, but a human can solve a problem that an LLM can’t, and then an LLM can generate a response to the original prompt that includes the solution found by the human.
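
      Purely as an illustration of that hypothetical routing (invented placeholder functions, not a claim about how any real API works):

        # looks_hard, human_solve and llm_generate are invented placeholders.
        def answer(prompt, looks_hard, human_solve, llm_generate):
            if looks_hard(prompt):
                solution = human_solve(prompt)  # slow, human-produced result
                # The LLM only verbalises the human's solution, so the final
                # text still reads like ordinary model output.
                return llm_generate(
                    f"Explain this solution to the user.\n"
                    f"Problem: {prompt}\nSolution: {solution}")
            return llm_generate(prompt)  # easy case: a pure model answer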

verbal agreement ... that's just saying that you're a little dumb, or you're playing dumb because you're in on it.