Comment by onetrickwolf

6 hours ago

“Distillation attack”? Are we joking here?

If anything these models should be compelled to be public since they have been trained off public data. What an absurd overreach to call this an attack.

It’s clear they are scapegoating national security and China at this point to build an anti-competitive moat.

I generally really like Anthropic’s work and models but stuff like this scares me for the future. We are positioning these companies to have too much power. The public’s life is getting worse while these companies consolidate power using data they stole from the public.

> If anything these models should be compelled to be public since they have been trained off public data

I'm starting to come around to this idea TBH. For a while my position was: "these companies have invested billions into training these models, therefore they should be able to control them and profit off them" but looking deeper at where they got their training data, my view is starting to shift.

IMHO I feel like we need new laws around AI, specifically training data. Something like: "you can train an AI model and ignore copyright laws, BUT you must then make the model open weight". A company could still develop closed weight models, but then it would have to acquire permission to use the training data.

But it gets murky because if something like that was on the books then AI labs would just train open weight models and then distill them into their closed weight models.

  • Labs each invest multiple billions of dollars a year in private data, and that number is growing. Internet training data is not where frontier capabilities come from; this view is outdated.

    • This is a misleading statement. The "private data" is still largely publicly produced data that has been curated through private agreements instead of scraping, such as Reddit posts/comments (these are the "third-party data agreements" that companies like OpenAI mention). And yes, there is still a lot of processing done on this data, which is the norm for preparing training data.

    • When did they start doing so? We all know that they DID train on all the available public information, so at what point did they stop? Is the public information still in the training set? If so, they should STILL release ALL the data as public, as they are including training data that was acquired without permission.

    • > internet training data is not where frontier capabilities come from

      In that case, it should be no problem for the labs to train their new models without using public data, right?

    • Define "come from". Could they have gotten those frontier capabilities, or any capabilities, without internet training data? It seems to me that without the private data, you might get a slightly less competitive model, but without the CommonCrawl-style data piles used in "pretraining", you get no model at all.

      Even accepting the copying-as-theft framing, if I go to a village, steal some vegetables from everyone's gardens and ham from their sheds, and then add some prohibitively expensive spices I bought myself to make soup, do I get to claim it as mine and punish the villagers for trying to take it?

    • > internet training data is not where frontier capabilities come from

      We 100% would not be at the current progress without it, though. And it's not like they only train on this once. They keep training on all the internet data PLUS the private data. Private data only (probably) wouldn't work, as learning the base regularities of language takes a lot of weights.

    • Does this private data come from places like Reddit, Twitter, etc., where it’s contributed by users? I think it is unethical for these companies to accept payment for user-contributed data.

    • Okay that's fine, then make the law say they must provide publicly owned models off of publicly obtained data. To think that such a baseline of critical information isn't the literal foundation of everything they will do, both now and in the future, is just exposing what their end game is: control.

      There's no reason not to, other than the poor little billion-dollar corporations not wanting to provide a public utility they stole from the public.

      Anything that removes control from American big tech is a good thing for American citizens and the world writ large.

    • No, you're talking about fine tuning and most of it is coming from your customers or someone else's. Get off ya high horse.

      Copyright needs abolishing.

      Companies can't be trusted with society's need for open progress.

  • I'm not taking sides here but this situation is not so black and white and it has always been the darker side of capitalism.

    The concept of Intellectual property exists not because it's fair but because it creates incentive to make said "intellectual property" exist. If intellectual property can be instantly copied by a competitor... why would I spend a dime to even create such a thing? I want to profit off of what I make because I'm a capitalist and money is what drives me (as a capitalist).

    Anthropic's models wouldn't exist if the company couldn't keep an unholy grip on them. Same with OpenAI. Same with many life-saving drugs.

    Of course everyone here is talking about the obvious stuff, like how it's morally wrong to withhold life-saving drugs or to have AI literally take over the world and be under the control of one company, and all of this is true. But it is also true that greed is the engine that drives our economy, and if you want our economy to produce "intellectual property" you must allow people to "capitalize" on that greed.

    There are two controversial issues here. What is moral/fair? And what is realistically practical in optimizing the economy if said economy is based on money.

    The distillation, in my mind, is a win for practicality because competition also drives our economic engine. You don't want a monopoly, but you also don't want these models to be so damn open that there's zero incentive to make them.

    • That intellectual property argument goes both ways. The model might not exist without protection, but it also would not exist without the data.

    • This perfectly explains why current LLMs should be illegal in an actual capitalist market.

      Why should anyone publish anything if it can be stolen with impunity? Is the value of these LLMs even remotely close to the amount of value they stole and the amount of value they will take away from the economy because people will be more hesitant to publish anything now?

The core of the training data is public, but the part that actually makes these models smart came from (pretty highly-paid) experts via platforms like Mercor. Claude didn't magically learn to write good code by reading all of GitHub - humans trained it in that, more or less manually.

  • If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist (*including the curated material)? Can we do the same with movies? Books?

    /edit Added a note to make it more obvious that the material is included in the playlist, just like the material is incorporated as part of curated AI models.

    • >> If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist?

      If the contract was "work-for-hire" then yes, of course I can.

  • Given the breadth of LLM knowledge, I somehow doubt this. Sure, it’s probably responsible for the quality of LLM insights, but I don’t think anyone was asking experts about e.g. the complex ecological effects of invasive zebra mussels and their provenance in Lake Michigan.

  • No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And they probably use Claude data too, with a human in the loop and tool feedback.

  • ...and the rest of the training data (ie. the entire corpus of copyrighted works) was not written by experts expecting compensation? Double standards.

> If anything these models should be compelled to be public since they have been trained off public data. What an absurd overreach to call this an attack.

> It’s clear they are scapegoating national security and China at this point to build an anti-competitive moat.

If all that is required to train these models is public data, why can't Alibaba just use that?

The fact that Alibaba has to resort to scraping Claude suggests there already is a moat...

  • This feels more nuanced than you are giving it credit for? Much of the training data that was available has been withdrawn. At least for OpenAI, we know that much of the training data was garnered through less than above-board methods.

Should Google search index be forced to be public too?

  • Honestly, yes it should in some form. If their index contains the actual data from the sites, and they are making that information public in one way or another, then it should be available as a downloadable dataset.

It's mainly just a lot cheaper. Copying is always cheaper anyway, since it needs very little R&D, AI or no AI.

> If anything these models should be compelled to be public since they have been trained off public data.

Isn't that a bit like saying if you read books in a public library to pick up a new skill you should work for free?

> What an absurd overreach to call this an attack.

Would it be an attack to take your meal by force if you used a public recipe to prepare the meal?

  • > Isn't that a bit like saying if you read books in a public library to pick up a new skill you should work for free?

    Only if you’re trying to muddy the waters. No, obviously it’s not. One can also support licensing for driving a car on public roads but not for walking, even though both involve traveling. This is only confusing to people pretending to be confused, for effect.

    > Would it be an attack to take your meal by force if you used a public recipe to prepare the meal?

    “You wouldn’t download a car…” (unless it worked like copying an MP3, then, of course, you would, everyone would)

    It’s as if you’re using terrible analogies and comparisons because stronger ones don’t exist. Great news for the AI-should-be-open crowd.

    • I think the analogies are appropriate. Anthropic took public data and added value on top of it. It is that added value that Alibaba is targeting. If it was the underlying data, that's freely available.

> It’s clear they are scapegoating national security and China at this point to build an anti-competitive moat.

They are also fear-mongering (and getting shills to do so as well) about the idea that once open weight (Chinese) models catch up to Mythos we're all doomed. Maybe I'd be a bit less cynical if they weren't prepping for an IPO?

Wasn't OpenAI spreading similar FUD back when GPT-2 came out?

Guys... AGI is right around the corner. Pinky swear. Now buy our stock.

Keep in mind that the entire US economy is currently propped up by AI spending, so a lot of people (banks, government) are incentivized to make sure these companies succeed. Expect this propaganda to ratchet up a notch if / when the economy starts to nose dive.

  • Yes. They're turning on the consent manufacturing machine to make it an issue of "national security" to download some gguf file from Hugging Face. Absolutely disgusting.

There's probably a 10-15% chance of a war between the US and China over the next 10 years. Maybe a better-than-even chance of a militarized crisis that could have led to war but somehow de-escalates.

Regardless of how sad late stage capitalism makes you, or how outrageous one claims to find "hypocrisy", any national security argument about limiting Chinese AI capability stands on its own, at least for nations likely to be drawn into a war.

Also, all the local model enthusiasts who assume Chinese firms are going to be allowed to endlessly release models, even ones with the disruptive potential attributed to Mythos, are probably in for a rude awakening. Just because the PRC is content with what has happened in the past doesn't mean that it would tolerate an open model that could be truly destabilizing.

  • As a third party I would rather be happy about the way Chinese labs are acting in the here and now while US labs first masquerade as a public good, then turn around, bail on all promises of open AI, turn into a corporation and attempt to own the world while its runner-up is trying to scaremonger people into buying their product.

    I know most Americans are fed a steady diet of “evil China” and China MAY have issues. But on the AI front they are heaps better. Even if everything got closed tomorrow, we have a plethora of good models we can inspect and tweak, while from the US labs we have… a single old 120B model?

    And with the way the US is treating its allies, maybe a bunch of us are quite content with a more even match rather than US hegemony.

Since they hide their thinking traces, it really doesn't make much sense. One of the degradations they fixed, which they talked about in a recent blog post, was that if you left Claude Code idle for too long, they would rehydrate it without the thinking traces in the context, and that degraded performance. So direct forms of distillation wouldn't be expected to get results as good as the ones they are getting.

However, they could have used it as a judge etc. during training.

What they're trying to do under the umbrella of "national security" is to legislate how we can use the results we pay for when accessing these models. This way they will control the "intellectual property" that was acquired illegally.

Two wrongs don't make a right

  • In this scenario it does, because consumers win.

    Everyone in the AI industry wants to fight dirty, but gets angry when their competitor fights dirty as well. And I’ve mentioned before how I generally like Ant and its products.

  • The closest analogy to distillation is API reimplementation, without which the current software industry wouldn’t exist.

    There’s nothing fundamentally wrong with distillation.

> The public’s life is getting worse while these companies consolidate power using data they stole from the public

How can you “steal” public information?

  • Really? You know this just like everyone else: just because the information is available publicly does not mean that you can do whatever you want with it. Copyright exists for a reason, and if the copyright lobby is going to continue to push for the poor, poor media companies to keep their copyrights, then we should hold the AI companies to the same standard. So yes, they stole the information from everyone else, and they keep doing so, as you can see from their scanners still hitting every website on the web to get an updated dataset. It does not matter what they do AFTER they steal all the information, as they already stole it.