Comment by perfmode

5 days ago

> We believe in training our models using diverse and high-quality data. This includes data that we’ve licensed from publishers, curated from publicly available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot.

> We do not use our users’ private personal data or user interactions when training our foundation models. Additionally, we take steps to apply filters to remove certain categories of personally identifiable information and to exclude profanity and unsafe material.

> Further, we continue to follow best practices for ethical web crawling, including following widely-adopted robots.txt protocols to allow web publishers to opt out of their content being used to train Apple’s generative foundation models. Web publishers have fine-grained controls over which pages Applebot can see and how they are used while still appearing in search results within Siri and Spotlight.

Respect.

55 comments

bitpush 5 days ago

When Apple inevitably partners with OpenAI or Anthropic, which by their definition aren't doing "ethical crawling", I wonder how I should be reading that.

jhickok 5 days ago
They already partnered with OpenAI, right?
- DSingularity 5 days ago
  
  To use their APIs at a discount, so what?
  
wmf 5 days ago
In theory Apple could provide their training data to be used by OpenAI/Anthropic.
- bitpush 5 days ago
  
  It isn't "apple proprietary" data to give it to OpenAI.
  Also the bigger problem is, you can't train a good model with smaller data. The model would be subpar.
bigyabai 5 days ago

"Good artists copy; great artists steal"
- Famous Dead Person
brookst 5 days ago

I mean they also buy from companies with less ethical supply chain practices than their own. I don’t know that I need to feel anything about that beyond recognizing there’s a big difference between exercising good practices and refusing to deal with anyone who does less.
napierzaza 5 days ago

[dead]
fridder 5 days ago

Same way as the other parts of their supply chain I suppose.

darkoob12 5 days ago

You shouldn't believe Big Tech on their PR statements.

They are decades behind in AI. I have been following AI research for a long time. You can find the best papers of the past 15 years published by Microsoft, Google, and Facebook, but not Apple. I don't know why, but they didn't care about AI at all.

I would say this is PR to justify their AI state.

ACCount36 5 days ago
Apple used to be at the edge of AI. They shipped Siri before "AI assistant" went mainstream, they were one of the first to ship an actual NPU in consumer hardware and put neural networks into features people use. They were spearheading computational photography. They didn't publish research, they're fucking Apple, but they did do the work.
And then they just... gave up?
I don't know what happened to them. When AI breakthrough happened, I expected them to put up a fight. They never did.
- lynx97 5 days ago
  
  > I don't know what happened to them.
  Tim Cook happened. The fish rots from the head down.
  
- mlnj 5 days ago
  
  >I don't know what happened to them. When AI breakthrough happened, I expected them to put up a fight. They never did.
  Apple always had the luxury of time. They work heavily on integrating deeply into their ecosystems without worrying about the pace of the latest development. E.g., widgets were a 2023 feature for iOS. They do it late, but do it well.
  The development in the LLM space was and is too fast for Apple to compete in. They usually pave their own path and stay in their lane as a leader. Apple's brand image will be tarnished if Google, Meta, OpenAI, and MS all leapfrog Apple's models every 2-3 months. That's just not what the Apple brand is associated with.

simonw 5 days ago

One problem with Apple's approach here is that they were scraping the web for training data long before they published the details of their activities and told people how to exclude them using robots.txt

conradev 5 days ago
They documented it in 2015: https://www.macrumors.com/2015/05/06/applebot-web-crawler-si...
dijit 5 days ago
Uncharitable.
Robots.txt is already the understood mechanism for getting robots to avoid scraping a website.
- simonw 5 days ago
  
  People often use specific user agents in there, which is hard if you don't know what the user agents are in advance!
  
- pjmlp 4 days ago
  
  Assuming well behaved robots.
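
To make the user-agent point in this sub-thread concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It assumes a publisher who wants to opt out of training use while staying crawlable for search. The `Applebot-Extended` token, the sample rules, the `example.com` URL, and the `allowed` helper are illustrative assumptions, not something quoted in the thread, and the parser's substring-style agent matching is a simplification of what any real crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named agent, allow everyone else.
# "Applebot-Extended" is assumed here as the training opt-out token.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/post"
    for agent in ("Applebot-Extended", "Applebot", "SomeOtherBot"):
        print(agent, allowed(agent, page))
```

A rule like this only works if the publisher knows the agent name in advance, and it only binds crawlers that choose to honor it, which are the two objections raised above.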

astrange 5 days ago

> Using our web crawling strategy, we sourced pairs of images with corresponding alt-texts.

An issue for anti-AI people, as seen on Bluesky, is that they're often "insisting you write alt text for all images" people as well. But this is probably the main use for alt text at this point, so they're essentially doing annotation work for free.

simonw 5 days ago
I think it is entirely morally consistent to provide alt text for accessibility even if you personally dislike it being used to train AI models.
- astrange 5 days ago
  
  It's fine if you want to, but I think they should consider that basically nobody is reading it. If it was important for society, photo apps would prompt you to embed it in the image like EXIF.
  Computer vision is getting good enough to generate it; it has to be, because real-world objects don't have alt text.
  
barbazoo 5 days ago
> An issue for anti-AI people, as seen on Bluesky, is that they're often "insisting you write alt text for all images" people as well. But this is probably the main use for alt text at this point, so they're essentially doing annotation work for free.
How did you come to the conclusion that those two groups overlap so significantly?
- Karrot_Kream 4 days ago
  
  This is a well known fact. A bunch of AI researchers tried to migrate to the platform from Twitter but got a ton of hate and death threats from other users so they went back. Bluesky has a pretty strong anti-AI bias and the community of folks talking about it despite that is very small.
- astrange 5 days ago
  
  Well that's easy, I read their posts where they say it.
  
godelski 5 days ago
> this is probably the main use for alt text at this point
Alt text gives you 2k characters. All I gotta say is there's quite a bit of poisoned data
ACCount36 5 days ago

[flagged]

aydyn 5 days ago

Respect, but it's going to be terrible compared to every other company. You can only hamstring yourself so much.

epolanski 5 days ago

Respect actions, not words and PR.

bigyabai 5 days ago

Gotta polish that fig-leaf to hide Apple's real stance towards user privacy: arstechnica.com/tech-policy/2023/12/apple-admits-to-secretly-giving-governments-push-notification-data/

> Apple has since confirmed in a statement provided to Ars that the US federal government "prohibited" the company "from sharing any information,"

brookst 5 days ago
I mean if you throw out all contrary examples, I suppose you are left with the simple lack of nuance you want to believe
- bigyabai 5 days ago
  
  All examples contrary to what? Admitting to being muzzled by feds?
  Take all the space you need to lay out your contrary case. Did the San Bernardino shooter predict this?
  