Comment by wraptile
2 months ago
Days of just getting data off the web are coming to an end, as everything requires a full browser running thousands of lines of obfuscated JS code now. So instead of a website giving me that 1kb of json that could be cached, I now start a full browser stack and transmit 10 megabytes through 100 requests, messing up your analytics and security profile, and everyone's a loser. Yay.
On the bright side, that opens an opportunity for 10,000 companies whose only activity is scraping 10MB worth of garbage and providing a sane API for it.
Luckily all that is becoming a non-issue, as most content on these websites isn't worth scraping anymore.
*and whose only customers are using it for AI training
They can afford it because the market rightfully bets on such trained models being more useful than upstream sources.
In fact, at this point in time (it won't last), one of the most useful applications of LLMs is to have them deal with all the user-hostile crap that's the bulk of the web today, so you don't have to suffer through it yourself. It's also the easiest way to get any kind of software interoperability at the moment (this will definitely not last long).
This 1kb of json still sounds like a modern thing, where you need to download many MB of JavaScript code just to fetch and display that 1kb of json data.
What you want is to just download the 10-20kb html file, maybe a corresponding css file, and any images referenced by the html. Then if you want the video, you just get the video file directly.
Simple and effective, unless you have something to sell.
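For what it's worth, that direct approach really is only a few lines of code. A minimal sketch using Node's built-in fetch, assuming a plain server-rendered page; the URL and the naive image regex are just for illustration:

```ts
// Fetch the HTML document itself -- a few tens of kilobytes, no JS runtime needed.
const pageUrl = "https://example.com/article/42"; // hypothetical URL
const html = await (await fetch(pageUrl)).text();

// Pull out referenced images with a naive regex; a real scraper would use an HTML parser.
const imgUrls = [...html.matchAll(/<img[^>]+src="([^"]+)"/g)]
  .map((m) => new URL(m[1], pageUrl).href);

// Download each asset directly, no browser stack involved.
for (const url of imgUrls) {
  const res = await fetch(url);
  const bytes = await res.arrayBuffer();
  console.log(`${url}: ${bytes.byteLength} bytes`);
}
```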
The main reason for doing video through JS in the first place, other than obfuscation, is adaptive bitrate support. Oddly enough, some TVs will support adaptive-bitrate HLS directly, as will Apple devices I believe, but not regular browsers. See https://github.com/video-dev/hls.js/
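For reference, wiring up hls.js is only a handful of lines per the library's documented API. A sketch with a hypothetical manifest URL:

```ts
import Hls from "hls.js";

const video = document.querySelector<HTMLVideoElement>("video")!;
const manifest = "https://example.com/stream/master.m3u8"; // hypothetical URL

if (Hls.isSupported()) {
  // Regular browsers: hls.js fetches the playlist and feeds segments in via Media Source Extensions.
  const hls = new Hls();
  hls.loadSource(manifest);
  hls.attachMedia(video);
} else if (video.canPlayType("application/vnd.apple.mpegurl")) {
  // Safari / Apple devices (and some TVs) play HLS natively.
  video.src = manifest;
}
```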
> unless you have something to sell
Video hosting and its moderation is not cheap, sadly. Which is why we don't see many competitors.
P2P services proved long ago that hosting is not a problem. Politics is a problem.
What we don't see are more web video services, or services that successfully trick varied content creators into uploading regularly to their platform.
https://en.wikipedia.org/wiki/PeerTube must also be mentioned here.
And by "not many" you really mean zero competitors.
(before you ask: Vimeo is getting sold to an enshittification company)
It's an arms race. Websites have become stupidly/unnecessarily/hostilely complicated, but AI/LLMs have made it possible (though more expensive) to get whatever useful information exists out of them.
Soon, LLMs will be able to complete any Captcha a human can within reasonable time. When that happens, the "analog hole" may be open permanently. If you can point a camera and a microphone at it, the AI will be able to make better sense of it than a person.
The future will just be every web session gets tied to a real ID and if the service detects you as a bot you just get blocked by ID.
> The future will just be every web session gets tied to a real ID
This seems like an awful future. We already had this in the form of limited IPv4 addresses, where each IP is basically an identity. People started buying up IP addresses and selling them as proxies. So any other form of ID would suffer the same fate unless enforced at the government level.
Worst case scenario, we have 10,000 people sitting in front of screens clicking page links, because hiring someone to use their "government id" to mindlessly browse the web is the only way to get data off the public web. That's not the future we should want.
I definitely agree logins will be required for many more sites, but how would the site be able to distinguish humans from bots controlling the browser? Captcha is almost obsolete. ARC AGI is too cumbersome for verifying every time.
Please remember that an LLM accessing any website isn't the problem here. It's the scraping bots that saturate the server bandwidth (a DoS attack of sorts) to collect data to train the LLMs with. An LLM solving a captcha or an Anubis-style proof-of-work problem isn't a big concern here, because the worst it's going to do with the collected data is cache it for later analysis and reporting. Unlike the crawlers, LLMs don't have any incentive to suck up huge amounts of data like a giant vacuum cleaner.
Scraping was a thing before LLMs; there's a whole separate arms race around this for regular competition and "industrial espionage" reasons. I'm not really sure why model training would become a noticeable fraction of scraping activity - there are only a few players on the planet that can afford to train decent LLMs in the first place, and they're not going to re-scrape the content they already have ad infinitum.
And it's all to sell more ads.
fortunately it is now easier than ever to do small-scale scraping, the kind yt-dlp does.
I can literally just go write a script that uses headless firefox + mitmproxy in about an hour or two of fiddling, and as long as I don't then go try to run it from 100 VPSs and scrape their entire website in a huge blast, I can typically archive whatever content I actually care about, basically no matter what protection mechanisms they have in place. Cloudflare won't detect a headless firefox at low rates (and by "low" I mean basically anything you could do off your laptop from your home IP), and modern browser scripting is extremely easy, so you can often scrape things with mild single-person effort even if the site is an SPA with tons of dynamic JS. And obviously at low scale you can just solve captchas yourself.
I recently wrote a scraper script that just sent me a Discord ping whenever it ran into a captcha; I'd go look at my laptop, fix it, and then let it keep scraping. I was archiving a comic I paid for, but it was locked in a walled-garden app that obviously didn't want you to even THINK of controlling the data you paid for.
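For anyone curious, here's roughly the shape of that kind of script (not the actual one) using Playwright's Node API; the page URLs, the captcha selector, and the webhook address are all hypothetical placeholders:

```ts
import { firefox } from "playwright";

const WEBHOOK = "https://discord.com/api/webhooks/..."; // hypothetical webhook URL
const urls = ["https://example.com/comic/1", "https://example.com/comic/2"]; // hypothetical pages

// Headed rather than headless here, so a human can actually click through a captcha when pinged.
const browser = await firefox.launch({ headless: false });
const page = await browser.newPage();

for (const [i, url] of urls.entries()) {
  await page.goto(url, { waitUntil: "networkidle" });

  // If something captcha-like shows up, ping the webhook and wait until it's gone.
  const captcha = page.locator("iframe[src*='captcha']"); // hypothetical selector
  if (await captcha.count()) {
    await fetch(WEBHOOK, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ content: `Captcha on ${url}, come solve it` }),
    });
    await captcha.waitFor({ state: "detached", timeout: 0 }); // resumes once a human solves it
  }

  // Archive the page and keep the request rate at "one person browsing" levels.
  await page.screenshot({ path: `page-${i}.png`, fullPage: true });
  await page.waitForTimeout(5_000 + Math.random() * 10_000);
}

await browser.close();
```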
> fortunately it is now easier than ever to do small-scale scraping, the kind yt-dlp does.
this is absolutely not the case. I've been web scraping since the 00s, when you could just curl any html or drive the browser with Selenium for simple automation, but now it's incredibly complex and expensive even with modern tools like playwright and all of the monthly "undetectable" flavors of it. Headless browsers are laughably easy to detect because they leak the fact that they are being automated and that they are headless. Not to even mention all of the fingerprinting.
> modern browser scripting is extremely easy, so you can often scrape things with mild single-person effort even if the site is an SPA with tons of dynamic JS.
I think he means the JS part is now easy to run and scrape compared to the transition period from basic download scraping to JS execution/headless browser scraping. It is more complex, but a couple of years ago the tools weren't as evolved as they are now.
mozilla-unified/dom/base/Navigator.cpp - find Navigator::Webdriver and make it always return false, then recompile.
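For context, `navigator.webdriver` is one of the leaks mentioned above: automated browsers report `true`, and detection scripts check for it. Short of recompiling Firefox as suggested, a common lighter-weight workaround (itself detectable, and no defense against broader fingerprinting) is to override the property before any page script runs, e.g. via Playwright's init-script hook. A sketch, with a hypothetical target URL:

```ts
import { firefox } from "playwright";

const browser = await firefox.launch({ headless: true });
const context = await browser.newContext();

// Runs in every page before its own scripts, so detection code sees `undefined`
// instead of `true`. Fingerprinting scripts have plenty of other signals, though.
await context.addInitScript(() => {
  Object.defineProperty(Navigator.prototype, "webdriver", { get: () => undefined });
});

const page = await context.newPage();
await page.goto("https://example.com"); // hypothetical target
console.log(await page.evaluate(() => navigator.webdriver)); // -> undefined
await browser.close();
```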
+1
I made a web scraper in Perl a few years ago. It no longer works because I need a headless browser now or whatever it is called these days.
Web scraping is MUCH WORSE TODAY[1].
[1] I am not yelling, just emphasizing. :)
Those days are not coming to an end:
* PeerTube and similar platforms for video streaming of freely-distributable content;
* BitTorrent-based mechanisms for sharing large files (or similar protocols).
Will this be inconvenient? At first, somewhat. But I am led to believe that in the second category one can already achieve a decent experience.
How many content creators have you written to, asking them to share their content on PeerTube or BitTorrent? How did they respond? How will they monetize?
1. Zero
2. N/A, but enough content creators on YT are very much aware of the kind of prison it is, especially in the years after the Adpocalypse.
3. Obviously, nobody should be able to monetize the copying of content. If it is released, it is publicly released. But they can use Liberapay/Patreon/Buy me a coffee, they can sell merch or signed copies of things, they can do live appearances, etc.
I think this is just another indication of how the web is a fragile equilibrium in a very adversarial ecosystem. And to some extent, things like yt-dlp and adblocking only work if they're "underground". Once they become popular - or there's a commercial incentive, like AI training - there ends up being a response.
Not only that, but soon it will require age verification and device attestation. Just in case you're trying to watch something you're not supposed to.
For now, yes, but soon Cloudflare and ever more annoying captchas may make that option practically impossible.
You should be thankful for the annoying captchas, I hear they're moving to rectal scans soon.
> Days of just getting data off the web are coming to an end
All thanks to great ideas like downloading the whole internet and feeding it into slop-producing machines fueling global warming in an attempt to make said internet obsolete and prop up an industry bubble.
The future of the internet is, at best, bleak. Forget about openness. Paywalls, authwalls, captchas and verification scans are here to stay.
The Internet was turned into a slop warehouse well before LLMs became a thing - in fact, a big part of why ChatGPT et al. have seen such extreme adoption worldwide is that they let people accomplish many tasks without having to inflict the shitfest that's the modern web on themselves.
Personally, when it became available, the o3 model in ChatGPT cut my use of web search by more than half, and it wasn't because Google became bad at search (I use Kagi anyway) - it's because even the best results are all shit, or embedded in shit websites, and the less I need to browse through that, the better for me.
> The Internet was turned into a slop warehouse well before LLMs became a thing
I suppose that's thanks to Google and their search algos favoring ad-ridden SEO spam. LLMs are indeed more appealing and convenient. But I fear that legitimate websites (ad-supported or otherwise) that actually provide useful information will be on the decline. Let's just hope then that updated information will find its way into LLMs when such websites are gone.
Do you know what Accelerate means?
I want them to go overboard. I want BigTech to go nuts on this stuff. I want broken systems and nonsense.
Because that’s the only way we’re going to get anything better.
Accelerationism is a dead-end theory with major holes in its core. Or I should say, "their" core, because there are a million distinct and mutually incompatible varieties. Everyone likes to say "gosh, things are awful, it MUST end in collapse, and after the collapse everyone will see things MY way." They can't all be right. And yet, all of them with their varied ideas still think it'll be a good idea to actively push to make things worse in order to bring on the collapse more quickly.
It doesn't work. There aren't any collapses like that to be had. Big change happens incrementally, a bit of refactoring and a few band-aids at a time, and pushing to make things worse doesn't help.
I'm not waiting for the collapse to fix things - I'm waiting for it so that I won't have any more distractions and I can go back to my books.
Look at history, things improve and then things get worse, in cycles.
During the "things get worse" phase, why not make it shorter?
If you showed me the current state of YouTube 8 years ago - multiple unskippable ads before each video, 5 midrolls for a 10 minute video, comments overran with bots, video dislikes hidden, the shorts hell, the dysfunctional algorithm, .... - I would've definitely told you "Yep, that will be enough to kill it!"
At this point I don't know - I still have the feeling that "they just need to make it 50% worse again and we'll get a competitor," but I've seen too many of these platforms get 50% worse too many times, and the network effect wins out every time.
It's classic frog boiling. I want them (for whatever definition of "them") to just nuke the frog from orbit.