Comment by cynicalsecurity
7 months ago
Okay, that looked a bit ridiculous in the pre-AI era (who needs to download the whole Wikipedia?), but now I can see the sense in it.
Too bad the AI scrapers don't care, and are melting Wikipedia's production servers anyway.
https://arstechnica.com/information-technology/2025/04/ai-bo...
I would like companies to start aggressively pushing back against AI scrapers using things like Anubis[0]. If you can't be a good steward of the internet or respectful of other people's resources, then people have the right to deny them to you.
[0] https://github.com/TecharoHQ/anubis
I bet someone like Cloudflare could pull the dataset each day and serve up a plain text/Markdown version of Wikipedia for rounding error levels of spend. I just loaded a random Wikipedia page and it had a weight of 1.5MB in all for what I worked out would be about 30KB of Markdown (i.e. 50x less bandwidth).
Of course, the problem then is getting all these scrapers and bots to actually use the alternative, but Wikimedia could potentially redirect suspected clients in that direction.
Someone suggested to me applying a filter that serves .md or .txt to bots/AI scrapers instead of the regular website. Seems smart if it works, but I hate it when I get captchas, and this could end up similarly detecting non-bots as bots.
Maybe a "view full website" link loaded via JS so bots don't see it, idk.
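A rough sketch of the filter idea above, just to make it concrete. Everything here is an assumption for illustration: the marker list, the function names, and the choice to match on the User-Agent string are all hypothetical, not anything Wikimedia or Cloudflare actually does. The last line redoes the page-weight arithmetic from upthread (1.5MB vs. about 30KB).

```python
# Hypothetical sketch: marker list and function names are illustrative assumptions.
SUSPECTED_BOT_MARKERS = ("bot", "crawler", "spider", "scraper")


def pick_representation(user_agent: str) -> str:
    """Return "markdown" for suspected bots and "html" for everyone else."""
    ua = user_agent.lower()
    if any(marker in ua for marker in SUSPECTED_BOT_MARKERS):
        return "markdown"
    return "html"


def bandwidth_ratio(full_page_kb: float, markdown_kb: float) -> float:
    """How many times heavier the full page is than the Markdown version."""
    return full_page_kb / markdown_kb


print(pick_representation("ExampleBot/1.0"))
print(pick_representation("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
print(bandwidth_ratio(1500, 30))
```

The obvious weakness is the one already raised: scrapers can send a browser-like User-Agent and slip through, and an unusual but legitimate client can get misclassified, so a "view full website" escape hatch would still be needed.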
I wonder if Wikipedia's recent switch to client-side rendering has hurt their performance too. Serving a prerendered page might have helped this situation. I don't know the details of their new system though.
Tragedy of the commons. And that’s why we can’t have nice things.
Because people are people. And will always prioritize egotism over respect for the common good.
No. When fragile resources are abused by one endpoint out of one hundred thousand, and that abuse is one hundred thousand times greater than anyone else's, how is that a condemnation of the "ways" of "all people"? What is justice?
But we have nice things. Wikipedia can deal with it just fine.
Anyone archiving the site. Wikipedia is, for its faults, one of the best-curated collections of summarized human knowledge, probably in history.
Replicating that knowledge helps build data resilience and protect it against all sorts of disasters. I used to seed their monthly data dump torrent for a while.
Oh no, even now there is plenty of use for it outside of AI training. Just think of all the schools in villages all around the world that don't have access to the internet or have a very limited connection. I've worked with folks who would set up local "Wikipedia servers" for schools so that kids could access Wikipedia via a local network connection. In other setups they just download all of Wikipedia to a set of laptops and you use one of the offline readers to browse it.
This is essentially the modern version of having a library of encyclopedias.
There's already a project to serve the use case you're describing (school in a disconnected village): Internet in a Box
https://internet-in-a-box.org/
They provide offline access to Wikipedia, OpenStreetMap, Project Gutenberg, and many other resources.
I'm thinking less about AI training and more about having a source of (reasonably) reliable information from the net, in case AI generated fake images and generated cross referenced texts start making it too difficult to discern real history from malicious rewrites. It's bad enough now, but can get much worse with the proliferation of AI agents.
Pre-2022 Wikipedia dumps will be analyzed by future historians.
Also helps save Wikipedia if it gets shut down - which might happen!
True. Musk for example is publicly attacking it for spreading "left-wing lies" because in his wiki page there are statements like "He has been criticized for making unscientific and misleading statements, including COVID-19 misinformation and promoting conspiracy theories, and affirming antisemitic, racist, and transphobic comments." which are just pure facts.
It would be nice to have something like this more decentralized.
I was perusing some recent discussions on sources with interest. It seems that Wikipedia's intelligentsia have managed to "blacklist" (deprecate or declare "generally unreliable") practically every prominent source of news in the US that is not centrist or leftist.
I kid you not; through a process of attrition they've attacked the very reliability and reputation of every source, including Fox News and the like, and they've told editors sitewide that they simply can't be cited as a "Reliable Secondary Source", like at all.
I am not sure if that is an accurate assessment of the situation on the ground for mainstream media, but it certainly exposes some real systemic bias.
And this is the highest-order and most enduring method of ingraining systemic bias in the project: by weeding out sources with unfavorable viewpoints and perspectives, saying they publish lies and untruth, and being able to prohibit them globally from any use.
And I was pondering this state of affairs and just thinking about Karoline Leavitt's press room, and wondering what will the landscape be, if there is precious little intersection between press outlets who may be favorable or deferent to the present administration, and those which are allowed to be cited on Wikipedia? Ouch!
No idea when your pre-AI era began, but I was much more excited to host Wikipedia locally 15 years ago than I am now.
> who needs to download the whole Wikipedia
Anyone who wants to have access while off-line, for whatever reason. This can range from something as simple as saving costs, via something more complicated such as accessing content from regions with spotty and/or expensive connectivity (you're on a ship out of reach of shore-based mobile networks, you do not have access to Starlink or something similar, you're deep in the jungle, deep underground, etc.), to some prepper scenario where connectivity ends at the cave entry because the 'net has ceased to exist.
I would like to have a less politically biased online encyclopedia for the latter scenario, it would be a shame to start a new society based on the same bad ideas which brought down the previous one. If ever a politically neutral LLM becomes available that'd be one of the first tasks I'd put it to: point out bias - any bias - in articles, encyclopedias and other 'sources' (yes, I know, WP is not an original source but for this purpose it is) of knowledge.
You don't need to be deep in the jungle. You might just not want to pay for mobile data. If your phone has an SD card slot, you can put in 1 TB of storage and have Wikipedia, a lifetime of music, tons of books, an atlas of your country for GPS navigation, and plenty of room for taking photos/videos. Storage is cheap enough that mobile data should be basically pointless.
Is there a "politically neutral" human? And if there was, what could that person reasonably say about politics?
I suspect "politically neutral" is a meaningless phrase. It's just a way for people to tar their political opponents by inference.
The problem is: even if you report only facts, there is an editorial function in choosing which facts to report, because it is physically impossible to report all facts. So someone can always point to some sort of bias on choosing which facts to report.
There are no politically neutral humans but there can be politically neutral publications. All you have to do to be politically neutral is treat all legal political ideologies the same without favouring one over the others. Wikipedia does not achieve this goal, not by far.
Genuine question, can you provide multiple explicit examples of such bias? I heard a lot of people railing against bias in Wikipedia, but no one provides any blatant examples of it.
A genuine answer, how about looking up some studies on this subject? Not those done by Wikipedia of course, they claim to be politically neutral after all.
Here's a few, from https://www.allsides.com/blog/wikipedia-biased
Six studies, including two from Harvard researchers, have found a left-wing bias at Wikipedia:
A 2024 analysis [1] by researcher David Rozado that used AllSides Media Bias Ratings [2] found Wikipedia associates right-of-center public figures with more negative sentiment than left-wing figures, and tends to associate left-leaning news organizations with more positive sentiment than right-leaning ones.
A Harvard study [3] found Wikipedia articles are more left-wing than Encyclopedia Britannica.
Another paper [4] from the same Harvard researchers found left-wing editors are more active and partisan on the site.
A 2018 analysis [5] found top-cited news outlets on Wikipedia are mainly left-wing.
Another analysis [6] using AllSides Media Bias Ratings found that pages on American politicians cite mostly left-wing news outlets.
American academics found [7] conservative editors are 6 times more likely to be sanctioned in Wikipedia policy enforcement.
There are far more sources out there.
If I show examples of biased pages - the one on Antifa is a good example - this will just devolve into a quibble about this or that sentence.
[1] https://davidrozado.substack.com/p/is-wikipedia-politically-...
[2] https://www.allsides.com/media-bias/ratings
[3] https://www.semanticscholar.org/paper/Do-Experts-or-Collecti...
[4] https://www.hbs.edu/faculty/Publication%20Files/17-028_e7788...
[5] https://archive.md/v4TFn
[6] https://archive.is/dDr7X
[7] https://thecritic.co.uk/the-left-wing-bias-of-wikipedia/
[flagged]
You have bad politics. This is bad politics.
No, you have bad politics.
This is not kindergarten so let's not go down this path. Asking for a politically neutral (see my explanation elsewhere in this thread if you don't understand what that means) source of information is not 'bad politics' but intended to avoid bad politics. I suspect that you 'identify' as either 'liberal' or 'progressive' so I assume you'd be less than thrilled if Wikipedia had a conservative bias. The same goes for conservatives and (traditional) capital-L Liberals who are less than thrilled to see Wikipedia having a 'left-wing' or 'progressive' bias. It just makes WP end up being lumped together with the legacy media, known to be untrustworthy where it counts, and that is a shame for a site which in many ways still is a valuable resource as long as you avoid any and all subjects which have been pulled into the polarised political discourse.
> based on the same bad ideas which brought down the previous one
I don’t think that’s fair. Not that Wikipedia is without bias, but that their ivory tower biases are worlds apart from the lying brutal animalistic Hollywood signals herding the masses in “our democracy”.