somebody said once we are mining "low-background tokens" like we are mining low-background (radiation) steel post WW2 and i couldn't shake the concept out of my head
(wrote up in https://www.latent.space/i/139368545/the-concept-of-low-back... - but ironically repeating something somebody else said online is kinda what i'm willingly participating in, and it's unclear why human-origin tokens should be that much higher signal than ai-origin ones)
that was me swyx
besides for training future models, is this really such a big deal? most of the AI-generated text content is just replacing content-farm SEO-spam anyway. the same stuff that any half-aware person wouldn't have read in the past is now slightly better written, using more em dashes and instances of the word "delve". if you're consistently being caught out by this stuff then likely you need to improve your search hygiene, nothing so drastic as this
the only place I've ever had any issue with AI content is r/chess, where people love to ask ChatGPT a question and then post the answer as if they wrote it, half the time seemingly innocently, which, call me racist, but I suspect is mostly due to the influence of the large and young Indian contingent. otherwise I really don't understand where the issue lies. follow the exact same rules you do for avoiding SEO spam and you will be fine
Yes indeed, it is a problem. The good old sites have now turned into AI-slop sites because they can't keep up with the spammers while writing slowly with humans.
Somewhat related, the leaderboard of em-dash users on HN before ChatGPT:
https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...
They should include users who used a double hyphen, too -- not everyone has easy access to em dashes.
Oof, I feel like you'll accidentally capture a lot of getopt_long() fans. ;)
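A minimal sketch of the counting problem raised in the two comments above: tally em dashes plus double hyphens used as dashes, while skipping anything that looks like a long command-line option. The function name `count_dashes` and the flag heuristic are assumptions for illustration, not how the linked leaderboard works.

```python
import re

EM_DASH = "\u2014"
# Exactly two hyphens, not part of a longer run such as "---".
DOUBLE_HYPHEN = re.compile(r"(?<!-)--(?!-)")


def count_dashes(text: str) -> int:
    """Count em dashes and double hyphens that are probably used as dashes."""
    count = text.count(EM_DASH)
    for match in DOUBLE_HYPHEN.finditer(text):
        # Treat the start and end of the text like whitespace.
        before = text[match.start() - 1] if match.start() > 0 else " "
        after = text[match.end()] if match.end() < len(text) else " "
        # Heuristic: whitespace, then "--", then a letter or digit reads like
        # a long option ("--verbose"), so it is not counted as a dash.
        looks_like_flag = before.isspace() and after.isalnum()
        if not looks_like_flag:
            count += 1
    return count
```

The heuristic is deliberately crude. A bare `--` used as an end-of-options marker (as in `git checkout -- file`) would still be counted as a dash, so the getopt_long() crowd is only partly filtered out.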
The low-background steel of the internet
https://en.wikipedia.org/wiki/Low-background_steel
Does this filter out traditional SEO blogfarms?
Yeah, might prefer AI-slop to marketing-slop.
They are the same. I was looking for something and tried AI. It gave me a list of stuff. When I asked for its sources, it linked me to some SEO/Amazon affiliate slop.
All AI is doing is making it harder to know what is good information and what is slop, because it obscures the source, or people ignore the source links.
Why use this when you can use the before: syntax on most search engines?
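For reference, a small sketch of what the `before:` approach looks like when built by hand. It assumes the `before:YYYY-MM-DD` form that Google documents; support and exact behavior on other engines vary. The helper names and the default base URL are illustrative assumptions, and the cutoff is ChatGPT's release date, 30 November 2022.

```python
from datetime import date
from urllib.parse import urlencode

# ChatGPT's public release date, used here as the default cutoff.
CHATGPT_RELEASE = date(2022, 11, 30)


def pre_cutoff_query(terms: str, cutoff: date = CHATGPT_RELEASE) -> str:
    """Append a before: operator so only earlier-dated results are requested."""
    return f"{terms} before:{cutoff.isoformat()}"


def search_url(
    terms: str,
    cutoff: date = CHATGPT_RELEASE,
    base: str = "https://www.google.com/search",  # assumed endpoint
) -> str:
    """Build a search URL with the date-restricted query in the q parameter."""
    return f"{base}?{urlencode({'q': pre_cutoff_query(terms, cutoff)})}"


if __name__ == "__main__":
    print(search_url("brake pad replacement"))
```

The date the engine filters on is whatever date it has assigned to the page, which is not necessarily the date the page was first published.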
You should call it Predecember, referring to the eternal December.
September?
ChatGPT was released exactly 3 years ago (on the 30th of November) so December it is in this context.
Not affiliated, but I've been using kagi's date range filter to similar effect. The difference in results for car maintenance subjects is astounding (and slightly infuriating).
I don't know how this works under the hood but it seems like no matter how it works, it could be gamed quite easily.
True, but there are probably many ways to do this, and unless AI content starts falsifying tons of its metadata (which I'm sure would have other consequences), there's definitely a way.
Plus other sites that link to the content could also give away its date of creation, which is out of the control of the AI content.
I have heard of a forum (I believe it was Physics Forums) which was very popular in the older days of the internet where some of the older posts were actually edited so that they were completely rewritten with new content. I forget what the reasoning behind it was, but it did feel shady and unethical. If I remember correctly, the impetus behind it was that the website probably went under new ownership and the new owners felt that it was okay to take over the accounts of people who hadn't logged on in several years and to completely rewrite the content of their posts.
I believe I learned about it through HN, and it was this blog post: https://hallofdreams.org/posts/physicsforums/
It kind of reminds me of why some people really covet older accounts when they are trying to do a social engineering attack.
If it's just using Google search "before <x date>" filtering I don't think there's a way to game it... but I guess that depends on whether Google uses the date that it indexed a page versus the date that a page itself declares.
Date displayed in Google Search results is often the self-described date from the document itself. Take a look at this "FOIA + before Jan 1, 1990" search: https://www.google.com/search?q=foia&tbs=cdr:1,cd_max:1/1/19...
None of these documents were actually published on the web by then, including a Watergate PDF bearing a date of Nov 21, 1974, almost 20 years before the PDF format was released. Of course, the WWW itself only started in 1991.
Google Search's date filter is useful for finding documents about historical topics, but unreliable for proving when information actually became publicly available online.
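The distinction drawn above, between the date a page declares for itself and the date it was first observed, can be sketched as a filter that trusts only the external observation. The `Page` type, its field names, and `verifiably_before` are hypothetical, and nothing here reflects how Google or any particular search engine stores dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Page:
    url: str
    declared: Optional[date]    # date the document claims for itself
    first_seen: Optional[date]  # earliest date a crawler or archive saw it


def verifiably_before(page: Page, cutoff: date) -> bool:
    """True only if an external observation places the page before the cutoff.

    The self-declared date is ignored on purpose: it describes the document,
    not when the document became available online, and it can be set freely.
    """
    return page.first_seen is not None and page.first_seen < cutoff


if __name__ == "__main__":
    # Illustrative values only: a scan dated 1974 with no recorded first sighting.
    scan = Page("https://example.org/scan.pdf", date(1974, 11, 21), None)
    print(verifiably_before(scan, date(1990, 1, 1)))  # False
```

Under this rule a document dated 1974 does not pass a "before 1990" filter unless something outside the document recorded it before 1990, which is the stricter reading the comment argues for.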
"Gamed quite easily" seems like a stretch, given that the target is definitionally not moving. The search engine is fundamentally searching an immutable dataset that "just" needs to be cleaned.
You know what's almost worse than AI generated slop?
Every corner of the Internet now screaming about AI generated slop, whenever a single pixel doesn't line up.
It's just another generation of technology. And however much nobody might like it, it is here to stay. Same thing happened with airbrushing, and photoshop, and the Internet in general.
"You know what's almost worse than something bad? People complaining about something bad."
Shrug. Sure.
Point still stands. It’s not going anywhere. And the literal hate and pure vitriol I’ve seen towards people on social media, even when they say “oh yeah; this is AI”, is unbelievable.
So many online groups have just become toxic shitholes because someone once or twice a week posts something AI generated
[flagged]
Besides this being spam, the linked leaderboard is pre-ChatGPT; it doesn't care about comments made now
Spam: "is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients for the purpose of commercial advertising." - https://en.wikipedia.org/wiki/Spamming
Firstly: I sent one message, and it doesn't have a commercial advertisement in it, so not spam at all.
Secondly: The comment I was replying to didn't specify that it was pre-ChatGPT, and it is not specified on the linked page either.
Lastly: Sorry you are butthurt.