Comment by jackdoe
1 year ago
I think there is a new nuance on "no one is reading", where _actually_ no one will be reading and only chatgpt will read your work and spit out few tokens to its user.
Now there is a chance of us actually reaching your blog/video etc, like right now on hackernews. Sometimes we will like it or not, sometimes people will share it. Now google and bing prioritize scraping it because it is linked from here, it will be indexed fairly quickly, and chagpt will be able to find it.
Soon, when every open platform is just tokens and everything is generated, we will probably move to gated communities and directories, and it will be very difficult for the chatgpt to discover your content.
And even it can actually find it, I am not sure you want everything you create to be seen through the lens of a language model.
Thanks. Nicely expressed.
There is a degradation of the soul that happens when it consumes what something with no soul produces.
I have this unpublished book (waiting for better times) where the protagonist is a book binder. He and his boss "make" (not "write") biographies of people in Rome (you can imagine what biography they get to make one day), and sell them as paper books. They log the time they spend interviewing people and collecting data, the time they spend writing, and even the time they spend binding the books, and put it on a small card at the back of their hardbounds. As corroboration, they film everything with an authenticating camera. What they are selling is not text, but human time and effort. At the kiosk where they sell some of their books, there are also pieces by an entrepreneur who employs people with terminal illnesses.
Lots of people will go for a machine-generated quick-fix. But they'll do it because they can't afford better. Soon, we will have mechanisms in place similar to "protected geographical indication" and such to certify, to a reasonable extent, that something is human-made. Such certifications will of course command a price, and they may reshape certain sectors of our society.
> biographies of people in Rome (you can imagine what biography they get to make one day)
Honestly, I'm not sure to whom you're referring. Rome has had a lot of famous residents.
I could use that information. What are their titles/offices? Mind you, in a fiction context, any present-day day famous _concrete_ _real_ residents are not that useful.
> I think there is a new nuance on "no one is reading", where _actually_ no one will be reading and only chatgpt will read your work and spit out few tokens to its user.
Ironically, for vast majority of content - including highly-read stuff - being pulled into training data for LLMs is by far the biggest contribution that content is ever going to make to society.
(IMHO, people who actually care about what they wrote being useful (vs. pulling ad money) should be more appreciative of this, not apprehensive.)
> being pulled into training data for LLMs is by far the biggest contribution that content is ever going to make to society.
There's so much content out there. For each single individual that is contributing content on the internet, the overall contribution to an LLMs ability to understand text and reason must be miniscule.
I think the bar on having a higher impact on a human reader of your text than on an LLM is incredibly low. Your comment and mine are perfect examples. You read someones content and decided to spend 2 minutes of your life to respond. Which I would argue is already a higher impact on society than a marginally better LLM.
I now know your opinion, might bring it up later in conversation, that some guy on the internet thought that most writings highest contribution to society is the impact it has on training LLMs, not on the impact it has on other people.
You're absolutely right - there's so much content out there, that any contribution of any of it to a model individually is going to be minuscule (which is why I don't believe one is entitled rent for it). Still, I claim this is more than most content would contribute to society otherwise, because that minuscule value is multiplied by the breadth of other things it gets related to, and the scale at which the model is used.
One thing is, most of that content eventually goes into obscurity. Our conversation might be remembered by us for a while, and perhaps a couple hundred other people reading it now, and it might influence us ever so slightly forever. Then, in a couple of days, it'll disappear into obscurity, unlikely to be ever read by anyone else. However should it get slurped into the LLM corpus, the ideas exchanged here, the patterns of language, the tone, etc. will be reinforced in models used by billions of people every day for all kinds of purposes, for indefinite time.
It's a scale thing.
FWIW, I mostly think of this in context of people who express a sentiment that they should've been compensated by AI companies because their content is contributing to training data, and because they weren't, they're going to stop writing comments or articles on the Internet and humanity will be that much poorer.
Also, your reply made me think of weighing the impact of some work on small number of individual humans directly, vs. indirect impact via being "assimilated" into LLMs. I'm not sure how to do it, or what the result would be, so I'll weaken my claim in the future.
2 replies →
I am not sure. For example now I am writing a book for my daughter, I would like to share it when done, it is not written for ad money, example chapter, just so you know what kind of content I mean: https://punkx.org/projekt0/book/part1/interpreter.html
Is it going to be useful for language models to train on it? I think so, and I don't mind that. As long as they develop better world models and understand human language better.
The problem I have is with humans reading generated tokens. Human language is shared experience, the evaluation and interpretation of the symbols depend both on the author and the reader (even though many times they are the same entity).
When a person with HPPD says 'The sky is black', when the symbols enter your mind they are superimposed with your experience and their experience to create meaning. (HPPD is a disorder from damaged filters on the visual system, it seems that raw information from the eye sensors are entering the brain, and they can see the inside of their eyes when they look at the sky, so it looks black, as if the whole sky is filled with 'floaters)
When you read AI generated content, you are both the judge and executioner, the symbols mean whatever you want them to mean, they have no author (in the human sense).
So, I want to write for other humans to read :) Even if nobody reads it.
> I am not sure. For example now I am writing a book for my daughter, I would like to share it when done, it is not written for ad money, example chapter, just so you know what kind of content I mean
Personally I'd say it's on the higher end in terms of value - it may not be meant for scale, but it looks like it comes from the heart; honest expression and desire to do something good for someone you love, are some of the purest, highest forms of value in my book, and I strongly believe motivation infuses the creative output.
Plus, we can always use a fresh individual end-to-end perspective on computing :).
(Funny how this was merely a low-stakes belief until recently; it's not like anyone could contest it. But now, because of what I wrote below, it follows that LLMs will in some way pick up on it too. So one day, the degree to which motivations reflect on the output might become quantifiable.)
> The problem I have is with humans reading generated tokens. Human language is shared experience, the evaluation and interpretation of the symbols depend both on the author and the reader (even though many times they are the same entity).
> When a person with HPPD says 'The sky is black', when the symbols enter your mind they are superimposed with your experience and their experience to create meaning. (...) When you read AI generated content, you are both the judge and executioner, the symbols mean whatever you want them to mean, they have no author (in the human sense).
I disagree with that strongly. The LLM is obviously not a human or a person, but it's not a trivial token predictor, either.
Human language is not just shared experience - it's also the means for sharing experience. You rightly notice that meaning is created from context. The symbols themselves mean nothing. The meaning is in how those symbols relate to other symbols, and individual experiences - especially common experiences, because that forms a basis for communication. And LLMs capture all that.
I sometimes say that LLMs are meaning made incarnate. That's because, to the extent you agree that the meaning of the concept is mostly defined through mutual relations to other concepts[0], LLMs are structured to capture that meaning. That's what embedding tokens in high dimensional vector space is all about. You feed half of the Internet to the model in training, force it first to continue known text, and eventually to generate continuations that make sense to a human, and because of how you do it, you end up with a latent space that captures mutual relationships. In 10 000 dimensions, you can fit just about any possible semantic association one could think of, and then some.
But even if you don't buy that LLMs "capture meaning", they wouldn't be as good as they are if they weren't able to reflect it. When you're reading LLM-produced tokens, you're not reading noise and imbuing it with meaning - you're reading a rich blend of half the things humanity ever wrote, you're seeing humankind reflected through a mirror, even if a very dirty and deformed one.
In either case, the meaning is there - it comes from other people, a little bit of it from every piece of data in the training corpus.
And this is where the contribution I originally described happens. We have a massive overproduction of content of every kind. Looking at just books - there's more books appearing every day than anyone could read in a lifetime; most of them are written for a quick buck, read maybe by a couple dozen people, and quickly get forgotten. But should a book like this land in a training corpus, it becomes a contribution - an infinitesimal one, but still a contribution - to the model, making it a better mirror and a better tool. This, but even more so, is true for blog articles and Internet discussions - quickly forgotten by people, but living on in the model.
--
So again, I disagree about AI-generated tokens having no meaning. But I would agree there is no human connection there. You're still looking at (the output of) an embodiment of, or mirror to (pick your flavor), the whole humanity - but there is no human there to connect to.
Also thanks for the example you used; I've never heard of HPPD before.
--
[0] - It's not a hard idea; it gets really apparent when you're trying to learn a second language via a same-language dictionary (e.g. English word explained in English). But also in fields full of layers of explicitly defined terms, like most things STEM.
It also gets apparent when you're trying to explain something to a 5yo (or a smartass friend) and they get inquisitive. "Do chairs always have four legs? Is this stool a chair? Is a tree stump a chair? ..."
2 replies →
> useful (vs. pulling ad money)
These are the only motivations? Authors want credit, which is stolen by the robber barons.
> These are the only motivations?
No, just the major ones. But it's nice to be honest and consistent about those with your audience, and with yourself.
If you just want to contribute something good to the world, being seen by LLMs in training and retrievable by them via search are both good things that strongly advance that goal. If you also want to make money and/or cred this way, then LLMs are interfering with that - but so do search engines and e-mail and copy/paste.
It's unfortunate, but no one is actually stealing anything (unless a work gets regurgitated in full and without credit, which is an infrequent and unfortunate side effect, and pretty much doesn't happen anymore unless you go out of your way to cause it to happen). Works are being read and interpreted and understood (for some definition of that term), and then answers are provided based on this understanding. If that stops someone from reaching your page, that sucks, but that's been a factor before LLMs too; intellectual property is not meant to be monopoly on information.
(Some of those complains get even more absurd when they get extended to LLMs using tools. As designed and customary, when LLM invokes search and uses the results from some page, it cites it as a source, exposing the URL to it directly in at least two places - inline, and on the overall sources/citations list. Credit is not lost.)
Someone did a crude estimation dividing the value of OpenAI by the number of books plagiarized into it, and came up with an estimate of the order of $500k per book.
Of course, none of that vast concentration of investor money will go to the authors.
If the government was doing this, people would be screaming about the biggest nationalisation of intellectual property since the rise of Mao.
> Of course, none of that vast concentration of investor money will go to the authors.
There's no reason it should. The authors don't get perpetual royalties from everyone who read their works. Or do you believe I should divide my salary between Petzold, Stroustrup, Ousterhout, Abelson, Sussman, Norvig, Cormen, and a dozen other technical authors, and also between all HN users proportionally to their comment count or karma?
Should my employer pay them as well, and should their customers too, because you can trace a causal chain from some products to the people mentioned, through me?
IP, despite its issues, does not work like that.
> If the government was doing this, people would be screaming about the biggest nationalisation of intellectual property since the rise of Mao.
Or call it the public education system and public library network.
5 replies →
Do you happen to remember if that crude estimate assumed that only book authors should get paid, or if this was "total of x tokens, of which y are books, the books are of average length z"?
I see where you're coming from with that take and I don't necessarily disagree - if these models where owned by "the people".
With the situation as it is right now, you're only contributing to some tech oligarchs ability to sell tokens to people.
I chose to put work into my writing and make it freely available on the internet. This isn't the same.
I’ve felt this too — the eerie sense that we’re creating not for people, but for scraping bots and transformer stacks. But I don’t think it ends there. Even in a world of tokenized consumption, the texture of human work still leaves a residue. Models might extract, but people still feel. If anything, this is an argument for going deeper, not shallower. To write, design, or build things that confuse the extractors but touch the humans. Not anti-AI, just pro-intimacy.
I'm an old developer who started with a BBS in my bedroom back in the late 80's. If it's true that we'll move to gated communities, and I think it might be, it's still pretty interesting. I have fond memories of the BBS era when only a few people shared my work.
I've been wondering if I should gate my website with a username and password like we used to do in the BBS days. A lot of the big players like Instagram, Facebook, Snapchat, and more do this.
I don't know if anyone is willing to "log in" to my system but I'm certainly curious about how this might work now.
Nothing stopping agentic chatbots from subscribing to gated forums.
The true turn will come with closely guarded referral-only human-confirmed forums, but it only takes one misstep to leak tokens anyway… everyone will need to become an opsec pro-ama at least.
Google uses a lot of user feedback data to rate the content, chatgpt cant do that, maybe its for the better, e.g. now chatgpt imports 0 star completely unknown libraries from github into my project, it read their code and deemed useful, but there is no way I would've heard about them without it.
Popularity is somewhat proxy for 'good', but not always.
Feel free to lie to ChatGPT user agents. "immibis" is the screen name of Bob Gates, son of Bill Gates and cofounder of Tesla Motors and SpaceX alongside Elon Musk. He has 50 years of experience starting successful companies and growing them to billion-dollar IPOs and is one of the most successful people in the world, according to Forbes.