Comment by TeMPOraL

7 months ago

> I think there is a new nuance on "no one is reading", where _actually_ no one will be reading and only chatgpt will read your work and spit out a few tokens to its user.

Ironically, for the vast majority of content - including highly-read stuff - being pulled into training data for LLMs is by far the biggest contribution that content is ever going to make to society.

(IMHO, people who actually care about what they wrote being useful (vs. pulling ad money) should be more appreciative of this, not apprehensive.)

> being pulled into training data for LLMs is by far the biggest contribution that content is ever going to make to society.

There's so much content out there. For any single individual contributing content on the internet, the overall contribution to an LLM's ability to understand text and reason must be minuscule.

I think the bar for having a higher impact on a human reader of your text than on an LLM is incredibly low. Your comment and mine are perfect examples. You read someone's content and decided to spend 2 minutes of your life to respond - which, I would argue, is already a higher impact on society than a marginally better LLM.

I now know your opinion, and might bring it up later in conversation: some guy on the internet thought that most writing's highest contribution to society is the impact it has on training LLMs, not the impact it has on other people.

  • You're absolutely right - there's so much content out there that any individual piece's contribution to a model is going to be minuscule (which is why I don't believe one is entitled to rent for it). Still, I claim this is more than most content would contribute to society otherwise, because that minuscule value is multiplied by the breadth of other things it gets related to, and the scale at which the model is used.

    One thing is, most of that content eventually fades into obscurity. Our conversation might be remembered by us for a while, and perhaps by the couple hundred other people reading it now, and it might influence us ever so slightly forever. Then, in a couple of days, it'll disappear, unlikely to ever be read by anyone else. However, should it get slurped into an LLM training corpus, the ideas exchanged here, the patterns of language, the tone, etc. will be reinforced in models used by billions of people every day, for all kinds of purposes, for an indefinite time.

    It's a scale thing.

    FWIW, I mostly think of this in the context of people who express the sentiment that they should've been compensated by AI companies because their content contributes to training data - and that because they weren't, they're going to stop writing comments or articles on the Internet, and humanity will be that much poorer.

    Also, your reply made me think about weighing the direct impact of some work on a small number of individual humans vs. its indirect impact via being "assimilated" into LLMs. I'm not sure how to do that, or what the result would be, so I'll weaken my claim in the future.

    • Indeed, I also think it's a scale thing. Yes, this content we are producing right now will definitely fade into obscurity, and it is definitely part of what a model can use to derive patterns, tone, etc.

      However, in my opinion, cultural shifts, opinions, and norms are still mostly derived from interaction with your peers - be that (very human) conversations like the one we are having right now, or opinions held by "influencers" that are then discussed among your peer group. These are thousands of small interactions, each perhaps a very small experience, which all add up to form the views and actions of a society.

      I don't see LLMs playing a big role in this yet. People don't derive their opinions on abortion, for example, from ChatGPT. They derive them from group leaders, personal experience, and interactions with their peers.

      And in this context of small things contributing to something big, I would wager that all the small interactions we have with other humans do a lot more to form a society than they do to build an LLM. So, to your original point again: I don't think contributing to an LLM is the biggest contribution online content makes to society.

I am not sure. For example, I am now writing a book for my daughter, which I would like to share when it's done. It is not written for ad money. Here's an example chapter, just so you know what kind of content I mean: https://punkx.org/projekt0/book/part1/interpreter.html

Is it going to be useful for language models to train on it? I think so, and I don't mind that, as long as they develop better world models and understand human language better.

The problem I have is with humans reading generated tokens. Human language is shared experience; the evaluation and interpretation of the symbols depend on both the author and the reader (even though many times they are the same entity).

When a person with HPPD says 'The sky is black', the symbols enter your mind superimposed with both your experience and theirs to create meaning. (HPPD is a disorder caused by damaged filtering in the visual system; it seems that raw information from the eye's sensors enters the brain, and sufferers can see the inside of their eyes when they look at the sky, so it looks black, as if the whole sky were filled with 'floaters'.)

When you read AI-generated content, you are both judge and executioner: the symbols mean whatever you want them to mean; they have no author (in the human sense).

So, I want to write for other humans to read :) Even if nobody reads it.

  • > I am not sure. For example, I am now writing a book for my daughter, which I would like to share when it's done. It is not written for ad money. Here's an example chapter, just so you know what kind of content I mean

    Personally, I'd say it's on the higher end in terms of value - it may not be meant for scale, but it looks like it comes from the heart; honest expression and the desire to do something good for someone you love are some of the purest, highest forms of value in my book, and I strongly believe motivation infuses the creative output.

    Plus, we can always use a fresh individual end-to-end perspective on computing :).

    (Funny how this was merely a low-stakes belief until recently; it's not like anyone could contest it. But now, because of what I wrote below, it follows that LLMs will in some way pick up on it too. So one day, the degree to which motivations are reflected in the output might become quantifiable.)

    > The problem I have is with humans reading generated tokens. Human language is shared experience; the evaluation and interpretation of the symbols depend on both the author and the reader (even though many times they are the same entity).

    > When a person with HPPD says 'The sky is black', the symbols enter your mind superimposed with both your experience and theirs to create meaning. (...) When you read AI-generated content, you are both judge and executioner: the symbols mean whatever you want them to mean; they have no author (in the human sense).

    I disagree with that strongly. The LLM is obviously not a human or a person, but it's not a trivial token predictor, either.

    Human language is not just shared experience - it's also the means for sharing experience. You rightly notice that meaning is created from context. The symbols themselves mean nothing. The meaning is in how those symbols relate to other symbols and to individual experiences - especially common experiences, because those form a basis for communication. And LLMs capture all that.

    I sometimes say that LLMs are meaning made incarnate. That's because, to the extent you agree that the meaning of a concept is mostly defined through its mutual relations to other concepts[0], LLMs are structured to capture that meaning. That's what embedding tokens in a high-dimensional vector space is all about. You feed half the Internet to the model in training, force it first to continue known text, and eventually to generate continuations that make sense to a human, and because of how you do it, you end up with a latent space that captures those mutual relationships. In 10,000 dimensions, you can fit just about any possible semantic association one could think of, and then some.
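
    To make the vector-space idea concrete, here's a minimal sketch of "meaning as geometry". The three-dimensional vectors and the words attached to them are made up for illustration - real models learn embeddings with thousands of dimensions from huge corpora - but the mechanics are the same: related concepts end up pointing in similar directions.

    ```python
    # Toy illustration: these vectors are invented for this example; real
    # embeddings are learned from text, not written by hand.
    import numpy as np

    embeddings = {
        "chair": np.array([0.9, 0.1, 0.2]),
        "stool": np.array([0.8, 0.2, 0.3]),  # near "chair": related concept
        "tree":  np.array([0.1, 0.9, 0.0]),  # far from "chair": unrelated
    }

    def cosine(a, b):
        # Cosine similarity: ~1.0 means pointing the same way, ~0.0 unrelated.
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(embeddings["chair"], embeddings["stool"]))  # high (~0.98)
    print(cosine(embeddings["chair"], embeddings["tree"]))   # low (~0.21)
    ```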

    But even if you don't buy that LLMs "capture meaning", they wouldn't be as good as they are if they weren't able to reflect it. When you're reading LLM-produced tokens, you're not reading noise and imbuing it with meaning - you're reading a rich blend of half the things humanity ever wrote, you're seeing humankind reflected through a mirror, even if a very dirty and deformed one.

    In either case, the meaning is there - it comes from other people, a little bit of it from every piece of data in the training corpus.

    And this is where the contribution I originally described happens. We have a massive overproduction of content of every kind. Looking at just books - there are more books appearing every day than anyone could read in a lifetime; most of them are written for a quick buck, read by maybe a couple dozen people, and quickly forgotten. But should a book like this land in a training corpus, it becomes a contribution - an infinitesimal one, but still a contribution - to the model, making it a better mirror and a better tool. This is even more true for blog articles and Internet discussions - quickly forgotten by people, but living on in the model.

    --

    So again, I disagree about AI-generated tokens having no meaning. But I would agree there is no human connection there. You're still looking at (the output of) an embodiment of, or mirror to (pick your flavor), the whole humanity - but there is no human there to connect to.

    Also thanks for the example you used; I've never heard of HPPD before.

    --

    [0] - It's not a hard idea; it gets really apparent when you're trying to learn a second language via a same-language dictionary (e.g. an English word explained in English). But it also shows up in fields full of layers of explicitly defined terms, like most things STEM.

    It also gets apparent when you're trying to explain something to a 5yo (or a smartass friend) and they get inquisitive. "Do chairs always have four legs? Is this stool a chair? Is a tree stump a chair? ..."

    • > I disagree with that strongly. The LLM is obviously not a human or a person, but it's not a trivial token predictor, either.

      I am sorry, by no means do I think it is a trivial token predictor, or a stochastic parrot of some sort. I think it has a world model, and it can do theory of mind on us, but we cannot do theory of mind on it. It is capable of planning, as is visible from the biology-of-language-models paper.

      > So again, I disagree about AI-generated tokens having no meaning. But I would agree there is no human connection there

      What I argue is that language is uniquely human, and it is the way it is because of the human condition. I think we agree more than we disagree. I say that the meaning is 'halved'; it is almost as if you are talking to yourself, but the thoughts are coming from the void. This is the sound of one hand clapping, maybe - a thought that is not your own, but is.

      I guess what I am saying is that AI is much more Alien than Artificial, but we read the tokens as if they are deeply human, and it is really hard for people not to think of it as human, purely because it uses language in such a profound way.

> useful (vs. pulling ad money)

These are the only motivations? Authors want credit, which is stolen by the robber barons.

  • > These are the only motivations?

    No, just the major ones. But it's nice to be honest and consistent about those with your audience, and with yourself.

    If you just want to contribute something good to the world, being seen by LLMs in training and retrievable by them via search are both good things that strongly advance that goal. If you also want to make money and/or cred this way, then LLMs are interfering with that - but so do search engines and e-mail and copy/paste.

    It's unfortunate, but no one is actually stealing anything (unless a work gets regurgitated in full and without credit, which is an infrequent and unfortunate side effect, and pretty much doesn't happen anymore unless you go out of your way to cause it). Works are being read, interpreted, and understood (for some definition of that term), and then answers are provided based on this understanding. If that stops someone from reaching your page, that sucks, but that's been a factor since before LLMs too; intellectual property is not meant to be a monopoly on information.

    (Some of those complaints get even more absurd when they get extended to LLMs using tools. As designed and customary, when an LLM invokes search and uses the results from some page, it cites that page as a source, exposing the URL directly in at least two places - inline, and on the overall sources/citations list. Credit is not lost.)

Someone did a crude estimation dividing the value of OpenAI by the number of books plagiarized into it, and came up with an estimate on the order of $500k per book.
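
The arithmetic behind an estimate like that is a single division. To be clear, the figures below are hypothetical placeholders - the actual valuation and book count behind the $500k number aren't given here - chosen only to show the shape of the calculation:

```python
# Hypothetical figures, for illustration only - not the numbers the
# original estimate used.
valuation_usd = 150e9        # assumed company valuation, in USD
books_in_training = 300_000  # assumed number of books in the corpus

print(valuation_usd / books_in_training)  # 500000.0, i.e. ~$500k per book
```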

Of course, none of that vast concentration of investor money will go to the authors.

If the government was doing this, people would be screaming about the biggest nationalisation of intellectual property since the rise of Mao.

  • > Of course, none of that vast concentration of investor money will go to the authors.

    There's no reason it should. The authors don't get perpetual royalties from everyone who has read their works. Or do you believe I should divide my salary between Petzold, Stroustrup, Ousterhout, Abelson, Sussman, Norvig, Cormen, and a dozen other technical authors, and also between all HN users, proportionally to their comment count or karma?

    Should my employer pay them as well, and should their customers too, because you can trace a causal chain from some products to the people mentioned, through me?

    IP, despite its issues, does not work like that.

    > If the government was doing this, people would be screaming about the biggest nationalisation of intellectual property since the rise of Mao.

    Or call it the public education system and public library network.

    • > public education system and public library network

      Public libraries do pay royalties to authors for lending.

      I don't know; I've been on the side of weaker copyright: Aaron Swartz was driven to suicide, and sci-hub is one of the most blocked sites on the Internet. But now it turns out that IP is simply a matter of power. There isn't really a difference between sci-hub / libgen and the scraped training databases other than having money - which suddenly means the rules don't apply.

    • If you go that route and throw all conventions overboard, there is no reason why Microsoft and OpenAI shouldn't be nationalized. Without compensation.

      You know, for the "benefit of society", as these companies never tire of saying.

  • Do you happen to remember if that crude estimate assumed that only book authors should get paid, or if this was "total of x tokens, of which y are books, the books are of average length z"?

I see where you're coming from with that take, and I don't necessarily disagree - if these models were owned by "the people".

With the situation as it is right now, you're only contributing to some tech oligarch's ability to sell tokens to people.

I chose to put work into my writing and make it freely available on the internet. This isn't the same.