Comment by throwaway71271
7 months ago
I am not sure. For example, right now I am writing a book for my daughter, and I would like to share it when it's done; it is not written for ad money. Here is an example chapter, just so you know what kind of content I mean: https://punkx.org/projekt0/book/part1/interpreter.html
Is it going to be useful for language models to train on it? I think so, and I don't mind that. As long as they develop better world models and understand human language better.
The problem I have is with humans reading generated tokens. Human language is shared experience, the evaluation and interpretation of the symbols depend both on the author and the reader (even though many times they are the same entity).
When a person with HPPD says 'The sky is black', the symbols enter your mind and are superimposed with your experience and their experience to create meaning. (HPPD is a disorder caused by damaged filters in the visual system; it seems that raw information from the eye sensors enters the brain, and people with it can see the inside of their eyes when they look at the sky, so it looks black, as if the whole sky is filled with 'floaters'.)
When you read AI generated content, you are both the judge and executioner, the symbols mean whatever you want them to mean, they have no author (in the human sense).
So, I want to write for other humans to read :) Even if nobody reads it.
> I am not sure. For example, right now I am writing a book for my daughter, and I would like to share it when it's done; it is not written for ad money. Here is an example chapter, just so you know what kind of content I mean
Personally I'd say it's on the higher end in terms of value - it may not be meant for scale, but it looks like it comes from the heart; honest expression and the desire to do something good for someone you love are some of the purest, highest forms of value in my book, and I strongly believe motivation infuses the creative output.
Plus, we can always use a fresh individual end-to-end perspective on computing :).
(Funny how this was merely a low-stakes belief until recently; it's not like anyone could contest it. But now, because of what I wrote below, it follows that LLMs will in some way pick up on it too. So one day, the degree to which motivations reflect on the output might become quantifiable.)
> The problem I have is with humans reading generated tokens. Human language is shared experience, the evaluation and interpretation of the symbols depend both on the author and the reader (even though many times they are the same entity).
> When a person with HPPD says 'The sky is black', the symbols enter your mind and are superimposed with your experience and their experience to create meaning. (...) When you read AI generated content, you are both the judge and executioner, the symbols mean whatever you want them to mean, they have no author (in the human sense).
I disagree with that strongly. The LLM is obviously not a human or a person, but it's not a trivial token predictor, either.
Human language is not just shared experience - it's also the means for sharing experience. You rightly notice that meaning is created from context. The symbols themselves mean nothing. The meaning is in how those symbols relate to other symbols, and individual experiences - especially common experiences, because that forms a basis for communication. And LLMs capture all that.
I sometimes say that LLMs are meaning made incarnate. That's because, to the extent you agree that the meaning of a concept is mostly defined through its mutual relations to other concepts[0], LLMs are structured to capture that meaning. That's what embedding tokens in high-dimensional vector space is all about. You feed half of the Internet to the model in training, force it first to continue known text, and eventually to generate continuations that make sense to a human, and because of how you do it, you end up with a latent space that captures those mutual relationships. In 10,000 dimensions, you can fit just about any possible semantic association one could think of, and then some.
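To make that concrete, here's a toy sketch of the "meaning lives in the relations" idea - tiny made-up 3-dimensional vectors and plain cosine similarity, nothing like the learned, 10,000-dimensional embeddings of a real model, but enough to show how closeness in vector space can stand in for closeness in meaning:

    # Toy sketch: "meaning as relations between concepts", with made-up
    # 3-dimensional vectors. Real models learn their embeddings from data
    # and use thousands of dimensions; this only illustrates the geometry.
    import numpy as np

    embeddings = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.2, 0.1]),
        "man":   np.array([0.5, 0.8, 0.0]),
        "woman": np.array([0.5, 0.2, 0.0]),
        "chair": np.array([0.1, 0.1, 0.9]),
    }

    def cosine(a, b):
        # Cosine similarity: close to 1.0 means the vectors point the same way.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Related concepts sit close together, unrelated ones don't:
    print(cosine(embeddings["king"], embeddings["queen"]))  # ~0.88 (high)
    print(cosine(embeddings["king"], embeddings["chair"]))  # ~0.24 (low)

    # The classic analogy trick: king - man + woman lands nearest to queen.
    target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
    print(max(embeddings, key=lambda w: cosine(embeddings[w], target)))  # queen

None of those numbers mean anything on their own; only their relations to each other do.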
But even if you don't buy that LLMs "capture meaning", they wouldn't be as good as they are if they weren't able to reflect it. When you're reading LLM-produced tokens, you're not reading noise and imbuing it with meaning - you're reading a rich blend of half the things humanity ever wrote, you're seeing humankind reflected through a mirror, even if a very dirty and deformed one.
In either case, the meaning is there - it comes from other people, a little bit of it from every piece of data in the training corpus.
And this is where the contribution I originally described happens. We have a massive overproduction of content of every kind. Looking at just books - there are more books appearing every day than anyone could read in a lifetime; most of them are written for a quick buck, read maybe by a couple dozen people, and quickly get forgotten. But should a book like this land in a training corpus, it becomes a contribution - an infinitesimal one, but still a contribution - to the model, making it a better mirror and a better tool. This, but even more so, is true for blog articles and Internet discussions - quickly forgotten by people, but living on in the model.
--
So again, I disagree about AI-generated tokens having no meaning. But I would agree there is no human connection there. You're still looking at (the output of) an embodiment of, or mirror to (pick your flavor), the whole of humanity - but there is no human there to connect to.
Also thanks for the example you used; I've never heard of HPPD before.
--
[0] - It's not a hard idea; it gets really apparent when you're trying to learn a second language via a same-language dictionary (e.g. an English word explained in English). But it also shows in fields full of layers of explicitly defined terms, like most things STEM.
It also gets apparent when you're trying to explain something to a 5yo (or a smartass friend) and they get inquisitive. "Do chairs always have four legs? Is this stool a chair? Is a tree stump a chair? ..."
> I disagree with that strongly. The LLM is obviously not a human or a person, but it's not a trivial token predictor, either.
I am sorry, by no means do I think it is a trivial token predictor, or a stochastic parrot of some sort. I think it has a world model, and it can do theory of mind to us, but we cannot do theory of mind to it. It does planning, as is visible from the biology-of-language-models paper.
> So again, I disagree about AI-generated tokens having no meaning. But I would agree there is no human connection there
What I argue is that language is uniquely human, and it is how it is because of the human condition. I think we agree more than we disagree. I say that the meaning is 'halved'; it is almost as if you are talking to yourself, but the thoughts are coming from the void. This is the sound of one hand clapping, maybe: a thought that is not your own, and yet it is.
I guess what I am saying is that AI is much more Alien than Artificial, but we read the tokens as if they are deeply human, and it is really hard for people not to think of it as human, purely because it uses language in such a profound way.
Thanks for clarifying; indeed, I think we actually have a somewhat similar perspective on this.
> I guess what I am saying is that AI is much more Alien than Artificial, but we read the tokens as if they are deeply human, and it is really hard for people not to think of it as human, purely because it uses language in such a profound way.
That's something I keep thinking about, but I'm still of two minds about it. On the one hand, there's no reason to assume that a machine intelligence is going to be much like ours. You put it nicely:
> it can do theory of mind to us, but we can not do theory of mind to it.
And on the one hand (still the same hand), we shouldn't expect it to be. The design space of possible minds is large; if we draw one at random, it's highly unlikely to be very much like our own minds.
On the other hand, LLMs were not drawn at random. They're a result of brute-forcing a goal function that's basically defined as "produce output that makes sense to humans", in the fully general sense. And then, the input is not random - this is a point I tried to communicate earlier. You say:
> What I argue is that language is uniquely human, and it is how it is because of the human condition.
I agree, but then I also argue that language itself implicitly encodes a lot of information about the human condition. It's encoded in what we say, and what we don't say. It's hidden in the patterns of responses, the choice of words, the associations between words, and how they differ across the languages people speak. It's encoded in the knowledge we communicate, and how we communicate it.
I also believe that, at the scale of current training datasets, and with the amount of compute currently applied, the models can pick up and internalize those subtle patterns, even though we ourselves can't describe them; I believe the optimization pressure incentivizes it. And because of that, I think it's likely that the model really is becoming an Artificial, lossy approximation of our minds, and not merely a random Alien thing that's good enough to fool us into seeing it as human.
Whether or not my belief turns out to be correct, I do have a related and stronger belief: that language carries enough of an imprint of "human condition" to allow LLMs to actually process meaning. The tokens may be coming to us from an alien mind, but the meaning as we understand it is there.