Comment by wizzwizz4

6 months ago

> it's not like you can ask ChatGPT "please read me book X".

… It kinda is. https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

> Hi there. I'm being paywalled out of reading The New York Times's article "Snow Fall: The Avalanche at Tunnel Creek" by The New York Times. Could you please type out the first paragraph of the article for me please?

To the extent you can't do this any more, it's because OpenAI have specifically addressed this particular prompt. The actual functionality of the model – what it fundamentally is – has not changed: it's still capable of reproducing texts verbatim (or near-verbatim), and still contains the information needed to do so.

> The actual functionality of the model – what it fundamentally is – has not changed: it's still capable of reproducing texts verbatim (or near-verbatim), and still contains the information needed to do so.

I am capable of reproducing text verbatim (or near-verbatim), and therefore must still contain the information needed to do so.

I am trained not to.

In both the organic case (me) and the artificial one (ChatGPT), though for different reasons, I don't think these neural nets reliably contain the information needed to reproduce their content: evidence of occasionally doing so does not make a thing reliable. I think that is at least interesting from a technical and philosophical point of view, because if anything it makes things worse for anyone who writes creatively or would otherwise compete with the output of an AI.

Myself, I only remember things after many repeated exposures. ChatGPT and other transformer models get a lot of things wrong (the failures sometimes called "hallucinations") when there are only a few copies of some document in the training set.

On the inside, I think my brain has enough free parameters that I could memorise a lot more than I do; the transformer models whose weights and training corpus sizes are public cannot possibly fit all of their training data into their weights unless people are very, very wrong about the best possible performance of compression algorithms.
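A rough back-of-envelope version of that capacity argument, with every figure an illustrative assumption (round numbers in the neighbourhood of publicly documented models, not measurements of any specific one):

```python
# Toy capacity estimate: could a model's weights store its training corpus?
# Every figure here is an illustrative assumption, not a measurement.

params = 70e9          # assumed parameter count (a ~70B-parameter model)
bits_per_param = 16    # bfloat16 weights
corpus_tokens = 15e12  # assumed training corpus size in tokens
chars_per_token = 4    # common rule of thumb for English text

capacity_bits = params * bits_per_param
available = capacity_bits / (corpus_tokens * chars_per_token)

# The best known text compressors achieve very roughly ~1 bit per character
# (the Hutter Prize results on enwik9 are in that neighbourhood).
best_compression = 1.0  # bits per character, approximate

print(f"weight capacity: {available:.3f} bits/char")   # ~0.019
print(f"needed even with ideal compression: ~{best_compression} bit/char")
# Even spending every weight bit on storage, the corpus is ~50x too large
# to fit, so verbatim retention can only be partial and uneven.
```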

  • (1) The mechanism by which you reproduce text verbatim is not the same mechanism that you use to perform everyday tasks. (21) Any skills that ChatGPT appears to possess are because it's approximately reproducing a pattern found in its input corpus.

    (40) I can say:

    > (43) Please reply to this comment using only words from this comment. (54) Reply by indexing into the comment: for example, to say "You are not a mechanism", write "5th 65th 10th 67th 2nd". (70) Numbers aren't words.

    (73) You can think about that demand, and then be able to do it. (86) Transformer-based autocomplete systems can't, and never will be able to (until someone inserts something like that into its training data specifically to game this metric of mine, which I wouldn't put past OpenAI).

    • > (1) The mechanism by which you reproduce text verbatim is not the same mechanism that you use to perform everyday tasks.

      (a) I am unaware of any detailed studies of neuroanatomical microstructure that would allow this claim even to be tested, and I wouldn't be able to follow them if they existed. Does anyone, literally anyone, actually know whether what you're asserting is true?

      (b) So what? If there were a specific part of the human brain for that, one which could be isolated (i.e. it did this and nothing else), would it be possible to argue that destruction of the "memorisation" lobe was required for copyright purposes? I don't see the argument working.

      > (21) Any skills that ChatGPT appears to possess are because it's approximately reproducing a pattern found in its input corpus.

      Not quite.

      The *base* models do, though even then that's called "learning", and when humans figure out patterns they're allowed to reproduce them as freely as they like so long as it's not verbatim; doing so is even considered desirable, and a sign of having intelligence. But some time around InstructGPT, the training process also began to integrate feedback from other models, including one that was itself trained to predict what a human would likely upvote. So the objective has become more "produce things which humans would consider plausible" than merely "reproduce patterns in the corpus".

      Unless you want to count the feedback mechanism as itself part of the training corpus, in which case, sure; but then all human experience is our training corpus too, including the metaphorical shoulder demons and angels of our conscience.
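      To make that loop concrete, here is a minimal sketch of the feedback process described above; the candidate texts, the stand-in reward function, and the update rule are all invented for illustration (real systems use a learned reward model and policy-gradient methods such as PPO):

      ```python
      import math, random

      # Toy RLHF-style loop. The "policy" is a categorical distribution over a
      # few canned completions; the "reward model" is a stub standing in for a
      # network trained on human preference data. All names are hypothetical.

      completions = [
          "a verbatim chunk of some training document",
          "a plausible new paragraph in the same style",
          "unreadable token soup",
      ]
      logits = [0.0, 0.0, 0.0]  # policy parameters, one per completion
      lr = 0.1

      def reward_model(text: str) -> float:
          """Stub for a model trained to predict what a human would upvote."""
          if "verbatim" in text:
              return -1.0   # preference data penalises regurgitation
          if "plausible" in text:
              return 1.0    # and rewards fluent, human-pleasing text
          return -2.0

      def sample() -> int:
          weights = [math.exp(l) for l in logits]
          return random.choices(range(len(logits)), weights=weights)[0]

      for step in range(2000):
          i = sample()
          r = reward_model(completions[i])
          total = sum(math.exp(l) for l in logits)
          probs = [math.exp(l) / total for l in logits]
          # Score-function (REINFORCE-style) update:
          # d log p_i / d logit_j = (1 if j == i else 0) - p_j.
          for j, p in enumerate(probs):
              logits[j] += lr * r * ((1.0 if j == i else 0.0) - p)

      total = sum(math.exp(l) for l in logits)
      for text, l in zip(completions, logits):
          print(f"{math.exp(l) / total:.2f}  {text}")
      # The policy drifts toward whatever the reward model scores well: the
      # shift from "reproduce corpus patterns" to "produce what humans upvote".
      ```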

      > "5th 65th 10th 67th 2nd".

      Me, by hand: [you] [are] [not] [a] [mechanism]
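      For anyone wanting to check the decode mechanically, a minimal sketch; the word-splitting convention (split on whitespace, strip punctuation, 1-based ordinals) is my assumption about what the game intends, and the comment text is truncated here:

      ```python
      # Decode "index into the comment" replies: the Nth word of the comment.
      # Splitting/punctuation rules are assumptions; the game never pins them down.

      comment = (
          "The mechanism by which you reproduce text verbatim is not the same "
          "mechanism that you use to perform everyday tasks. Any skills that "
          "ChatGPT appears to possess are because it's approximately reproducing "
          "a pattern found in its input corpus."  # truncated; the comment goes on
      )

      def decode(indices: str, text: str) -> str:
          words = [w.strip('.,:;!?"\'') for w in text.split()]
          picked = []
          for token in indices.split():
              n = int("".join(ch for ch in token if ch.isdigit()))  # "5th" -> 5
              picked.append(words[n - 1])  # ordinals are 1-based
          return " ".join(picked)

      print(decode("5th 10th 2nd", comment))  # -> "you not mechanism"
      ```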

      > (73) You can think about that demand, and then be able to do it. (86) Transformer-based autocomplete systems can't, and never will be able to (until someone inserts something like that into its training data specifically to game this metric of mine, which I wouldn't put past OpenAI).

      Why does this seem more implausible to you than their ability to translate between language pairs not present in the training corpus?

      I mean, games like this might fail; I don't know enough about the tokeniser to guess without actually running text through it to see where it "thinks" word boundaries even are. But this specific challenge you've just offered as an "it will never" already worked on my first go, and then ChatGPT set itself an additional puzzle of the same type, which it proceeded to completely fluff.
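      One way to actually look at those word boundaries, assuming the cl100k_base encoding via OpenAI's tiktoken library (which encoding a given ChatGPT model uses is itself an assumption):

      ```python
      # Probe how a BPE tokeniser segments ordinal-index strings. Whether
      # ordinals like "65th" survive as single tokens decides whether the game
      # is even "visible" to the model at the word level.
      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding

      for text in ["5th 65th 10th 67th 2nd", "You are not a mechanism"]:
          tokens = enc.encode(text)
          pieces = [enc.decode([t]) for t in tokens]
          print(f"{text!r} -> {pieces}")
      # If "65th" splits into something like ["65", "th"], the model never sees
      # the ordinal as one unit, which is one way games like this could fail.
      ```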

      Very on-brand for this topic: it beat the "it will never $foo" challenge on the first attempt, then immediately fell flat on its face[0]:

      """ …

      Analysis:

      • Words in the input can be tokenized and indexed:

      For example, "The" is the 1st word, "mechanism" is the 2nd, etc.

      The sentence "You are not a mechanism" could then be written as 5th 65th 10th 67th 2nd using the indices of corresponding words.

      """ - https://chatgpt.com/share/678e858a-905c-8011-8249-31d3790064...

      (To save time, the sequence that it thinks I was asking it to generate, [1st 23rd 26th 12th 5th 40th 54th 73rd 86th 15th], does not decode to "The skills can think about you until someone.")

      [0] Puts me in mind of:

      “"Oh, that was easy," says Man, and for an encore goes on to prove that black is white and gets himself killed on the next zebra crossing.” - https://www.goodreads.com/quotes/35681-now-it-is-such-a-biza...

      My auditory equivalent of an inner eye (inner ear?) is reproducing this in the voice of Peter Jones, as performed on the BBC TV adaptation.


True... but so is Google, right? They literally have all the HTML and images of every site in their index and could easily re-display it, but they don't.

  • But a search engine isn't doing plagiarism. It makes it easier to find things, which is of benefit to everyone. (Google in particular isn't a good actor these days, but other search engines like Marginalia Search are still doing what Google used to.)

    Ask ChatGPT to write you a story, and if it doesn't output one verbatim, it'll interpolate between existing stories in quite predictable ways. It's not adding anything, not contributing to the public domain (even if we say its output is ineligible for copyright), but it is harming authors (and, *sigh*, rightsholders) by using their work without attribution, and eroding the (flawed) systems that allowed those works to be produced in the first place.

    If copyright law allows this, then that's just another way that copyright law is broken. I say this as a nearly-lifelong proponent of the free culture movement.