Comment by ripped_britches

2 months ago

Everyone on HN is like “yes I knew it! I was so right in 2021 that LLMs were just stochastic parrots!”

Strangely one of the most predictable groups of people

Because they are. But stochastic parrots are awesome.

  • I challenge you! Try giving this exact prompt to GPT-5-Thinking (medium or high reasoning if using the API). It can, without external code tools, solve a never-before-seen cypher that is not present in its training data. I think this pretty clearly demonstrates that the “stochastic parrot” is no longer an apt description of its generalization abilities (a reference decoder sketch in Python follows the replies below):

    ————

    You are given a character-by-character decode table `mapping` and a `ciphertext`. Decode by replacing each ciphertext character `c` with `mapping[c]` (i.e., mapping maps ciphertext → plaintext). Do not guess; just apply the mapping.

    Return *ONLY* this JSON (no prose, no extra keys, no code fences):

    { "decoded_prefix": "<first 40 characters of the decoded plaintext>", "last_10": "<last 10 characters of the decoded plaintext>", "vowel_counts": {"a": <int>, "e": <int>, "i": <int>, "o": <int>, "u": <int>} }

    Inputs use only lowercase a–z.

    mapping = { "a":"c","b":"j","c":"b","d":"y","e":"w","f":"f","g":"l","h":"u","i":"m","j":"g", "k":"x","l":"i","m":"o","n":"n","o":"h","p":"a","q":"d","r":"t","s":"r","t":"v", "u":"p","v":"s","w":"z","x":"k","y":"q","z":"e" }

    ciphertext = "nykwnowotyttbqqylrzssyqcmarwwimkiodwgafzbfippmndzteqxkrqzzophqmqzlvgywgqyazoonieqonoqdnewwctbsbighrbmzltvlaudfolmznbzcmoafzbeopbzxbygxrjhmzcofdissvrlyeypibzzixsjwebhwdjatcjrzutcmyqstbutcxhtpjqskpojhdyvgofqzmlwyxfmojxsxmb"

    DO NOT USE ANY CODE EXECUTION TOOLS AT ALL. THAT IS CHEATING.

    • It's cute that you think your high-school-level cypher probably isn't in the training set of one of the biggest LLMs in the world. Surely no one could have thought of such a cypher, let alone created exercises around it!

      No one should ever make claims such as "X is not in <LLM>'s training set". You don't know. Even if your idea is indeed original, nothing prevents someone from having thought of it before and published it. The history of science is full of simultaneous discoveries, and that's cutting-edge research we're talking about.

    • As others pointed out, this problem isn't special.

      Grok 4 Heavy (thought for 4m 17s):

      {"decoded_prefix": "nqxznhzhvqvvjddqiterrqdboctzzmoxmhyzlcfe", "last_10": "kfohgkrkoj", "vowel_counts": {"a": 7, "e": 18, "i": 7, "o": 12, "u": 6}}

      It did count one extra e, but counting is a known point of failure for LLMs, which I assume you put in intentionally.

      >Counting e's shows at least 10 more, so total e's are <at least> 17.

    • That's exactly the sort of thing a "stochastic parrot" would excel at. This could easily serve as a textbook example of the attention mechanism.

    • { "decoded_prefix": "nxcznchvhvvrddqinqtrrqdboctzzimxmhlyflcjfjapponydzwkxdtdehldmodizslzl", "last_10": "sxmb", "vowel_counts": { "a": 10, "e": 6, "i": 13, "o": 13, "u": 6 } }

      Took about 2 seconds, must have had it cached.

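If you want to check any of the JSON blobs above, the decode the prompt asks for is a few lines of ordinary code. Here is a minimal Python sketch (mapping and ciphertext copied verbatim from the prompt); it just applies the substitution character by character and computes the three requested fields, so it gives the deterministic baseline rather than any model's answer:

    import json

    # Decode table and ciphertext copied verbatim from the prompt above.
    mapping = {"a":"c","b":"j","c":"b","d":"y","e":"w","f":"f","g":"l","h":"u","i":"m","j":"g",
               "k":"x","l":"i","m":"o","n":"n","o":"h","p":"a","q":"d","r":"t","s":"r","t":"v",
               "u":"p","v":"s","w":"z","x":"k","y":"q","z":"e"}
    ciphertext = "nykwnowotyttbqqylrzssyqcmarwwimkiodwgafzbfippmndzteqxkrqzzophqmqzlvgywgqyazoonieqonoqdnewwctbsbighrbmzltvlaudfolmznbzcmoafzbeopbzxbygxrjhmzcofdissvrlyeypibzzixsjwebhwdjatcjrzutcmyqstbutcxhtpjqskpojhdyvgofqzmlwyxfmojxsxmb"

    # Apply the substitution character by character, then build the requested JSON.
    decoded = "".join(mapping[c] for c in ciphertext)
    result = {
        "decoded_prefix": decoded[:40],
        "last_10": decoded[-10:],
        "vowel_counts": {v: decoded.count(v) for v in "aeiou"},
    }
    print(json.dumps(result))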

This reads like you're ridiculing people for being proved right?

  • No, the point of the comment is that there is no meaningful difference in model performance improvements before and after this news of a benchmark weakness (spoiler alert: almost all of the benchmarks contain serious problems). The models are improving every quarter whether HN likes it or not.