Comment by SpicyLemonZest

2 months ago

It clearly is an artifact of tokenization, but I don’t think it’s a “just”. The point is precisely that the GPT system architecture cannot reliably close the gap here. It’s almost able to count the number of Bs in a string: there’s no fundamental reason you could not build a correct number-of-Bs mapping for tokens, and indeed it often gets the right answer. But when it doesn’t, you can’t always correct it with things like chain-of-thought reasoning.

This matters because it poses a big problem for the (quite large) category of things where people expect LLMs to become useful once they get just a bit better. Why, for example, should I assume that modern LLMs will ever be able to write reliably secure code? Isn’t it plausible that the gap between secure and almost-secure code runs into a similar problem?
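To make the “number-of-Bs mapping for tokens” concrete, here’s a minimal sketch. It assumes the `tiktoken` package and its `cl100k_base` encoding, neither of which is specified above; any tokenizer whose vocabulary you can decode back to bytes would do. The mapping is just a per-token lookup summed over the tokenization:

```python
# Sketch: count a letter by mapping each token to its letter count and summing.
# Assumes the tiktoken package; "cl100k_base" is an arbitrary encoding choice.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_letter(text: str, letter: str = "b") -> int:
    """Sum a per-token letter count over the token IDs of `text`."""
    total = 0
    for token_id in enc.encode(text):
        token_bytes = enc.decode_single_token_bytes(token_id)
        total += token_bytes.decode("utf-8", errors="ignore").lower().count(letter)
    return total

print(count_letter("blueberry"))        # 2
print(count_letter("strawberry", "r"))  # 3
```

The lookup is exact by construction; the point being made above is that the model has to learn an approximation of this mapping implicitly rather than consult it, which is why it only almost works.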

It's like someone has given a bunch of young people hundreds of billions of dollars to build a product that parses HTML documents with regular expressions.

It's not in their interest to write off the scheme as provably unworkable at scale, so they keep working on the edge cases until their options vest.

> cannot reliably close the gap here

Have you got any proof they're even trying? It's unlikely that's something their real customers are paying for.

  • I tried to reproduce it again just now, and ChatGPT 5 seems to be a lot more meticulous about running a Python script to double-check its work, which it tells me is because its system prompt warns it to do so. I don't know if that's proof (or even whether ChatGPT reliably tells the truth about what's in its system prompt), but given what OpenAI does and doesn't publish, it's the closest I could reasonably expect.
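For what it's worth, that kind of double-check doesn't need the tokenizer at all; a character-level script along these lines (hypothetical, since the actual script ChatGPT runs isn't published) settles it:

```python
# Hypothetical example of the kind of double-check described above:
# counting characters directly sidesteps tokenization entirely.
word = "blueberry"
print(f"{word!r} contains {word.lower().count('b')} b's")  # 2
```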