
Comment by FromTheFirstIn

4 days ago

My understanding is that the true entropy floor of a language is intractable: regardless of context length, there will be “unpredictable” tokens where some cross-entropy loss is unavoidable. Even with infinite parameters and data, you’ll still fail to predict the next token correctly a decent chunk of the time.

Also, attention compute scales quadratically with context length, so even linear gains in context get expensive fast. Depending on context growth means taking a bath on GPUs for as long as you can, right?
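To put rough numbers on the quadratic point, here is a back-of-the-envelope sketch. It assumes vanilla dense self-attention and counts only the two n-by-n matrix multiplies in a single layer. The function name and the `d_model` value are made up for illustration, and projections and MLP blocks (which grow linearly with context) are ignored.

```python
def attention_matmul_flops(context_len: int, d_model: int) -> int:
    """Approximate FLOPs for the two n-by-n matmuls in one dense attention layer.

    Q @ K^T costs roughly 2 * n^2 * d, and the weighted sum over V costs
    roughly another 2 * n^2 * d. Everything else in the layer is ignored.
    """
    return 4 * context_len ** 2 * d_model


# Hypothetical model width; each doubling of context quadruples the estimate.
for n in (1024, 2048, 4096):
    print(n, attention_matmul_flops(n, d_model=4096))
```

This only covers the attention term, so the whole-model cost grows more gently at short contexts and approaches quadratic growth as context gets long.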

Yeah, I mean, if you and I were to play a word-guessing game where you had to guess which word I'm thinking of next, there's always uncertainty in your guess because it's a game of partial information: you can't fully observe my inner state. But that doesn't mean you couldn't evolve a strategy that spends a really long time thinking and analyzing to get asymptotically close to the best guess. There's no limit on that intelligence.

  • Isn’t the limit exactly what you’re describing? There’s always uncertainty, so your strategy can approach the asymptote but never get past it. That’s the limit to the intelligence. And this is just for cross-entropy loss: even if you could get loss to 0, I’m still not convinced at all that an enormous semantic map and its convoluted geometries amounts to intelligence.

    • If you get to E, the loss floor, you have a Bayes-optimal model of the conditional distribution (that is, next token conditional on context). This is something I thought too, but even if you're a fraction of a nat above the floor, you could have enormous headroom in performance left, because among the irreducible noise there are still rare tokens that take a lot of capability to predict. That's not to suggest there truly is no cap on capability, just that this constant doesn't really tell you what the cap is.
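A toy calculation makes the "fraction of a nat" point concrete. This is only a sketch under invented assumptions: two contexts, a three-token vocabulary, and made-up probabilities and weights. The common context is inherently noisy, while the rare context has one correct answer that a weak model cannot work out.

```python
import math


def cross_entropy(p, q):
    """Expected loss in nats when tokens follow p and the model predicts q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical common context: the next token is genuinely uncertain.
p_common = [0.7, 0.2, 0.1]
# Hypothetical rare context: one right answer, but hard to infer.
p_rare = [1.0, 0.0, 0.0]
w_common, w_rare = 0.999, 0.001  # assumed share of each context in the data

# Loss of the Bayes-optimal predictor, which outputs the true distribution.
floor = (w_common * cross_entropy(p_common, p_common)
         + w_rare * cross_entropy(p_rare, p_rare))

# A model that is exact on the common context but guesses uniformly on the rare one.
q_rare_clueless = [1 / 3, 1 / 3, 1 / 3]
loss = (w_common * cross_entropy(p_common, p_common)
        + w_rare * cross_entropy(p_rare, q_rare_clueless))

excess = loss - floor
print(f"floor={floor:.4f} nats, loss={loss:.4f} nats, excess={excess:.4f} nats")
```

The floor stays well above zero because of the noisy common context, while the clueless model sits only about a thousandth of a nat above it despite getting the rare context right just a third of the time.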
