Comment by Majromax

4 months ago

> When I(short proof)=I(long proof), per-token average surprisal must be lower for the long proof than for the short proof. But since surprisal for a single token is simply -log P, that would mean that, on average, the shorter proof is made out of less probable tokens.

This assertion is intuitive, but it isn't true: the premise I(short proof)=I(long proof) fails when the long proof is not minimal, so the long proof's per-token surprisal can actually be larger.
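To make the quantities concrete, here is a minimal sketch with assumed numbers (not measurements): perplexity is exp of the mean per-token surprisal, so the quoted claim only holds if the two proofs really carry equal total surprisal.

```python
import math

def perplexity(total_surprisal_nats, n_tokens):
    # Perplexity = exp of the average per-token surprisal.
    return math.exp(total_surprisal_nats / n_tokens)

I = 20.0  # assumed total surprisal shared by both proofs, in nats
print(perplexity(I, 5))   # short proof, 5 tokens: exp(4.0) ~ 54.6
print(perplexity(I, 50))  # long proof, 50 tokens: exp(0.4) ~ 1.5

# But if the long proof's filler carries its own surprisal (e.g. randomly
# varying comments), I(long) > I(short) and the comparison can flip:
print(perplexity(I + 250.0, 50))  # exp(5.4) ~ 221, higher than the short proof
```

The interesting case is the third line: padding only lowers perplexity when the padding itself is nearly free to predict.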

For example, consider the "proof" of "list the natural numbers from 1 to 3, newline-delimited." The 'short proof' is:

"1\n2\n3\n" (Newlines escaped because of HN formatting)

Now, consider the alternative instruction to give a "long proof", "list the natural numbers from 1 to 3, newline-delimited using # for comments. Think carefully, and be verbose." Trying this just now with Gemini 2.5-pro (Google AI Studio) gives me:

"# This is the first and smallest natural number in the requested sequence.\n 1

# Following the first, this is the second natural number, representing the concept of a pair.\n 2

# This is the third natural number, concluding the specified range from 1 to 3.\n 3"

We don't have access to Gemini's per-token logits, but repeating the prompt gives different comments each time, so we can conclude that there is 'information' in the irrelevant commentary.

The author's point regains its truth, however, if we consider the space of all possible long proofs. Any individual 'long' proof has higher total surprisal than the short proof, but that's because there are so many more possible long proofs of approximately equal value. The shortest possible proof is a sharp minimum, but longer proofs are shallower and 'easier'.
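A back-of-envelope version of that counting argument, with assumed (illustrative) numbers: if the model spreads the answer's probability mass over N interchangeable long renderings, each one picks up an extra log N of surprisal, yet that total can still be thinly spread per token.

```python
import math

I = 20.0        # assumed total surprisal of the answer's content, nats
n_short = 5     # tokens in the unique short proof
N_long = 1e6    # assumed number of interchangeable long renderings
L_long = 100    # tokens in one long rendering

# Each long rendering has probability p/N, i.e. surprisal I + log(N).
per_token_short = I / n_short                     # 4.0 nats/token
per_token_long = (I + math.log(N_long)) / L_long  # ~0.34 nats/token

print(per_token_short, per_token_long)
```

So each long proof is individually less probable in total, while still being 'easier' per token - which is the shallow-minimum picture above.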

The author also misses a trick with:

> Prompted with “Respond only with code: How do you increment i by 1 in Python?”, I compared the two valid outputs: i += 1 has a perplexity of approximately 38.68, while i = i + 1 has a perplexity of approximately 10.88.

… in that they ignore the equally-valid 'i = 1 + i'.

Thanks so much for this challenge! I just ran the experiment with i = 1 + i and you're absolutely right - it breaks my theoretical framework (same semantic information, but much higher perplexity).

While setting this up, I realized I hadn't used chat templates in my original measurements (rookie mistake with an Instruct model!). Re-running with proper methodology completely flips the results - the terse version actually wins.

I'll add a correction note to the article once AWS/Medium comes back online, and will write a proper follow-up with all the corrected experiments. Your comment literally made the research better - thank you!