Comment by throwawaymaths
1 month ago
this is probably because the thinking tokens have the opportunity to store higher-level, summarized contextual reasoning (lookup-table-style associations) in those tokens' KV caches. so an "Ok so" at position X may carry summarization vibes that are distinct from the same "Ok so" at position Y.
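
a rough toy sketch of that point, not anything from a real model: the embeddings, the identity Q/K/V projections, and the names `EMBED`, `causal_attention_layer` and `layer_keys` are all made up for illustration. it just shows that with causal attention, the same token gets the same layer-1 key everywhere (no positional encoding assumed), but a different layer-2 key depending on what preceded it.

```python
import math

# Hypothetical 2-d toy embeddings; no positional encoding is applied.
EMBED = {"ok": [1.0, 0.0], "so": [0.0, 1.0], "wait": [1.0, 1.0]}

def causal_attention_layer(xs):
    """One causal self-attention layer with a residual connection.

    Assumption for brevity: queries, keys and values are the inputs
    themselves (identity projections).
    """
    out = []
    for i, q in enumerate(xs):
        # Each position may only attend to itself and earlier positions.
        scores = [sum(a * b for a, b in zip(q, xs[j])) for j in range(i + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        mixed = [
            sum(w / total * xs[j][d] for j, w in enumerate(weights))
            for d in range(len(q))
        ]
        out.append([a + b for a, b in zip(q, mixed)])
    return out

def layer_keys(tokens):
    """Return (layer-1 keys, layer-2 keys) for a token sequence."""
    x0 = [EMBED[t] for t in tokens]
    x1 = causal_attention_layer(x0)
    return x0, x1

tokens = ["ok", "so", "wait", "ok", "so"]
k1, k2 = layer_keys(tokens)

print(k1[0] == k1[3])  # same token, same layer-1 key
print(k2[0] == k2[3])  # same token, layer-2 key now depends on the prefix
```

in a real transformer the projections are learned and positional information is mixed in too, so the effect is much richer than this, but the mechanism is the same: deeper-layer K/V entries for a token are functions of everything before it.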