Comment by e1g
16 hours ago
Context rot. My use case is iterating over a large codebase, which quickly grows the context. All LLMs degrade at larger context sizes, well below their published limits, but pro models degrade the least; R1 gets confused relatively quickly, despite its published numbers.
I think Fiction LiveBench captures some of those differences via a standardized benchmark: it spreads interconnected facts through an increasingly large context to see how well models can keep connecting the dots (similar to how, in codebases, related ideas are often spread across many files). A toy sketch of the idea is below.
https://fiction.live/stories/Fiction-liveBench-May-22-2025/o...
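To make the idea concrete, here is a minimal Python sketch of that style of test. It is not Fiction LiveBench's actual format; the names, the chained "reports to" facts, and the stub model are all assumptions for illustration. The core mechanic matches the description above: scatter a chain of facts through filler text and check whether the model can still trace the chain as the context grows.

    import random

    FILLER = "The weather that day was unremarkable. "  # padding sentence

    def build_prompt(num_facts: int, context_words: int) -> tuple[str, str]:
        """Scatter a chain of linked facts through filler text.

        The facts form a chain (A reports to B, B reports to C, ...),
        so answering the final question requires connecting every
        scattered fact. Returns (prompt, expected_answer).
        """
        names = [f"Person{i}" for i in range(num_facts + 1)]
        facts = [f"{names[i]} reports directly to {names[i + 1]}."
                 for i in range(num_facts)]

        # Interleave the facts at random positions within the filler.
        filler_count = max(context_words // len(FILLER.split()), num_facts)
        chunks = [FILLER] * filler_count
        positions = sorted(random.sample(range(filler_count), num_facts))
        for fact, pos in zip(facts, positions):
            chunks[pos] = fact + " "

        question = (f"Who is at the top of {names[0]}'s reporting chain? "
                    "Answer with just the name.")
        return "".join(chunks) + "\n" + question, names[-1]

    def run_benchmark(ask_model, sizes=(1_000, 10_000, 100_000), trials=5):
        """Score a model callable at increasing context sizes."""
        for words in sizes:
            correct = 0
            for _ in range(trials):
                prompt, answer = build_prompt(num_facts=8, context_words=words)
                if answer in ask_model(prompt):
                    correct += 1
            print(f"{words:>8} words: {correct}/{trials} correct")

    if __name__ == "__main__":
        # Stub model for demonstration; swap in a real LLM call.
        run_benchmark(lambda prompt: "Person8")

The pattern to watch for is the score staying flat for a strong model while falling off well before the published context limit for a weaker one, which matches my experience with R1.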