Comment by e1g
16 hours ago
Context rot. My use case is iterating over a large codebase, which quickly grows the context. All LLMs degrade at larger context sizes, well below their published limits, but pro models degrade the least; R1 gets confused relatively quickly, despite its published numbers.
I think Fiction LiveBench captures some of those differences via a standardized benchmark: it spreads interconnected facts through an increasingly large context to see how well models can keep connecting the dots (similar to how, in codebases, related ideas are often spread across many files). A toy sketch of the idea is below.
https://fiction.live/stories/Fiction-liveBench-May-22-2025/o...
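To make the idea concrete, here is a minimal Python sketch of that style of test. It is not Fiction LiveBench's actual format; the names, the chained "reports to" facts, and the stub model are all assumptions for illustration. The core mechanic matches the description above: scatter a chain of facts through filler text and check whether the model can still trace the chain as the context grows.

    import random

    FILLER = "The weather that day was unremarkable. "  # padding sentence

    def build_prompt(num_facts: int, context_words: int) -> tuple[str, str]:
        """Scatter a chain of linked facts through filler text.

        The facts form a chain (A reports to B, B reports to C, ...),
        so answering the final question requires connecting every
        scattered fact. Returns (prompt, expected_answer).
        """
        names = [f"Person{i}" for i in range(num_facts + 1)]
        facts = [f"{names[i]} reports directly to {names[i + 1]}."
                 for i in range(num_facts)]

        # Interleave the facts at random positions within the filler.
        filler_count = max(context_words // len(FILLER.split()), num_facts)
        chunks = [FILLER] * filler_count
        positions = sorted(random.sample(range(filler_count), num_facts))
        for fact, pos in zip(facts, positions):
            chunks[pos] = fact + " "

        question = (f"Who is at the top of {names[0]}'s reporting chain? "
                    "Answer with just the name.")
        return "".join(chunks) + "\n" + question, names[-1]

    def run_benchmark(ask_model, sizes=(1_000, 10_000, 100_000), trials=5):
        """Score a model callable at increasing context sizes."""
        for words in sizes:
            correct = 0
            for _ in range(trials):
                prompt, answer = build_prompt(num_facts=8, context_words=words)
                if answer in ask_model(prompt):
                    correct += 1
            print(f"{words:>8} words: {correct}/{trials} correct")

    if __name__ == "__main__":
        # Stub model for demonstration; swap in a real LLM call.
        run_benchmark(lambda prompt: "Person8")

The pattern to watch for is the score staying flat for a strong model while falling off well before the published context limit for a weaker one, which matches my experience with R1.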