Comment by cyanydeez

2 hours ago

There are benchmarks that have nothing to do with the training material and instead measure what the models are actually capable of, like reading code: https://needle-bench.cc/

Generally, you give the model a document, ask it to retrieve some subsection of that document, and then rate it on what it retrieved.

You can always find enough random documents, or create your own, to keep running these, and you can make the documents arbitrarily long. It's definitely a valid, non-maxxable context test.
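
For anyone who wants to try the idea at home, here's a rough sketch of that kind of test in Python. To be clear, this is not how needle-bench itself works (I haven't checked its internals). The function names, the filler vocabulary, the "secret passcode" needle and the `ask_model` callable are all made up for illustration. You'd plug in whatever function sends a prompt to your model and returns its text reply.

```python
import random
from typing import Callable

# Hypothetical filler vocabulary; any text without digits works here.
WORDS = ["river", "signal", "matrix", "garden", "copper", "window",
         "thread", "harbor", "pencil", "orbit", "lantern", "meadow"]

def build_haystack(needle: str, n_filler: int, seed: int = 0) -> tuple[str, int]:
    """Generate n_filler random sentences and insert the needle at a random index.

    Returns the full document and the sentence index where the needle landed.
    """
    rng = random.Random(seed)
    sentences = [
        " ".join(rng.choices(WORDS, k=8)).capitalize() + "."
        for _ in range(n_filler)
    ]
    pos = rng.randint(0, n_filler)  # inclusive on both ends
    sentences.insert(pos, needle)
    return " ".join(sentences), pos

def score(answer: str, expected: str) -> float:
    """Crude rating: 1.0 if the expected string appears in the answer, else 0.0."""
    return 1.0 if expected.lower() in answer.lower() else 0.0

def run_trial(ask_model: Callable[[str], str], n_filler: int, seed: int) -> float:
    """Build a fresh document, ask the model for the needle, and rate the reply.

    ask_model is an assumed placeholder: it takes a prompt string and
    returns the model's reply as a string.
    """
    secret = f"{random.Random(seed).randrange(10**6):06d}"
    needle = f"The secret passcode is {secret}."
    doc, _ = build_haystack(needle, n_filler, seed)
    prompt = doc + "\n\nWhat is the secret passcode mentioned in the document above?"
    return score(ask_model(prompt), secret)
```

Since the document and the secret are regenerated from the seed each time, there's nothing fixed to memorize, and bumping `n_filler` makes the context as long as you like. A real harness would probably also vary where the needle sits and use a less forgiving scorer than a substring match.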