Comment by antichronology

15 hours ago

I watched an interview with one of the co-founders of Anthropic where his point was that although benchmarks saturate, they're still an important signal for model development.

We think the situation is similar here - one of the challenges is aligning the benchmark with the function of the models. Genomic benchmarks for gLMs and RNA foundation models have been very resistant to saturation.

I think in NLP the benchmarks are victims of their own success: models can be overfit to a particular benchmark really fast.

In genomics we're a bit behind. A good paper on this is DART-Eval, which defines levels of task complexity: https://arxiv.org/abs/2412.05430

In RNA the models work much better than in DNA prediction, but it's key to have benchmarks to measure progress.