Comment by tennysont
Fun to see talk of "a compiler for DNA"---I've been hoping for that for a long time.
I have to admit, at a _glance_ this feels like a promising idea with few results and lots of marketing. I'll try to be clear about my confusion, feel free to explain if I'm off base.
- There's not a lot of talk of your "ground truth" for evaluations. Are you using mRNABench?
- Has your mRNABench paper been peer reviewed? You linked a preprint. (I know paper submission can be tough and stressful, and it's a superficial metric to be judged on!)
- Do any of your results suggest that this foundation model might be any good on out-of-distribution mRNA sequences? If not, is the (current) model meant to predict properties of natural mRNA sequences rather than synthetic mRNA sequences?
- Did many of the mRNA sequences have experimental verification of their predicted properties? At a quick glance, I see this 66 number in the paper---but I truly have no idea.
I'm super happy to praise both incremental progress and putting forth a vision, I just also want to have a clear understanding of the current state-of-the-art as well!
> ground truth
Hey yes, the ground truth for our evaluations is measured experimental data. Our models are benchmarked using mRNABench, which aggregates results from high-throughput wet lab experiments.
Our goal, however, is to move beyond predicting existing experimental outcomes. We intend to design novel sequences and validate their function in our own lab. At that stage, the functional success of the RNA we design will become the ground truth.
> peer reviewed?
Both mRNABench and Orthrus are in submission (to a big ML conference and a big-name journal) - unfortunately the academic system moves slowly, but we're working on getting them out there.
> synthetic mRNA sequences
I think you're asking about generalizing out of distribution to unnatural sequences. There are two ways we test this: (1) There are screens called Massively Parallel Reporter Assays (MPRAs), and we evaluate, for example, on https://pubmed.ncbi.nlm.nih.gov/31267113/
Here all the sequences are synthetic and randomly designed, and we do observe generalization. Ultimately it depends on the problem we're tackling: some tasks, like gene therapy design, require endogenous sequences.
(2) The other angle is variant effect prediction (VEP). It can be thought of as a counterfactual prediction problem where you ask the model whether a small change in the input predicts a large change in the output. A good example is this study: https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2
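To make the counterfactual framing concrete, here is a minimal sketch. The `predict_property` function is a hypothetical scorer (e.g. a probe on top of the foundation model's embedding), not our actual interface:

```python
# Sketch of VEP as a counterfactual comparison. `predict_property` is a
# hypothetical scorer over a single RNA sequence; purely illustrative.
def variant_effect(predict_property, ref_seq: str, position: int, alt_base: str) -> float:
    """Predicted change in a property caused by a single-base substitution."""
    variant_seq = ref_seq[:position] + alt_base + ref_seq[position + 1:]
    return predict_property(variant_seq) - predict_property(ref_seq)

# A small edit that produces a large predicted effect is exactly the
# "small change in, large change out" signal VEP asks the model to capture.
```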
> experimental verification of their predicted properties
All our model evaluations are predictions of experimental results! The datasets we use are collections of wet lab measurements, so the model is constantly benchmarked against ground-truth biology.
The evaluation method involves fitting a linear probe on the model's learned embeddings to predict the experimental signal. This directly tests whether the model's learned representation of an RNA sequence contains a linear combination of features that can predict its measured biological properties.
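As a rough sketch of what that probing looks like in practice (the `embed` function here is a stand-in for however a model produces a fixed-length representation of a sequence; this is illustrative, not the mRNABench code):

```python
# Linear-probe evaluation sketch: freeze the model, embed each sequence,
# and fit a linear regressor against the measured signal. `embed` is a
# hypothetical function returning a fixed-length vector per sequence.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def linear_probe_score(embed, sequences, measurements, seed=0):
    X = np.stack([embed(seq) for seq in sequences])   # frozen embeddings
    y = np.asarray(measurements, dtype=float)         # wet-lab measurements
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)          # linear combination of features
    r, _ = pearsonr(probe.predict(X_te), y_te)
    return r                                          # held-out Pearson correlation
```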
Thanks for the feedback - I understand the caution around preprints. We believe a self-supervised learning approach is well suited to this problem because it allows the model to first learn patterns from millions of unlabeled sequences before being fine-tuned on specific, and often smaller, experimental datasets.
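For intuition, a generic two-stage skeleton might look like the following. Masked-token pretraining is used here purely as an illustrative self-supervised objective; the model, tokenization, and objective are assumptions for the sketch, not the published Orthrus setup:

```python
# Stage 1: self-supervised pretraining on unlabeled sequences.
# Stage 2: supervised fine-tuning on a small labeled experimental dataset.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "<mask>": 4}

class TinyRNAEncoder(nn.Module):
    def __init__(self, d_model=64, vocab_size=len(VOCAB)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mlm_head = nn.Linear(d_model, vocab_size)  # used during pretraining
        self.reg_head = nn.Linear(d_model, 1)           # used during fine-tuning

    def forward(self, tokens):
        return self.encoder(self.embed(tokens))         # (batch, length, d_model)

def pretrain_step(model, tokens, optimizer, mask_prob=0.15):
    """Mask random positions and train the model to recover them."""
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.masked_fill(mask, VOCAB["<mask>"])
    logits = model.mlm_head(model(corrupted))
    loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(model, tokens, labels, optimizer):
    """Regress a measured property from the mean-pooled representation."""
    pooled = model(tokens).mean(dim=1)
    loss = nn.functional.mse_loss(model.reg_head(pooled).squeeze(-1), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```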
Hi, I'm the lead author of the human 5' UTR paper. It was a nice surprise seeing it linked on HN and I'm happy to see that it's providing value for you all. Looking forward to watching your team's progress!
Huge fan of the work! I always enjoy papers from the Seelig lab :)
> mRNABench
Just curious: in other areas of ML, I think it's widely acknowledged that benchmarks have pretty limited real-world value, end up getting saturated, and (my view) are all pretty correlated regardless of their ostensible specialty, so they don't really tell you that much.
Do you think mRNABench is different, or where do you see the limitations? Do you imagine this or any benchmark will be useful for anything beyond comparing how different models do on the benchmark?
I watched an interview with one of the co-founders of Anthropic where his point was that although benchmarks saturate, they're still an important signal for model development.
We think the situation is similar here - one of the challenges is aligning the benchmark with the function of the models. Genomic benchmarks for gLMs and RNA foundation models have been very resistant to saturation.
I think in NLP benchmarks are victims of their own success: models can be overfit to particular benchmarks really fast.
In genomics we're a bit behind. A good paper on this is DART-Eval, where they provide levels of complexity: https://arxiv.org/abs/2412.05430
In RNA, the models work much better than in DNA prediction, but it's key to have benchmarks to measure progress.
Here is the link to the part of the interview about benchmarks and their utility: https://youtu.be/JdT78t1Offo?t=1444
"We have internal benchmarks. Yeah. But we don't we don't publish them."
"we have internal benchmarks that the team focuses on and improving and then we also have a bunch of tasks like I think that accelerating our own engineers is like a top top priority for us"
The equivalent for us would be to ultimately improve experimental results. Benchmarks are a good intermediate point, but not the ultimate goal.