Comment by forgotpwagain
18 hours ago
I am totally onboard with the premise (as a TechBio-adjacent person), and some of the approaches you're taking (focused domain-specific models like Orthrus, rather than massive foundation models like Evo2).
I'm curious about what your strategy is for data collection to fuel improved algorithmic design. Are you building out experimental capacity to generate datasets in house, or is that largely farmed out to partners?
We think Orthrus can be applied in a bunch of ways to both coding and non-coding RNA sequences, but it's fair to say we're currently more focused on RNA sequences than on non-coding parts of the genome like promoters and intergenic sequences.
For the data: Orthrus is pre-trained on annotation data rather than experimentally collected data, so our pre-training dataset is large by biological standards. It adds up to about 45 million unique sequences, and assuming roughly 1k tokens per sequence that's about 45 billion tokens.
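A quick back-of-the-envelope check of that corpus size (the 1k tokens/sequence figure is the average assumed above, not a measured value):

```python
# Rough pre-training corpus size for ~45M unique RNA sequences,
# assuming an average of ~1,000 tokens per sequence.
n_sequences = 45_000_000
tokens_per_sequence = 1_000  # assumed average; real transcripts vary widely
total_tokens = n_sequences * tokens_per_sequence
print(f"{total_tokens / 1e9:.0f}B tokens")  # → 45B tokens
```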
We're thinking about this as a large pre-training run over annotation data from RefSeq and GENCODE, in conjunction with more specialized orthology datasets that pool data across hundreds of species.
Then for specific applications we fine-tune or do linear probing for experimental prediction. For example, we can predict half-life using publicly available data from this excellent paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13...
Or translation efficiency: https://pubmed.ncbi.nlm.nih.gov/39149337/
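For concreteness, a minimal sketch of what linear probing looks like here: freeze the encoder, take one embedding per transcript, and fit a regularized linear model against the measured property. Everything below is a stand-in (random vectors instead of real Orthrus embeddings, synthetic half-life values), not the actual pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-in data: in practice `embeddings` would be one frozen Orthrus
# embedding per transcript, and `half_life` the measured values from
# the public dataset. Here both are synthetic for illustration.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 256))                 # 500 transcripts, 256-dim
true_w = rng.normal(size=256)
half_life = embeddings @ true_w + rng.normal(scale=0.1, size=500)

# Linear probe: a single ridge-regularized linear layer on frozen embeddings.
X_tr, X_te, y_tr, y_te = train_test_split(embeddings, half_life, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print(f"held-out R^2: {probe.score(X_te, y_te):.2f}")
```

The appeal of probing over full fine-tuning is that the encoder stays fixed, so one set of embeddings can serve many downstream prediction tasks.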
Eventually, as we ramp up our wet-lab data generation, we're thinking about what post-training looks like. There is an RL analog here that we could apply on top of these generalizable embeddings to reward "high quality samples".
There are some early attempts at post-training in bio, and I think it's a really exciting direction.