I wish there were some breakthrough in cell simulation that would allow us to create simulations that are similarly useful to molecular dynamics but feasible on modern supercomputers. Not being able to see what's happening inside cells seems like the main blocker to biological research.
Molecular dynamics describes very short, very small dynamics, on the scale of nanoseconds and angstroms (0.1 nm).
What you’re describing is more like whole cell simulation. Whole cells are thousands of times larger than a protein and cellular processes can take days to finish. Cells contain millions of individual proteins.
So that means we just can't simulate all the individual proteins; it's way too costly and might permanently remain that way.
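To put rough numbers on why it's costly, here's a back-of-envelope sketch in Python; the atom counts, timestep, and throughput are order-of-magnitude assumptions for illustration, not measurements of any particular system:

```python
# Why all-atom MD of a whole cell is out of reach: order-of-magnitude assumptions only.
protein_system_atoms = 1e5     # typical solvated single-protein MD system
cell_atoms = 1e11              # small bacterial cell incl. water (~1 um^3), rough guess
md_timestep_s = 2e-15          # 2 femtoseconds per MD step
target_biology_s = 60.0        # one minute of cell-scale dynamics

steps_needed = target_biology_s / md_timestep_s        # ~3e16 integration steps
relative_cost = cell_atoms / protein_system_atoms      # ~1e6x more atoms per step

print(f"steps needed: {steps_needed:.1e}")
print(f"per-step cost vs a single-protein system: ~{relative_cost:.0e}x")
# Even at a generous 100 ns/day of sampling for the protein-sized system, one minute
# of whole-cell dynamics would take ~(60 / 100e-9) * 1e6 system-days of compute.
```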
The problem is that biology is insanely tightly coupled across scales. Cancer is the prototypical example. A single mutated letter in DNA in a single cell can cause a tumor that kills a blue whale. And it works the other way too: big changes like changing your diet get funneled down to epigenetic molecular changes to your DNA.
Basically, we have to at least consider molecular detail when simulating things as large as a whole cell. With machine learning tools and enough data we can learn some common patterns, but I think both physical and machine learned models are always going to smooth over interesting emergent behavior.
Also you’re absolutely correct about not being able to “see” inside cells. But, the models can only really see as far as the data lets them. So better microscopes and sequencing methods are going to drive better models as much as (or more than) better algorithms or more GPUs.
> A single mutated letter in DNA in a single cell can cause a tumor that kills a blue whale.
Side note: whales rarely get cancer.
https://en.wikipedia.org/wiki/Peto's_paradox
https://www.youtube.com/watch?v=1AElONvi9WQ
Scales can also decouple from each other. Complex trait genetic variation at the whole genome level acts predominantly in an additive fashion even though individual genes and variants have clearly non-linear epistatic interactions.
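A toy illustration of that decoupling (my own sketch, not from any real dataset): build a purely epistatic genotype-to-phenotype map, then check how much of the resulting population-level variance a plain additive model captures.

```python
# Toy sketch: strongly epistatic gene action can still yield mostly additive variance
# at the population level, especially with a U-shaped allele frequency spectrum.
import numpy as np

rng = np.random.default_rng(1)
n, m = 50_000, 40
freqs = rng.beta(0.2, 0.2, m) * 0.96 + 0.02          # U-shaped, bounded away from 0/1
G = rng.binomial(2, freqs, size=(n, m)).astype(float)  # allele counts 0/1/2

# Purely pairwise-epistatic trait: weighted sum of products of raw allele counts.
pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
w = rng.uniform(0.5, 1.5, len(pairs))
y = sum(wk * G[:, i] * G[:, j] for wk, (i, j) in zip(w, pairs))

# Variance explained by a purely additive (linear-in-allele-count) model.
X = np.c_[np.ones(n), G]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
frac_additive = 1 - np.var(y - X @ beta) / np.var(y)
print(f"fraction of genetic variance that is additive: {frac_additive:.2f}")
```

In runs like this the additive fit typically soaks up most of the variance, which is the usual quantitative-genetics argument for why additive GWAS models work despite pervasive epistasis.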
Simulating the real world at increasingly accurate scales is not that useful, because in biology - more than any other field - our assumptions are incorrect/flawed most of the time. The most useful thing simulations allow us to do is directly test those assumptions and in these cases, the simpler the model the better. Jeremy Gunawardena wrote a great piece on this: https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007...
Plenty of simple models in biology that don't model the underlying details provide profoundly generalizable insights across scales. The percolation threshold model explains phase transition behavior from the savanna-forest transition to the complement immune system to epidemics to morphogenesis to social networks.
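For anyone who hasn't played with it, the percolation threshold is easy to see in a few lines of Python. A minimal site-percolation sketch on a square lattice (the ~0.593 critical value is the standard quoted figure):

```python
# Minimal site-percolation sketch: the largest connected cluster changes abruptly
# around the critical occupation probability (~0.593 for 2D site percolation).
import numpy as np
from scipy.ndimage import label

rng = np.random.default_rng(0)
L = 200

def largest_cluster_fraction(p: float) -> float:
    """Fraction of lattice sites belonging to the largest connected cluster."""
    grid = rng.random((L, L)) < p            # occupy each site with probability p
    labels, n = label(grid)                  # 4-connectivity by default
    if n == 0:
        return 0.0
    sizes = np.bincount(labels.ravel())[1:]  # skip background label 0
    return sizes.max() / (L * L)

for p in (0.45, 0.55, 0.59, 0.63, 0.70):
    print(f"p={p:.2f}  largest cluster ~ {largest_cluster_fraction(p):.3f} of lattice")
```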
And the extremely difficult, expensive, and often resultless process of confirming/denying these assumptions is one of the greatest uses of tax dollars and university degrees I can think of. Yet the current admin has taken the perspective that it's all Miasma, while also cutting the EPA, which, by their logic, would stop the Miasma.
The folks at Arc are trying to build this! https://arcinstitute.org/news/virtual-cell-model-state
STATE is not a simulation. It's a trained graphical model that does property prediction as a result of a perturbation. There is no physical model of a cell.
Personally, I think Arc's approach is more likely to produce usable scientific results in a reasonable amount of time. You would have to make a very coarse model of the cell to get any reasonable amount of sampling, and you would probably spend huge amounts of time computing things which are not relevant to the properties you care about. An embedding and graphical model seems well-suited to problems like this, as long as the underlying data is representative and comprehensive.
You may enjoy this, from a top-down experimental perspective (https://www.nikonsmallworld.com/galleries/small-world-in-mot...). Only a few entries so far show intracellular dynamics (like this one: https://www.nikonsmallworld.com/galleries/2024-small-world-i...), but I always enjoy the wide variety of dynamics some groups have been able to capture, like nervous system development (https://www.nikonsmallworld.com/galleries/2018-small-world-i...); absolutely incredible.
Very interesting, thanks.
How can you simulate what is not yet reliably known? Ugh, it's so frustrating to hear AI 'thought leaders' going on and on about this being a panacea, especially when a majority of the funding for even the research needed to train models has been substantially cut so Elon could have more rocket dollars.
'Seeing' inside cells/tissues/organs/organisms is pretty much most modern biological research.
It's a main aim at DeepMind. I hope they succeed as it could be very useful.
Do they specifically state that it's their main aim anywhere?
Edit: Never mind, I've googled the answer.
I believe this is where quantum computing comes in, but that could be a decade out, and AI acceleration is hard to predict.
What's missing feels like the equivalent of a "fast-forward" button for cell-scale dynamics
Why simulate? We can already do it experimentally
In my field, we're always wanting to see what will happen when DNA is changed in a human pancreatic beta cell. We kind of have a protocol for producing things that look like human pancreatic beta cells from human stem cells, but we're not really sure that they are going to behave like real human pancreatic beta cells for any particular DNA change, and we have examples of cases where they definitely do not behave the same.
You can't see what's going on in most cases.
I wish there were more interest in general in building true deterministic simulations than black boxes that hallucinate and can't show their work.
The functional predictions related to "non-coding" variants are big here. Non-coding regions, referred to as the dark genome, produce regulatory non-coding RNAs that determine the level of gene expression in a given cell type. There are more regulatory RNAs than there are genes. Something like 75% of expression by volume is ncRNA.
There is a big long-running argument about what "functional" means in "non-coding" parts of the genome. The deeper I pushed into learning about the debate the less confident I became of my own understanding of genomics and evolution. See https://www.sciencedirect.com/science/article/pii/S096098221... for one perspective.
It's possible that the "functional" aspect of non-coding RNA exists on a time scale much larger than what we can assay in a lab. The sort of "junk DNA/RNA" hypothesis: the ncRNA part of the genome is material that increases fitness during relatively rare events where it's repurposed into something else.
On a time frame of millions or billions of years, organisms with the flexibility of ncRNA would have an advantage, but this is extremely hard to figure out from a single-point-in-time viewpoint.
Anyway, that was the basic lesson I took from studying non-coding RNA 10 years ago. Projects like ENCODE definitely helped, but they really just exposed transcription of elements that are noisy, without providing the evidence that any of it is actually "functional". Therefore, I'm skeptical that more of the same approach will be helpful, but I'd be pleasantly surprised if wrong.
Such an advantage, acting that rarely and across such long time scales, would be so small on average that it would be effectively neutral. Natural selection can only really act on fitness advantages greater than roughly the inverse of the effective population size, which for large multicellular organisms such as animals is low. Most of this is really just noisy transcription/binding/etc.
For example, we don't keep transposons, which make up almost half of our genomes and are a major source of disruptive variation, because they're useful. They persist because we're just not very good at preventing them from spreading: we have some suppressive mechanisms, but they don't work all the time, and there's a bit of an arms race between transposons and host. Nonetheless, they can occasionally provide variation that is beneficial.
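To put a rough number on the "effectively neutral" point above, here is a hedged back-of-envelope using Kimura's diffusion approximation for the fixation probability of a new mutation (assuming genic selection and N = Ne; the Ne value is just an order-of-magnitude placeholder):

```python
# Fixation probability of a new mutation under the standard diffusion approximation:
# P_fix ~ (1 - exp(-2s)) / (1 - exp(-4*Ne*s)), assuming genic selection and N = Ne.
import math

def p_fix(s: float, Ne: int) -> float:
    if s == 0:
        return 1 / (2 * Ne)                   # neutral expectation
    return (1 - math.exp(-2 * s)) / (1 - math.exp(-4 * Ne * s))

Ne = 10_000                                    # order of magnitude often quoted for humans
for s in (1e-6, 1e-5, 1e-4, 1e-3):
    ratio = p_fix(s, Ne) / (1 / (2 * Ne))      # enrichment over the neutral rate
    print(f"s={s:.0e}: fixation prob is {ratio:.2f}x the neutral rate")
# Only when s >> 1/(2*Ne) = 5e-5 does selection meaningfully beat drift.
```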
Understanding the genome has always felt like trying to solve a massive puzzle with pieces constantly shifting. Tools like AlphaGenome are changing that—offering a more focused way to interpret complex genetic data. In the lab I worked in, precision was everything. We relied heavily on uv spectrophotometry for DNA quantification, and the systems from https://www.berthold.com/en/ stood out for their consistency, even under demanding conditions. Their devices helped streamline processes where accuracy couldn’t be compromised, especially when dealing with fragile or low-concentration samples. Founded back in 1949, they’ve become a global reference for reliable measuring technology. From radiation detection to life sciences and industrial process control, their solutions cover diverse fields. For anyone navigating genomics or analytical research, choosing the right tools isn’t just about features—it’s about long-term dependability and clarity in results.
"To ensure consistent data interpretation and enable robust aggregation across experiments, metadata were standardized using established ontologies."
Can't emphasize enough how much DNA analysis depends on human data curation to make things work; even from day one, alignment models were driven by biological observations. Glad to see UBERON, which represents a massive amount of human insight and data curation of what is for all intents and purposes a semantic-web product (OWL-based RDF at the heart), playing a significant role.
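Since UBERON really is OWL-based RDF under the hood, ordinary semantic-web tooling works on it. A minimal sketch with rdflib (the file name and the exact labels returned are assumptions; grab the OWL release from the OBO Foundry):

```python
# Hedged sketch: query UBERON as plain RDF with rdflib.
from rdflib import Graph, Namespace

g = Graph()
g.parse("uberon.owl")                          # large file; parsing takes a while

RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")
q = """
SELECT ?cls ?parentLabel WHERE {
  ?cls rdfs:label "pancreas" .
  ?cls rdfs:subClassOf ?parent .
  ?parent rdfs:label ?parentLabel .
}
"""
for row in g.query(q, initNs={"rdfs": RDFS}):
    print(row.cls, "is_a", row.parentLabel)    # named superclasses of the matched term
```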
I don't think DM is the only lab doing high-impact AI applications research, but they really seem to punch above their weight in it. Why is that, or is it just that they have better technical marketing for their work?
This one seems like well done research but in no way revolutionary. People have been doing similar stuff for a while...
Agreed, there have been some interesting developments in this space recently (e.g. AgroNT). Very excited for it, particularly as genome sequencing gets cheaper and cheaper!
I’d pitch this paper as a very solid demonstration of the approach, and I'm sure it will lead to some pretty rapid developments (similar to what RoseTTAFold/AlphaFold did).
They have been at it for a long time and have a lot of resources courtesy of Google. Asking Perplexity, it says the AlphaFold 2 database took "several million GPU hours".
It's also a core interest of Demis.
DeepMind/Google does a lot more than the other places that most HN readers would think about first (Amazon, Meta, etc). But there is a lot of excellent work with equal ambition and scale happening in pharma and biotech, that is less visible to the average HN reader. There is also excellent work happening in academic science as well (frequently as a collaboration with industry for compute). NVIDIA partners with whoever they can to get you committed to their tech stack.
For instance, Evo2 by the Arc Institute is a DNA Foundation Model that can do some really remarkable things to understand/interpret/design DNA sequences, and there are now multiple open weight models for working with biomolecules at a structural level that are equivalent to AlphaFold 3.
Well, they are a Google organization. Being backed by a $2T company gives you more benefits than just marketing.
Money and resources are only a partial explanation. There are equally and more valuable companies that aren't having nearly as much success in applied AI.
Other labs are definitely doing amazing work too, but often it's either more niche or less public-facing
In biology, Arc Institute is doing great novel things.
Some pharmas like Genentech or GSK also have excellent AI groups.
Arc have just released a perturbation model btw. If it reliably beats linear benchmarks as claimed it is a big step
https://arcinstitute.org/news/virtual-cell-model-state
This is such an interesting problem. Imagine expanding the input size to 3.2 Gbp, the size of the human genome. I wonder if previously unimaginable interactions would occur. Also interesting how everything revolves around U-nets and transformers these days.
You would not need much more than 2 megabases. The genome is not one contiguous sequence. It is organized (physically segregated) into chromosomes and topologically associating domains. IIRC, 2 megabases is roughly the 3 SD threshold for interactions between cis-regulatory elements/variants and their effector genes.
> Also interesting how everything revolves around U-nets and transformers these days.
To a man with a hammer…
Or to a man with a wheel and some magnets and copper wire...
There are technologies applicable broadly, across all business segments. Heat engines. Electricity. Liquid fuels. Gears. Glass. Plastics. Digital computers. And yes, transformers.
Soon we’ll be able to get the whole genome up on the blockchain. (I thought the /s was obvious)
Even just modeling 3D genome organization or ultra-long-range enhancers more realistically could open up new insights
So very similar approach to Conformer - convolution head for downsampling and transformer for time dependencies. Hmm, surprising that this idea works across application domains.
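For the curious, the pattern being compared to Conformer looks roughly like this as a minimal PyTorch sketch: a convolutional stem downsamples the long one-hot sequence, then a transformer handles the long-range dependencies. This is my own illustration, not AlphaGenome's architecture or code.

```python
import torch
import torch.nn as nn

class ConvDownsampleTransformer(nn.Module):
    def __init__(self, in_channels=4, d_model=256, n_layers=4, n_heads=8, pool=8):
        super().__init__()
        self.stem = nn.Sequential(                      # one-hot DNA -> pooled embeddings
            nn.Conv1d(in_channels, d_model, kernel_size=15, padding=7),
            nn.GELU(),
            nn.MaxPool1d(pool),                         # e.g. 1 bp -> 8 bp resolution
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)               # per-bin track prediction

    def forward(self, x):                               # x: (batch, length, 4) one-hot DNA
        h = self.stem(x.transpose(1, 2)).transpose(1, 2)
        h = self.transformer(h)
        return self.head(h)                             # (batch, length // pool, 1)

model = ConvDownsampleTransformer()
dna = torch.randn(2, 4096, 4)                           # stand-in for one-hot sequence
print(model(dna).shape)                                 # torch.Size([2, 512, 1])
```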
I'm somewhat a noob here, but does this model have a good understanding of things like ORFs, methylation, etc., or is it strictly a sequence pattern matching thingy?
These are the type of advances in AI models that I'm excited about because of their potentially beneficial high impact for mankind. Not models that are a better (but less reliable) search engine or coding assistant and email writer. I wish more effort/money was going into this.
With the huge jump in RNA prediction seems like it could be a boon for the wave of mRNA labs
Those outside the US at least ...
I've been saying we need a rebranding of mRNA in the USA; it's coming.
When I went to work at Google in 2008 I immediately advocated for spending significant resources on the biological sciences (this was well before DM started working on biology). I reasoned that Google had the data mangling and ML capabilities required to demonstrate world-leading results (and hopefully guide the way so other biologists could reproduce their techniques). We made some progress- we used exacycle to demonstrate some exciting results in protein folding and design, and later launched Cloud Genomics to store and process large datasets for analytics.
I parted ways with Google a while ago (Sundar is a really uninspiring leader), and was never able to transfer into DeepMind, but I have to say that they are executing on my goals far better than I ever could have. It's nice to see ideas that I had germinating for decades finally playing out, and I hope these advances lead to great discoveries in biology.
It will take some time for the community to absorb this most recent work. I skimmed the paper and it's a monster, there's just so much going on.
> Sundar is a really uninspiring leader
I understand, but he made Google a cash machine. In the last quarter BEFORE he was CEO in 2015, Google made a quarterly profit of around 3B. Q1 2025 was 35B. A 10x profit growth at this scale is, well, unprecedented; the numbers are inspiring in themselves, and that's his job. He made mistakes, sure, but he stuck to Google's big gun, ads, and it paid off. The transition to AI started late, but Gemini is super competitive overall. DeepMind has been doing great as well.
Sundar is not a hypeman like Sam or Cook, but he delivers. He is very underrated imo.
Like Ballmer, he was set up for success by his predecessor(s), and didn't derail strong growth in existing businesses but made huge fumbles elsewhere. The question is, who is Google's Satya Nadella? Demis?
He might have delivered a lot of revenue growth, yeah, but Google culture is basically gone. Internally we're not very far from Amazon-style "performance management".
> Last quarter BEFORE he was CEO in 2015, google made a quarterly profit of around 3B. Q1 2025 was 35B.
Google's revenue in 2014 was $75B and in 2024 it was $348B, that's 4.64 times growth in 10 years or 3.1 times if corrected for the inflation.
And during this time, Google failed to launch any significant new revenue source.
I like that you are writing as a defense of Google and Sundar.
Tim Cook is the opposite of a hypeman.
He delivered revenue growth by enshittifying Goog's products. Gemini is catching up because Demis is a boss and TPUs are a real competitive advantage.
Their brand is almost cooked though. At least the legacy search part. Maybe they'll morph into AI center of the future, but "Google" has been washed away.
> The transition to AI started late but gemini is super competitive overall.
If by competitive you mean "We spent $75 Billion dollars and now have a middle of the pack model somewhere between Anthropic and Chinese startup", that's a generous way to put it.
Googler here ---^
I have incredibly mixed feelings on Sundar. Where I can give him credit is really investing in AI early on: even if they were late to productize it, they were not late to invest in the infra and tooling to capitalize on it.
I also think people are giving maybe a little too much credit to Demis and not enough to Jeff Dean for the massive amount of AI progress they've made.
Nice wow 20% of the credit goes to you for thinking of this years ago. Kudos
Did you ride the Santa Cruz shuttle, by any chance? We might have had conversations about this a long while ago. It sounded so exciting then, and still does with AlphaGenome.
It's easy to forget how early some of these ideas were being pushed internally
A charitable view is that they intended "ideas that I had germinating for decades" to be from their own perspective, and not necessarily spurred inside Google by their initiative. I think that what they stated prior to this conflated the two, so it may come across as bragging. I don't think they were trying to brag.
I don't find it rude or pretentious. Sometimes it's really hard to express yourself in a, hmm, acceptable neutral way when you worked on truly cool stuff. It may look like bragging, but that's probably not the intention. I often face this myself, especially when talking to non-tech people - how the heck do I explain what I work on without giving a primer on computer science!? Often "whenever you visit any website, it eventually uses my code" is a good enough answer (I worked on the AWS EC2 hypervisor, and well, whenever you visit any website, some dependency of it eventually hits AWS EC2).
FWIW, I interpreted more as "This is something I wanted to see happen, and I'm glad to see it happening even if I'm not involved in it."
From Marx to Zizek to Fukuyama [1], in 200 years of Leftist thinking nobody has ever come close to saying "we can fix capitalism".
What makes you think that LLMs can do it?
[1] relapsed capitalist, at best, check the recent Doomscroll interview
Yeah it comes off as braggy, but it’s only natural to be proud of your foresight
Curious how it'll perform when people start fine-tuning on smaller, specialized datasets
Let’s figure out introns pls
Demis to be the first to get 4 consecutive nobels
I found it disappointing that they ignored one of the biggest problems in the field, i.e. distinguishing between causal and non-causal variants among highly correlated DNA loci. In genetics jargon, this is called fine mapping. Perhaps this is something for the next version, but it is really important for designing effective drugs that target key regulatory regions.
One interesting example of such a problem and why it is important to solve it was recently published in Nature and has led to interesting drug candidates for modulating macrophage function in autoimmunity: https://www.nature.com/articles/s41586-024-07501-1
Does this get us closer? Pretty uninformed but seems that better functional predictions make it easier to pick out which variants actually matter versus the ones just along for the ride. Step 2 probably is integrating this with proper statistical fine mapping methods?
Yes, but it's not dramatically different from what is out there already.
There is a concerning gap between prediction and causality. In problems like this one, where lots of variables are highly correlated, prediction methods that only have an implicit notion of causality don't perform well.
Right now, SOTA seems to use huge population data to infer causality within each linkage block of interest in the genome. These types of methods are quite close to Pearl's notion of causal graphs.
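For anyone unfamiliar with the statistical side, here is a minimal sketch of single-causal-variant fine mapping using Wakefield's approximate Bayes factor; the z-scores and standard errors are made up for illustration.

```python
# Single-causal-variant fine mapping with Wakefield's approximate Bayes factor.
import numpy as np

def wakefield_abf(z, se, prior_sd=0.15):
    """Approximate Bayes factor for association vs. the null (Wakefield's ABF)."""
    V, W = se**2, prior_sd**2
    r = W / (V + W)
    return np.sqrt(1 - r) * np.exp(z**2 * r / 2)

# Toy locus: five correlated variants; index 2 carries the strongest signal.
z  = np.array([4.1, 4.8, 5.6, 4.9, 3.7])       # GWAS z-scores (made up)
se = np.full(5, 0.02)                           # effect-size standard errors (made up)

abf = wakefield_abf(z, se)
pip = abf / abf.sum()                           # posterior inclusion probability, flat
for i, p in enumerate(pip):                     # prior, assuming exactly one causal variant
    print(f"variant {i}: PIP = {p:.3f}")
```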
There are existing frameworks for integrating functional and statistical fine mapping methods (e.g. polyfun + susie/finemap). They use annotation overlaps like epigenetic or conservation tracks but can be extended to variant effect predictions from models like this. They essentially modify the prior probability of a variant being causal from uniform to one that depends on the functional annotation.
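A minimal sketch of that reweighting idea (purely illustrative numbers; not the actual polyfun/SuSiE machinery): swap the flat causal prior for one proportional to a functional score, then renormalize the single-causal posterior.

```python
# Functional prior reweighting for single-causal-variant fine mapping (toy numbers).
import numpy as np

abf = np.array([3.0e5, 8.0e5, 5.0e6, 9.0e5, 4.0e4])          # Bayes factors from GWAS stats
functional_score = np.array([0.10, 0.05, 0.02, 0.90, 0.08])  # e.g. predicted variant effect

prior_flat = np.full(len(abf), 1 / len(abf))
prior_func = functional_score / functional_score.sum()

pip_flat = abf * prior_flat / (abf * prior_flat).sum()
pip_func = abf * prior_func / (abf * prior_func).sum()

for i, (a, b) in enumerate(zip(pip_flat, pip_func)):
    print(f"variant {i}: PIP {a:.2f} -> {b:.2f} after functional prior")
```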
Just add startofficial intel.
You know the corporate screws are coming down hard, when the model (which can be run off a single A100) doesn't get a code release or a weight release, but instead sits behind an API, and the authors say fuck it and copy-paste the entirety of the model code in pseudocode on page 31 of the white paper.
Please Google/Demis/Sergei, just release the darn weights. This thing ain't gonna be curing cancer sitting behind an API and it's not gonna generate that much GCloud revenue when the model is this tiny.
This is a strange take, because this is consistent with what Google has been doing for a decade with AI. AlphaGo never had its weights released. Nor has any successor (not MuZero, not the StarCraft one, not the protein-folding AlphaFold, nor any other that could reasonably be claimed to be in the series, afaik).
You can state as a philosophical ideal that you prefer open source or open weights, but that's not something DeepMind has ever prioritized.
I think it's worth discussing:
* What are the advantages or disadvantages of bestowing a select few with access?
* What about having an API that can be called by anyone (although they may ban you)?
* Vs finally releasing the weights
But I think "behind locked down API where they can monitor usage" makes sense from many perspectives. It gives them more insight into how people use it (are there things people want to do that it fails at?), and it potentially gives them additional training data
All of what you said makes sense from the perspective of a product manager working for a for-profit company trying to maximize profit either today or eventually.
But the submission blog post writes:
> To advance scientific research, we’re making AlphaGenome available in preview via our AlphaGenome API for non-commercial research, and planning to release the model in the future. We believe AlphaGenome can be a valuable resource for the scientific community, helping scientists better understand genome function, disease biology, and ultimately, drive new biological discoveries and the development of new treatments.
And at that point, they're painting this release as something they did in order to "advance scientific research" and because they believe "AlphaGenome can be a valuable resource".
So now they're at a crossroads: is this release actually for advancing scientific research, and if so, why aren't they doing it in a way that actually maximizes advancing scientific research? Which I think is the point of the parent's comment.
Even the most basic principle for doing research, being able to reproduce something, goes out the window when you put it behind an API, so personally I doubt their ultimate goal here is to serve the scientific community.
Edit: Reading further comments it seems like they've at least claimed they want to do a model+weights release of this though (from the paper: "The model source code and weights will also be provided upon final publication.") so remains to be seen if they'll go through with it or not.
The predecessor to this model, Enformer, which was developed in collaboration with Calico, had a weight release and a source release.
The precedent I'm going with is specifically in the gene regulatory realm.
Furthermore, a weight release would allow others to finetune the model on different datasets and/or organisms.
I think that from a research/academic view of the landscape, building off a mutable API is much less preferred than building off a set of open weights. It would be even better if we had the training data, along with all code and open weights. However, I would take open weights over almost anything else in the current landscape.
> The model source code and weights will also be provided upon final publication.
Page 59 from the preprint[1]
Seems like they do intend to publish the weights actually
[1]: https://storage.googleapis.com/deepmind-media/papers/alphage...
Thank you for this. I did not notice this at the end of the paper.
For AlphaFold3 (vs. AlphaFold2, which was 100% public), they released the weights if you are affiliated with an academic institution. I hope they do the same with AlphaGenome. I don't even care about the commercial reasons or licensing fees; it's more of a practical reason that every research institution has an HPC cluster which is already configured to run deep learning stuff and can run these jobs faster than the Google API.
And if they don't, I'm not sure how this will gain adoption. There are tons of well-maintained and established workflows out there in the cloud and on-prem that do all of these things AlphaGenome claims to do very well - many that Google promotes on their own platform (e.g., GATK on GCP).
(People in tech think people in science are like people in tech and just jump on the latest fads from BigTech marketing - when it's quite the opposite: it's all about whether your results/methods will please the reviewers in your niche community.)
Such a strange position to take indeed. I don't see people clamoring behind Apple when they do proprietary things - iMessage protocol, Bluetooth improvements for AirPods, private APIs in Apple Watch.
Apple's "life saving" Apple Watch features are only accessible on premium devices. "Privacy is a human right" is also only possible if you buy their devices. It doesn't go around making it free to everyone, and nobody seems to be saying "if you believe in that, then why don't you make it accessible for people from all socio-economic classes?"
> Once the model is fully released, scientists will be able to adapt and fine-tune it on their own datasets to better tackle their unique research questions.
This is in the press release, so they are going to release the weights.
EDIT: I should have read the paper more thoroughly and been more kind. On page 59, they mention there will be a source and code release. Thank you Deepmind :)
The bean counters rule. There is no corporate vision, no long-term plan. The numbers for the next quarter are driving everything.
I can guarantee you that some smart person actually thinks that the opportunity size is measured as a fraction of the pharmaceutical industry market cap.
Naturally, the (AI-generated?) hero image doesn't properly render the major and minor grooves. :-)
When I was restudying biology a few years ago, it was making me a little crazy trying to understand the structural geometry that gives rise to the major and minor grooves of DNA. I looked through several of the standard textbooks and relevant papers. I certainly didn't find any good diagrams or animations.
So out of my own frustration, I drew this. It's a cross-section of a single base pair, as if you are looking straight down the double helix.
Aka, picture a double strand of DNA as an earthworm. If one of the earthworm's segments is a base pair, and you cut the earthworm in half, and turn it 90 degrees, and look into the body of the worm, you'd see this cross-sectional perspective.
Apologies for overly detailed explanation; it's for non-bio and non-chem people. :)
https://www.instagram.com/p/CWSH5qslm27/
Anyway, I think the way base pairs bond forces this major and minor groove structure observed in B-DNA.
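A tiny geometric sketch of that bonding argument, using the approximate textbook figure that the two glycosidic bonds of a base pair meet the helix axis about 120 degrees apart rather than 180 (the angle and diameter are approximations, not measurements):

```python
# Rough geometry of why the grooves are unequal: the backbones attach ~120 deg apart.
import math

glycosidic_separation_deg = 120          # approx. angle between backbone attachment points
minor_groove_deg = glycosidic_separation_deg
major_groove_deg = 360 - glycosidic_separation_deg

helix_diameter_nm = 2.0                  # approx. B-DNA diameter
circumference = math.pi * helix_diameter_nm
print(f"minor groove spans ~{minor_groove_deg} deg "
      f"(~{circumference * minor_groove_deg / 360:.1f} nm of arc)")
print(f"major groove spans ~{major_groove_deg} deg "
      f"(~{circumference * major_groove_deg / 360:.1f} nm of arc)")
# The asymmetric attachment, repeated with a twist of roughly 34-36 deg per base pair,
# winds into the familiar major/minor groove pattern.
```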
It's not really just base pairs forcing groove structure. The repulsion of the highly charged phosphates, the specific chemical nature of the dihedral bonds making up the backbone and sugar/base bond, the propensity of the sugar to pucker, the pi-pi stacking of adjacent pairs, salt concentration, and water hydration all contribute.
My graduate thesis was basically simulating RNA and DNA duplexes in boxes of water for long periods of time (if you can call 10 nanoseconds "long"), and RNA could get stuck for very long periods of time in the "wrong" (i.e., not what we see in reality) conformation, due to phosphate / 2' sugar hydroxyl interactions.
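For readers who want to poke at this themselves, here is a minimal modern sketch of that kind of setup (a solvated nucleic-acid duplex run for ~10 ns) using OpenMM; the input file name and force-field choices are my assumptions, not what the original thesis work used.

```python
# Solvate a nucleic-acid duplex and run ~10 ns of MD with OpenMM.
from openmm import LangevinMiddleIntegrator, unit
from openmm.app import PDBFile, ForceField, Modeller, Simulation, PME, HBonds

pdb = PDBFile("dna_duplex.pdb")                       # hypothetical input structure
ff = ForceField("amber14-all.xml", "amber14/tip3pfb.xml")

modeller = Modeller(pdb.topology, pdb.positions)
modeller.addSolvent(ff, padding=1.0 * unit.nanometer, ionicStrength=0.15 * unit.molar)

system = ff.createSystem(modeller.topology, nonbondedMethod=PME,
                         nonbondedCutoff=1.0 * unit.nanometer, constraints=HBonds)
integrator = LangevinMiddleIntegrator(300 * unit.kelvin, 1 / unit.picosecond,
                                      0.002 * unit.picoseconds)

sim = Simulation(modeller.topology, system, integrator)
sim.context.setPositions(modeller.positions)
sim.minimizeEnergy()
sim.step(5_000_000)                                   # 10 ns at a 2 fs timestep
```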
For anyone wondering: https://www.mun.ca/biology/scarr/MGA2_02-07.html
Maybe they were depicting RNA? (probably not)
No; what they drew doesn't look like real DNA or (duplex double stranded) RNA. Both have differently sized/spaced grooves (see https://www.researchgate.net/profile/Matthew-Dunn-11/publica...).
At least they got the handedness right.
When a human does it, it's style! When AI does it, you cry about your job.
And yet still manages to be 4MB over the wire.
That's only on high-resolution screens. On lower resolution screens it can go as low as 178,820 bytes. Amazing.
Maybe "Release" requires a bit more context, as it clearly means different things to different people:
> AlphaGenome will be available for non-commercial use via an online API at http://deepmind.google.com/science/alphagenome
So, essentially the paper is a sales pitch for a new Google service.
Can't wait for people to use it for CRISPR and have it hallucinate some weird mutation.
I bet the internal pitch is that the genome will help deliver better advertising: if you are at risk of colon cancer, they sell you "colon supplements". It's likely they'll be able to infer a bit about your personality just from your genome: "these genes are correlated with liking dark humor, use them to promote our new movie".