Before my comment gets dismissed, I will disclaim I am a professional structural biologist that works in this field every day.
These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on its head, etc. The language from Google is so deceptive about what they've actually done that I think it's intentionally disingenuous.
At the end of the day, AlphaFold is amazing homology modeling. I love it, I think it's an awesome application of machine learning, and I use it frequently. But it's doing the same thing we've been doing for two decades: pattern matching sequences of proteins with unknown structure to sequences of proteins with known structure, just about 2x as well as we used to be able to.
That's extremely useful, but it's not knowledge of protein folding. It can't predict a fold de novo, it can't predict folds that haven't been seen (EDIT: this is maybe not strictly true, depending on how you slice it), and it fails in a number of edge cases (remember, in biology, edge cases are everything). And again, I can't stress this enough: we have no new information on how proteins fold. We know all the information (most of it, at least) for a protein's final fold is in the sequence. But we don't know much about the in-between.
I like AlphaFold, it's convenient and I use it (although for anything serious or anything interacting with anything else, I still need a real structure), but I feel as though it has been intentionally and deceptively oversold. There are 3-4 other deep learning projects I think have had a much greater impact on my field.
EDIT: See below: https://news.ycombinator.com/item?id=32265662 for information on predicting new folds.
Not sure if you should be reminded of how AlphaFold started: it won a competition thought unwinnable by academics. Top labs working in protein structure prediction have fundamentally changed direction after AlphaFold and are working to do the same even better.
This is not the first (or even tenth) time I’m seeing an academic trying to undermine genuine progress almost to the level of gaslighting. Comparing AlphaFold to conventional homology modeling is disingenuous at its most charitable interpretation.
Not sure what else to say. Structural biology has always been the weirdest field I’ve seen, the way students are abused (crystallize and publish in Nature or go bust), and how every Nature issue will have three structure papers as if that cures cancer every day. I suppose it warps one’s perception of outsiders after being in such a bubble?
signed, someone with a PhD in biomedical engineering who did a ton of bio work.
> Not sure if you should be reminded of how AlphaFold started: it won a competition thought unwinnable by academics. Top labs working in protein structure prediction have fundamentally changed direction after AlphaFold and are working to do the same even better.
Not sure what part of "it does homology modeling 2x better" you didn't see in my comment. AlphaFold scored something like 85% in CASP 2020; in CASP 2016, I-TASSER had, I think, 42%. So it's ~2x as good as I-TASSER, which is exactly what I said in my comment.
>This is not the first (or even tenth) time I’m seeing an academic trying to undermine genuine progress almost to the level of gaslighting. Comparing AlphaFold to conventional homology modeling is disingenuous at its most charitable interpretation.
It literally is homology modeling. The deep learning aspect is to boost otherwise unnoticed signal that most homology modeling software couldn't tease out. Also, I don't think I'm gaslighting, but maybe I'm wrong? If anything, I felt gaslit by the language around AlphaFold.
>Not sure what else to say. Structural biology has always been the weirdest field I’ve seen, the way students are abused (crystallize and publish in Nature or go bust), and how every Nature issue will have three structure papers as if that cures cancer every day. I suppose it warps one’s perception of outsiders after being in such a bubble?
What on earth are you even talking about? The vast, VAST majority of structures go unpublished ENTIRELY, let alone published in Nature. There are almost 200,000 structures on deposit in the PDB.
> Comparing AlphaFold to conventional homology modeling is disingenuous at its most charitable interpretation.
It's really not - have you played around with AF at all? Made mutations to protein structures and asked it to model them? Go look up the crystal structures for important proteins like FOXA1 [1], AR [2], EWSR1 [3], etc. (i.e. pretty much any protein target we really care about and haven't previously solved) and tell me with a straight face that AF has "solved" protein folding - it's just a fancy language model that's pattern matching to things it's already seen solved before.
This isn’t a good use of the term gaslighting. Accusing someone of gaslighting takes what we used to call a ‘difference of opinion’ and mutates it into deliberate and wicked psychological warfare.
Incidentally, accusing someone of gaslighting is itself a form of gaslighting.
Not only is CASP not "unwinnable," it's not even a contest. The criteria involved are rated as "moderately difficult." Alphafold is a significant achievement but it sure as hell hasn't "revealed the structure of the protein universe," whatever that means.
Which top labs have changed direction? Because Alphafold can't predict folds, just identify ones it's seen.
I've directly communicated with the leaders of CASP and at DM that they should stop representing this as a form of protein folding and just call it "crystal/cryoEM structure prediction" (they filter out all the NMR structures from PDB since they aren't good for prediction). They know it's disingenuous and they do it on purpose to give it more impact than it really deserves.
I would like to correct something here: it does predict structures de novo and predict folds that haven't been seen before. That's because of the design of the NN: it uses sequence information to create structural constraints. If those constraints push the modeller in the direction of a novel fold, it will predict that.
To me what's important about this is that it demonstrated the obvious (I predicted this would happen eventually, shortly after losing CASP in 2000).
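If it helps make "uses sequence information to create structural constraints" concrete: here's a toy sketch (mine, purely illustrative; AlphaFold's real pipeline is vastly more elaborate). Pretend a network has predicted all pairwise residue distances for a chain, then recover 3D coordinates by gradient descent on those restraints. Nothing in this step depends on having seen a similar fold before, which is why novel folds are reachable in principle.

    import numpy as np

    # Toy constraint-based modelling: turn "predicted" pairwise residue
    # distances into 3D coordinates by minimizing harmonic restraints.
    rng = np.random.default_rng(0)
    n = 20                                               # residues in a toy chain
    true = np.cumsum(rng.normal(size=(n, 3)), axis=0)    # a made-up "novel fold"
    d_pred = np.linalg.norm(true[:, None] - true[None, :], axis=-1)

    x = rng.normal(size=(n, 3))                          # random starting coordinates
    for step in range(5000):
        diff = x[:, None] - x[None, :]                   # (n, n, 3) displacements
        d = np.linalg.norm(diff, axis=-1) + np.eye(n)    # eye avoids divide-by-zero
        err = (d - d_pred) * (1 - np.eye(n))             # restraint violations
        x -= 0.01 * ((err / d)[:, :, None] * diff).sum(axis=1)

    # The restraint error collapses toward zero: the "fold" is recovered
    # (up to a mirror image, since distances can't encode chirality).
    print("restraint RMS:", np.sqrt((err ** 2).mean()))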
>I would like to correct something here: it does predict structures de novo and predict folds that haven't been seen before. That's because of the design of the NN: it uses sequence information to create structural constraints. If those constraints push the modeller in the direction of a novel fold, it will predict that.
Could you expand on this? Basically it looks at the data, and figures out what's an acceptable position in 3D space for residues to occupy, based on what's known about other structures?
I will update my original post to point out I may not be entirely correct there.
The distinction I'm trying to make is that there's a difference between looking at pre-existing data and modeling (ultimately homology modeling, but maybe slightly different) and understanding how protein folding works, being able to predict de novo how an amino acid sequence will become a 3D structure.
1) Isonet - takes low SNR cryo-electron tomography images (that are extremely dose limited, so just incredibly blurry and frequently useless) and does two things:
* Deconvolutes some image aberrations and "de-noises" the images
* Compensates for missing wedge artifacts (the missing wedge is the fact that the tomography isn't done from -90° to +90°, but usually from -60° to +60°, leaving 30° wedges on the top and bottom with basically no information), which usually show up as directionality in the image density. So if you have a sphere, the top and bottom will be extremely noisy and stretched up and down (in Z). (There's a toy illustration of the wedge after this list.)
2) Topaz, but Topaz really counts as 2 or 3 different algorithms. Topaz has denoising of tomograms and of flat micrographs (i.e. images taken with a microscope, as opposed to 3D tomogram volumes). That denoising is helpful because it increases contrast (which is the fundamental problem in cryo-EM for looking at biomolecules). Topaz also has a deep learning particle picker which is good at finding views of your protein that are under-represented or otherwise missing, which, again, normally results in artifacts when you build your 3D structure.
3) EMAN2 convolutional neural network for tomogram segmentation/Amira CNN for segmentation/flavor-of-the-week CNN for tomogram segmentation. Basically, we can get a 3D volume of a cell or virus or whatever, but those volumes are noisy. To do anything worthwhile with them, even after denoising, we have to say "this is cell membrane, this is virus, this is nucleic acid" etc. CNNs have proven to be substantially better at doing this (provided you have an adequate "ground truth") than most users. (A minimal sketch of that idea follows below too.)
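Since the missing wedge is easier to show than to describe, here's a toy numpy sketch (hypothetical geometry, single 2D slice; nothing to do with Isonet's actual code). A -60° to +60° tilt series only samples Fourier components within 60° of the kx axis; killing the rest visibly smears a round particle along Z:

    import numpy as np

    n = 64
    kz, kx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    angle = np.degrees(np.arctan2(np.abs(kz), np.abs(kx) + 1e-12))
    mask = angle <= 60.0                     # True where the tilt series has data

    yy, xx = np.ogrid[:n, :n]
    disk = (((yy - n // 2) ** 2 + (xx - n // 2) ** 2) < 10 ** 2).astype(float)
    distorted = np.fft.ifft2(np.fft.fft2(disk) * mask).real   # wedge applied

    print(f"fraction of Fourier space measured: {mask.mean():.2f}")   # ~0.67
    # Plot 'distorted' and the disk comes out stretched along Z, exactly the
    # artifact described above.

And for point 3, a minimal sketch of voxel-wise segmentation (PyTorch; purely illustrative, real tools use much deeper U-Net-style architectures): 3D convolutions in, per-voxel class scores out, trained against a hand-annotated ground truth.

    import torch
    import torch.nn as nn

    class TinySegmenter(nn.Module):
        def __init__(self, n_classes=4):     # e.g. background/membrane/virus/nucleic acid
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv3d(16, 16, 3, padding=1), nn.ReLU(),
                nn.Conv3d(16, n_classes, 1),  # 1x1x1 conv -> per-voxel logits
            )

        def forward(self, x):                # x: (batch, 1, D, H, W) tomogram
            return self.net(x)

    model = TinySegmenter()
    tomogram = torch.randn(1, 1, 32, 32, 32)           # fake noisy volume
    labels = torch.randint(0, 4, (1, 32, 32, 32))      # fake "ground truth" voxel labels
    loss = nn.CrossEntropyLoss()(model(tomogram), labels)
    loss.backward()                                     # one training step's gradients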
I asked a structural biologist friend of mine (world class lab) about the impact of alphafold.
They said it's minimal.
In most cases, having a "probably" isn't good enough. They use alphafold to get early insights, but then they still use crystallography to confirm the structure. Because at the end of the day, you need to know for sure.
I'm not a biologist, but that doesn't sound minimal if crystallography is expensive.
It sounds like how we model airplanes in computers, but still test the real thing - I wouldn't call the impact of computer modelling on airplane design minimal.
This seems strange to me. The entire point of these types of models is to predict things on unseen data. Are you saying Deepmind is completely lying about their model?
Deepmind solved CASP, isn't the entire point of that competition to predict unseen structures?
If AlphaFold doesn't predict anything then what are you using it to do?
AlphaFold figures out that my input sequence (which has no structural data) is similar to this other protein that has structural data. Or maybe different parts of different proteins. It does this extremely well.
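In spirit (hugely simplified, and not AlphaFold's actual machinery, which builds multiple sequence alignments and learned representations rather than anything this crude), it's like scoring your query against a library of sequences whose structures are already solved. Everything below is made up for illustration:

    # Toy flavor of template/homology search: rank "solved" sequences by
    # shared k-mers with the query. Real pipelines use profile alignments
    # (jackhmmer, HHsearch) and learned features, not bare k-mer counts.
    def kmers(seq, k=3):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    known = {                      # hypothetical proteins with solved structures
        "1ABC": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "2XYZ": "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQW",
    }
    query = "MKTAYIAKQRQISFVKSHFARQLEERLGLIEVQ"   # structure unknown

    qk = kmers(query)
    for pdb_id, seq in known.items():
        overlap = len(qk & kmers(seq)) / len(qk)
        print(pdb_id, f"{overlap:.2f}")           # 1ABC scores far higher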
Disclaimer: I'm a professional (computational) structural biologist. My opinion is slightly different.
The problem with structure prediction is not a loss/energy function problem: even if we had an accurate model of all the forces involved, we'd still not have an accurate protein structure prediction algorithm.
Protein folding is a chaotic process (similar to the three-body problem). There's an enormous number of interactions involved - between different amino acids, the solvent and more. Numerical computation can't solve chaotic systems because floating point numbers have a finite representation, which leads to rounding errors and loss of accuracy.
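You can see the flavor of this with any chaotic system: perturb the initial condition at the last floating-point digit and the trajectories diverge completely. A toy demonstration (the logistic map, not a protein, but the same numerical pathology long trajectories suffer from):

    # Two starting points differing by one part in 10^15 decorrelate after a
    # few dozen iterations of a chaotic map.
    x, y = 0.4, 0.4 + 1e-15
    for _ in range(60):
        x = 3.9 * x * (1 - x)
        y = 3.9 * y * (1 - y)
    print(abs(x - y))   # order 0.1-1: the rounding-scale difference has saturated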
Besides, short-range electrostatic and van der Waals interactions are pretty well understood, and before AlphaFold many algorithms (like Rosetta) were pretty successful at a lot of protein modeling tasks.
Therefore, we need a *practical* way to look at protein structure determination that is akin to AlphaFold2.
As an outsider learning more about protein folding, could you elaborate on the assertion that the sequence is (mostly) all you need (transformer/ML reference intended)?
Doesn't this assume the final fold is static and invariant of environmental and protein interactions?
Put another way, how do we know that a protein does not fold differently under different environmental conditions or with different molecular interactions?
I realize this is a long-held assumption, but after studying scientific research for the past year, I realize many long-held assumptions aren't supported by convincing evidence.
These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on its head, etc.
I don't think that's necessarily so - there is a lot of justified scepticism about the wilder claims of ML in this forum; it is in fact quite difficult at times to know as an outsider to the field in question how kneejerk it is.
Additionally, folding doesn't focus on what matters. Generally you want to understand the active site; you already know the context (globular, membrane, embedded, conjugated) of the protein. It is an interesting question whether the folding could help identify active sites for further analysis. But -- I don't think AlphaFold is identifying new active sites or improving our understanding of their nuances.
Right, but even a speed up / quality increase can flip workflows on their head. Take ray tracing for example, when you speed it up by an order of magnitude, you can suddenly go from taking a break every time you want to render a scene, vs being able to iteratively work on a scene and preview it as you work.
I got a lot of shit (still do) when the news first broke for pushing back against the notion that AlphaFold "solved" protein folding. People really wanted to attach that word to the achievement. Thank you for providing a nuanced take on exactly why that doesn't make any sense.
I'm curious to read more on the 3-4 other deep learning projects you mentioned that have had a larger impact on your field. Can you share some links to those works?
It has template structures. AlphaFold uses the following databases:
* BFD
* MGnify
* PDB70
* PDB (structures in the mmCIF format)
* PDB seqres (only for AlphaFold-Multimer)
* Uniclust30
* UniProt (only for AlphaFold-Multimer)
* UniRef90
Can we just chill on the whole “using this single word incorrectly breaks your whole argument” thing?
A lot of folks on HN end posts about a company with a sentence like “Disclaimer: I used to work for X”. This language (probably taken from contract law or something) is meant as an admission of possible bias, but in practice is also a signal that this person may know what they’re talking about more so than the average person. After reading a lot of posts like this, it might feel reasonable for someone to flip the word around and say something like “I need to disclaim…” when beginning a post, in order to signal their proximity to a topic or field as well as any sort of insider bias they may possess.
So sure, “I need to disclose” would’ve been the better word choice, but we all knew what GP was saying. It seems pedantic to imply otherwise.
I got a 5th grader question about how proteins are used/represented graphically that I've never been able to find a satisfying answer for.
Basically, you see these 3D representations of specific proteins as a crumple of ribbons-- literally like someone ran multi-colored ribbons through scissors to make curls and dumped it on the floor (like a grade school craft project).
So... I understand that proteins are huge organic molecules composed of thousands of atoms, right? Their special capabilities arise from their structure/shape. So basically the molecule contorts itself to a low energy state which could be very complex but which enables it to "bind?" to other molecules expressly because of this special shape and do the special things that proteins do-- that form the basis of living things. Hence the efforts, like Alphafold, to compute what these shapes are for any given protein molecule.
But what does one "do" with such 3D shapes?
They seem intractably complex. Are people just browsing these shapes and seeing patterns in them? What do the "ribbons" signify? Are they just some specific arrangement of C,H,O? Why are some ribbons different colors? Why are there also thread-like things instead of all ribbons?
Also, is that what proteins would really look like if you could see at sub-optical wavelength resolutions? Are they really like that? I recall from school the equipartition theorem-- 1/2 kT of kinetic energy for each degree of freedom. These things obviously have many degrees of freedom. So wouldn't they be "thrashing around" like a rag doll in a blender at room temperature? It seems strange to me that something like that could be so central to life, but it is.
Just trying to get myself a cartoonish mental model of how these shapes are used! Anyone?
The ribbons and helices you see in those pictures are abstract representations of the underlying positions of specific arrangements of carbon atoms along the backbone.
There are tools such as DSSP https://en.wikipedia.org/wiki/DSSP_(hydrogen_bond_estimation... which will take the 3D structure determined by crystallography and spit out the ribbons and helices. For example, for helices, you can see a specific arrangement of carbons along the protein's backbone in 3D space (each carbon interacts with a carbon 4 amino acids down the chain).
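To make the i → i+4 point concrete, a crude sketch (pure numpy on a hypothetical array of alpha-carbon coordinates; real DSSP assigns secondary structure from backbone hydrogen-bond energetics, not bare CA distances):

    import numpy as np

    # In an alpha helix, the CA of residue i sits roughly 6.2 A from the CA
    # of residue i+4; checking that distance is a poor man's helix detector.
    def helix_like(ca, lo=5.0, hi=6.5):
        """ca: (n_residues, 3) alpha-carbon coordinates in Angstroms."""
        d = np.linalg.norm(ca[4:] - ca[:-4], axis=1)   # i -> i+4 distances
        return (d > lo) & (d < hi)

    # Ideal helix: rise 1.5 A/residue, 100 degrees/residue, radius 2.3 A.
    t = np.arange(30) * np.deg2rad(100)
    ca = np.c_[2.3 * np.cos(t), 2.3 * np.sin(t), 1.5 * np.arange(30)]
    print(helix_like(ca).mean())   # 1.0: every window looks helical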
Protein motion at room temperature varies depending on the protein- some proteins are rocks that stay pretty much in the same single conformation forever once they fold, while others do thrash around wildly and others undergo complex, whole-structure rearrangements that almost seem magical if you try to think about them using normal physics/mechanical rules.
Having a magical machine that could output the full manifold of a protein during the folding process at subatomic resolution would be really nice! But there would be a lot of data to process.
Thanks, awesome! So what do molecular biologists do with these 3D representations once they have them? Do they literally just see how they fit to other proteins?
All of the loops and swirls are summary representations of known atomic positions: really, knowing a protein structure means knowing the position of every atomic nucleus, relative to the other nuclei, down to some small resolution, and assuming a low temperature.
The atoms do wiggle around a bit at room temperature (and even more at body temperature), which means that simulating them usefully typically requires sampling from a probability distribution defined by the protein structure and some prior knowledge about how atoms move (often a potential energy surface fitted to match quantum mechanics).
There are many applications of these simulations. One of the most important is drug design: knowing the structure of the protein, you can zoom in on a binding pocket and design a set of drug molecules which might disable it. Within the computer simulation, you can mutate a known molecule into each of your test molecules and measure the change in binding affinity, which tells you pretty accurately which ones will work. Each of these simulations requires tens of millions of samples from the atomic probability distribution, which typically takes a few hours on a GPU given a good molecular dynamics program.
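A minimal sketch of the sampling idea (Metropolis Monte Carlo on a made-up 1D double-well energy; production free-energy work uses molecular dynamics with real force fields, but the Boltzmann-sampling principle is the same):

    import math, random

    # Sample x from exp(-E(x)/kT), the same Boltzmann distribution MD engines
    # sample over atomic coordinates when estimating binding free energies.
    def E(x):
        return (x * x - 1.0) ** 2          # toy double well, minima at x = -1, +1

    kT, x, samples = 0.3, 0.0, []
    for _ in range(100_000):
        x_new = x + random.gauss(0.0, 0.3)             # trial move
        dE = E(x_new) - E(x)
        if dE <= 0 or random.random() < math.exp(-dE / kT):
            x = x_new                                  # Metropolis acceptance
        samples.append(x)

    print(sum(s > 0 for s in samples) / len(samples))  # ~0.5 by symmetry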
Some proteins have 3D structures that look like abstract art only because we don't have an intuitive understanding of what shape and amino acids are necessary to convert chemical A to chemical B, which is the main purpose of many enzymes in the body. If you look at structural proteins or motor proteins, on the other hand, their function is clear from their shape.
There are a lot of other things you can do with the shape. If it has a pore, you can estimate the size and type of small molecule that could travel through it. You can estimate whether a binding site is accessible to the environment around it. You can determine if it forms a multimer or exists as a single unit. You can see if protein A and protein B have drastically different shapes given similar sequences, which might have implications for its druggability or understanding its function.
> Are people just browsing these shapes and seeing patterns in them
That's one approach.
The thing to understand is that proteins form "binding sites": areas that are more likely to attract other particular regions of proteins or other molecules, or even atoms. Think about hemoglobin. The reason it holds onto oxygen atoms is because it has binding sites.
Binding sites are great because they represent more freedom to do things than molecules typically have. Normal chemistry consists of forming strong electronic bonds between atoms, or forming rigid lattices/crystals.
Binding sites allow molecules to do things like temporarily attach to each other and let each other go under certain circumstances, for instance when another binding site is active/inactive. This can happen through "conformation change", where a molecule bound/unbound on some binding site makes the protein change shape slightly. This is how proteins can act like machines.
> What do the "ribbons" signify
Different regions of the protein have different sequences of amino acids. Amino Acids have somewhat different shapes from each other. The ribbons are actually broader than the spindles (or threads), and less flexible. Not sure about the different colors, maybe someone else can fill in.
> Also, is that what proteins would really look like if you could see at sub-optical wavelength resolutions?
Not really, it's an abstraction. They're big molecules, so if you look closely they're made of atoms, which are (kinda, sorta not really, quantum stuff) spherical.
> So wouldn't they be "thrashing around" like a rag doll in a blender at room temperature?
Yes, but the attractions between the different parts of the molecule keep it somewhat under control. So more like an undulating little creature, jellyfish perhaps.
> It seems strange to me that something like that could be so central to life
Yep, gotta remember that it's all statistical. These things are getting made, doing their job, breaking, and getting degraded some insane number of times per second. Swarm behavior, sort of.
Short answer is that the ribbon representation is a visual simplification based on known structures -- they are actually composed of atoms.
They certainly do "thrash around", but that thrashing is constrained by the bonds that are formed, which greatly limits the degrees of freedom. Here's a short video of a simulation to demonstrate:
I've been going through MIT's online Introduction to Biology course[0] that answers some of your questions here with regards to the shapes and what they signify - specifically the "Proteins and Protein Structure" lessons in the second unit, although some of the previous lectures are helpful setup as well - really interesting and engaging stuff, taught by Eric Lander (who ended up being one of the CRISPR pioneers featured in Isaacson's latest book)
That's cool, I just happened to have picked up a used copy of the text on which the course is based... "Molecular Biology of the Cell" -- the huge grey book. Geez, there's a lot of material in there!
Back in the day, I had steered away from chemistry in college because I didn't like to memorize stuff. Now I realize I missed out on some amazing knowledge.
> I recall from school the equipartition theorem-- 1/2 kT of kinetic energy for each degree of freedom. These things obviously have many degrees of freedom. So wouldn't they be "thrashing around" like a rag doll in a blender at room temperature?
It's funny you say that, because the first image on the English Wikipedia page for Equipartition Theorem[1] is an animation of the thermal motion of a peptide.
BTW, in terms of protein dynamics, before you even think about the thrashing around: ~1.2 kT at room temperature is enough to form and break hydrogen bonds in real time (they're around 1-2 kcal/mol), so presumably protein H-bonds are breaking and reforming spontaneously at scale.
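For anyone who wants to sanity-check that energy scale, the back-of-envelope numbers (standard constants, nothing protein-specific):

    # Thermal energy per degree of freedom vs. a typical hydrogen bond.
    k_B = 1.380649e-23       # J/K
    T = 298.0                # K, room temperature
    N_A = 6.02214076e23      # 1/mol
    kcal = 4184.0            # J

    kT_kcal_mol = k_B * T * N_A / kcal
    print(f"{kT_kcal_mol:.2f} kcal/mol")   # ~0.59, same order as a 1-2 kcal/mol H-bond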
Your "now what?" question is legitimate and reminiscent of reactions after the completion of the Human Genome Project.
Just like having a human genome sequence, this is not a magic key that solves all problems of biology but a building block for use by researchers. An investigator may look up the folded structure of a protein and use that information to glean certain context-specific insights from it such as how exactly two interacting proteins interact mechanically.
The other significant benefit is that this frees up resources that were spent having to figure out the structure in other ways. It's an efficiency improvement.
Watch this video on DNA polymerase [1]. Obviously it’s an illustration, but I think it helps answer your question because cartoons are great. (MD, not PhD biologist)
The ability for another molecule (probably another protein) to "react" or interact with the protein depends not only on the chemistry but also the shape. An otherwise compatible sequence of atoms might not be able to react because it and the binding site are just incompatibly shaped.
This is hugely important for developing drugs and vaccines.
To see the effect of this look no further than prions. Prions are the exact same protein, just folded in a weird way. Worse, they can "transmit" this misfolded shape to otherwise normal proteins. Prions behave differently just because of the different shape and can lead to disease. This is exactly what Mad Cow Disease (BSE) is.
What we get taught in high school about chemistry is incredibly oversimplified.
One example of this I like is the geometry of a water molecule. When we first learn about atoms, we learn the "solar system" model (aka Bohr). The reality is instead that we have 3D probability distributions of where electrons might be. These clouds are in pairs. I believe this has to do with the inverted wavefunction, but really we're getting beyond my knowledge of quantum mechanics here, so that's just a guess.
Well, those clouds additionally form valence shells. We learn about these and how atoms want to form complete valence shells. So oxygen ends up with 8 electrons around it, i.e. 4 pairs of electrons. When bonding with 2 hydrogen atoms we end up with a weird geometry of ~104.5 degrees between the two hydrogen atoms because of how these pairs interact. The naive assumption might be that the two hydrogen atoms are 180 degrees apart.
So back to proteins: you may have learned about hydrogen bonds. These affect molecular shape because when a hydrogen atom shares its electron in a bond, it often carries a partial positive charge. That positive charge pushes away other positive charges. This is the real difficulty in protein folding: with a molecule of thousands of atoms and weird geometry, you may find distant parts of the molecule interacting through hydrogen bonds.
So a single cell consists of thousands (IIRC) of different proteins. Figuring out those interactions is important but incredibly difficult.
This is probably one of the best applications of AI in science in terms of impact so far. I can't think of any other problem with the same potential impact.
AlphaFold is the best counterpoint to tech cynics.
One of the largest public tech companies in the world funded a multi-year scientific project, executed the research flawlessly and moved forward an entire scientific field. They then went on to openly release the code _and_ data, working with a publicly funded organization (EMBL-EBI) to ensure researchers across the globe can easily access the outputs.
I'm not arguing that every tech company is a net positive for humanity. Google itself isn't perfect. Google + DeepMind is setting a bloody high bar though.
This is definitely one of the most exciting spaces in AI right now. Another somewhat-related startup is PostEra (medicinal chemistry for drug discovery via AI) https://postera.ai/about/
You are right, and thinking about it I can see 2 problems where I hope AI can have even more impact in the future:
1. Using AI to determine the most efficient methods of doing mathematical expressions, transformations and computation algorithms - division, square root, maybe traveling salesman - the ones which take a relatively high number of CPU cycles to compute and are used everywhere. If inputs and outputs can be assigned to it, AI can eventually build a transformation which can be reproduced in silicon.
2. Physics phenomena in general, not only organic proteins, can be measured, and with sufficient ability to quantize them into inputs and experimentally obtained outputs to train the network, we could in theory establish new formulas or constants and progress the understanding of the Universe.
jarenmf said "in science" - but it is an interesting question how much automated translation has helped scientists translate papers from other languages.
Can someone put AlphaFold's problem space into perspective for me?
Why is protein folding important? Theoretical importance? Can we do something with protein folding knowledge? If so, what?
I've been hearing about AlphaFold from the CS side. There they seem to focus on protein folding primarily as an interesting space to apply their CS efforts.
Suppose we knew: (a) the structure of every protein (what DeepMind is doing here)
(b) how different protein structures interact (i.e. protein complexes - DeepMind is working on this but not there yet)
Then we could use those two building blocks to design new proteins (drugs) that do what we want. If we solve those two problems with very high accuracy, we can also reduce the time it takes to go from starting a drug discovery programme to approved medicine.
Obtaining all protein structures and determining how they interact is a key step towards making biology more predictable. Previously, solving the structure of a protein was very time consuming. As a result, we didn’t know the structure for a majority of proteins. Now that it’s much faster, downstream research can move faster.
Caveat: we should remember that these are all computational predictions. AlphaFold’s predictions can be wrong and protein structures will still need to be validated. Having said that, lots of validation has already occurred and confidence in the predictions grows with every new iteration of AlphaFold.
> Then we could use those two building blocks to design new proteins (drugs) that do what we want. If we solve those two problems with very high accuracy, we can also reduce the time it takes to go from starting a drug discovery programme to approved medicine.
Drugs are usually not proteins, but instead small molecules that are designed to help or interfere with the operation of proteins.
You are basically made of proteins, which are basically folded sequences of amino acids. Proteins are molecular machines that are the fundamental building blocks of animals, plants, bacteria, fungi, viruses, etc.
So yeah the applications are enormous, from medicine to better industrial chemical processes, from warfare to food manufacturing.
Does that imply proteins have some dynamics that need to be predicted too? I remember seeing animations of molecular machines that appeared to be "walking" inside the body - are those proteins or more complex structures?
As others have already mentioned, proteins are the machinery of the cell. They perform an immense array of functions and they must fold in a certain way to perform these functions. This is part of what's known as the structure-function relationship.
Misfolded proteins are contributors to numerous pathological conditions and the more we can understand about how and why this folding happens, the better we can treat these conditions.
Another aspect is that while we can at least partially determine the primary structure (the amino acid sequence) of proteins from DNA and RNA, we don't necessarily know their secondary or tertiary structures (3 dimensional conformation). This is a key piece of the puzzle for figuring out how these proteins do their proteiny things and how they interact with other proteins and even how they form quaternary structures with other proteins (an assembly of multiple proteins that perform some function, many pores are assemblies like this). Once we know these structures and understand how they work on a structural and chemical level, we can manipulate them far more easily.
In order to do rational drug design, which is designing a drug for a specific target or active site on a protein, we need to understand these structures. Working to solve protein folding is a key step in treating disease states and understanding how cells work on a fundamental level. The impact is hard to overstate.
My understanding is that protein folding is a major cost bottleneck in drug design.
Researchers can come up with candidate molecule formulas that might work as good drugs, but the problem is that these proteins organize/fold themselves physically in a hard-to-predict way. And how they fold directly affects their properties as drugs.
If AlphaFold can accurately predict folding, it’ll allow researchers to prioritize drug candidates more accurately which will reduce research time and costs. Supposedly the major pharmaceutical companies can spend up to billions when designing a single drug. Optimistically, predicting protein folding better will allow for much more rapid and cheaper drug development
I love AlphaFold, but this is a big misconception. The biggest cost bottleneck in drug development and design, by orders of magnitude, is associated with assaying (and potentially reducing) off-target binding or toxicity and assaying (and potentially increasing) efficacy. Determining a protein structure empirically with cryo-EM, NMR, or crystallography will generally cost less than $1M (sometimes far less), which is tiny compared to the many millions or billions of dollars that get poured into clinical trials for a single drug. AF2 is useful in some basic research cases but isn't really that useful for traditional drug design and development.
A machine learning approach for predicting toxicity would have a far greater impact on public health than AF2 does.
My understanding is that protein folding is not a bottleneck in drug design.
Yes, once you identified a target protein, its structure is useful to selectively target it. But the main bottleneck is identifying such targets. In other words, the main difficulty is to figure out what to hit, not how to hit it, and protein folding mostly helps with how at the moment.
Proteins are what makes everything in a cell work. They are produced as a "linear" structure that must fold into a proper shape to execute its function, such as acting as a pore that only lets a specific chemical through the cell membrane.
The importance here is to figure out potential targets for treatments that take into account particularities of certain proteins. That could produce better drugs with fewer side effects.
The genome, all of our DNA combined, is just a bunch of 1D strings like "cgtattctgcttgta". Those strings encode proteins, which fold up into a 3D shape once created. This 3D shape is what determines what the protein actually does inside the cell. Without understanding protein folding we don't understand what the DNA actually does.
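The 1D decoding step itself is a mechanical table lookup; it's the 3D step that's hard. A toy fragment (hand-picked excerpt of the genetic code, not the full 64-codon table):

    # DNA -> amino acid sequence is trivial; the protein's 3D shape is not.
    CODONS = {"atg": "M", "tgt": "C", "cgt": "R", "att": "I",
              "tct": "S", "tga": "*"}   # * = stop
    dna = "atgtgtcgtatttcttga"
    protein = "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))
    print(protein)   # MCRIS* -- the hard question is this string's 3D structure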
The applications and importance have been discussed, but let me explain why what we are doing right now does not work, which will also emphasize the importance of this.
At this time, we create drugs, test them on animals, and see what the side effects and results actually are. We are very limited in our capabilities and basically throw mud at the wall and see what sticks. This would allow us to try potential drug candidates without so much randomness.
There are a million articles and podcasts explaining exactly your question. Those will be better than HN responses. I suggest you take 15 seconds to Google it.
Yes there are a million articles. That is why asking a question here on HN is useful. The HN community more often than not offers intelligent insight as well as curated recommended links for learning more about a topic. Yes, the signal-to-noise ratio isn't perfect on HN, but it is a lot better than random Google searches. If nothing else, it often leads to enough basic understanding so that someone can then perform more refined and therefore more productive Google searches. I appreciate the basic questions and the generous nature of many HN commenters who offer thoughtful responses.
The endgame of protein folding is nanotechnology. Life is also nanotechnology. So basically the end game is to take full control of life: change existing life forms, design new life forms, new ecosystems. Eat the earth first, then the universe. Not yet there, but getting much closer.
Proteins are the molecular machines of all living beings on this planet. They do almost everything. We need to understand their structure and how that structure changes when they perform some function. Essentially we have to reverse engineer nature's machines.
Biology has traditionally studied these things from the outside in, top down. Advances in biology are correlated with advances in microscope technology. We went from seeing with the naked eye to being able to magnify tissue to see cells and even what's inside of them.
Studying biology from the inside out, bottom up is significantly harder. You have chemistry, which is hard science. On top of that you have biochemistry, which is equally rigorous. Above that level of abstraction things start to get a lot less well defined. We go from organic chemistry (carbon, hydrogen, oxygen, etc.) to more complex things found in life (amino acids, carbohydrates, lipids) to polymers of those things (proteins, polysaccharides, triglycerides) to a point where they get so big and complex their three-dimensional structure starts influencing how they interact with other molecules. Then we start seeing aggregates (quaternary structures) of these complex things!
Important proteins like ATP synthase are well understood (it's a friggin' turbine like in a hydroelectric power plant!) but so many others haven't received the same attention. Bioinformatics technology is changing that. The human genome's been mapped. Now it's time to map the proteome.
AlphaFold is a phenomenal tool that demonstrates how AI can already outclass humans for certain tasks. It is a prime example of a problem space where conventional approaches are simply inferior, and that AI is not just a fancy name but can be extraordinarily powerful.
Write a short abstract for an article on why AlphaFold isn't just a tremendous technological achievement, but an example of how Artificial Intelligence will allow humanity to continue to achieve technological progress in the future:
Response:
When DeepMind’s AlphaFold was announced as the winner of the CASP13 protein folding prediction competition, it was hailed as a major breakthrough in both artificial intelligence (AI) and biology.
While AlphaFold is undoubtedly a remarkable achievement, it is also a clear example of how AI can be used to solve complex problems that have traditionally been seen as beyond the reach of machines.
In particular, AlphaFold represents a significant advance in the ability of AI systems to learn from limited data. Unlike previous AI approaches to protein folding, which relied on large datasets of known protein structures, AlphaFold was able to produce accurate predictions with only a small number of known protein structures to work from.
This ability to learn from limited data is likely to be increasingly important as we face problems that are too complex for us to fully understand and collect all the data we would need to solve them using traditional methods.
In the future, AI systems like AlphaFold will become increasingly important in helping us to make progress on a wide range of pressing challenges, from developing new medicines to tackling climate change.
> demonstrates how AI can already outclass humans for certain tasks
I'm not sure how clear the edge over humans in this case is. There were some attempts at machine assisted human solving like Foldit that did produce results: https://en.wikipedia.org/wiki/Foldit#Accomplishments
Many thanks to Deepmind for releasing predicted structures of all known protein monomers. What I'd like next is for Alphafold (or some other software) to be able to show us multimeric structures based on the single monomer/subunit predictions and protein-protein interactions (i.e. docking). For example the one I helped work on back in my structural biology days was the circadian clock protein KaiC : https://www.rcsb.org/structure/2GBL, that's the "complete" hexameric structure that shows how each of the subunits pack. The prediction for the single monomer that forms a hexamer is very close to the experimental https://alphafold.ebi.ac.uk/entry/Q79PF4 and in fact shows the correct structure of AA residues 500 - 519 which we were never able to validate until 12 years later (https://www.rcsb.org/structure/5C5E) when we expressed those residues along with another protein called KaiA which we knew binds to the "top" CII terminal (AAs 497-519) of KaiC. If we would have had this data then, it would have allowed us to not only make better predictions about biological function and protein-protein interactions but would have helped better guide future experiments.
What we can do with this data now is use methods such as cryo-em to see the "big picture", i.e. multi-subunit protein-protein interactions where we can plug in the Alphafold predicted structure into the cryo-em 3d density map and get predicted angstrom level views of what's happening without necessarily having to resort to slower methods such as NMR or x-ray crystallography to elucidate macromolecular interactions.
A small gripe about the AlphaFold EBI website: it doesn't seem to show the known experimental structure, it just shows "Experimental structures: None available in PDB". For example, the link to the AlphaFold structure above should link to 2GBL, 1TF7, or any of the other KaiC structures from organism PCC7942 at RCSB. This would require merging/mapping data from RCSB with EBI and at least doing some string matching; hopefully they're working on it!
It's impossible to really put a number on it, because the task itself was impossible. PhDs and the field's top scientists simply couldn't figure out many complicated protein structures after years of attempts, and the fact that there are so many (200M+) means that the problem space is vast.
It doesn't make any sense on multiple levels. This is a computational prediction and there was no computational alternative; many of these proteins would never have had a structure solved even if you spent the money. They are just taking $cost_per_structure_solved * number_of_remaining_structures and assuming that things scale linearly like that.
Note that crystallographers are now using these predictions to bootstrap models of proteins they've struggled to work with, which indicates the level of trust in the structural community for these predictions is pretty high.
(200 trillion cost) / (200 million structures predicted) = 1 million per structure.
That reflects the personnel cost (5-yr PhD scholarship, postdoc/prof mentorship; investment + depreciation for the lab equipment). All this to crystallize 1 structure and characterize its folding behavior.
I don't know if this calculation is too simplistic, just coming up with something.
Disclaimer: I work at Google, organizationally far away from DeepMind, and my PhD is in something very unrelated.
They can't possibly know that. What they know is that their guesses are very significantly better than the previous best and that they could do this for the widest range in history. Now, verifying the guess for a single protein (of the hundreds of millions in the db) is an expensive project of up to two years. Inevitably some will show discrepancies. These will be fed to regression learning, giving us a new generation of even better guesses at some point in the future. That's what I believe to be standard operating practice.
A more important question is: is today's db good enough to be a breakthrough for something useful, e.g. pharma or agriculture? I have no intuition here, but the reporting claims it will be.
The press release reads like an absurdity. It's not the "protein universe", it's the "list of presumed globular proteins Google found and some inferences about their structure as given by their AI platform".
Proteins don't exist as crystals in a vacuum; that's just how humans solved the structures. Many of the non-globular proteins were solved using sequence manipulation or other tricks to get them to crystallize. Virtually all proteins exist to have their structures interact dynamically with the environment.
Google is simply supplying a list of what it presumes to be low-RMSD models based on their tooling, for some sequences they found, and the tooling itself is based on data mostly from X-ray studies that may or may not have errors. Heck, we've barely even sequenced most of the DNA on this planet, and with mechanisms like alternative splicing the transcriptome, and hence the proteome, has to be many orders of magnitude larger than what we have knowledge of.
But sure, Google has solved the structure of the "protein universe", whatever that is.
Same as any other prediction I'd presume. Run it against a known protein and see how the answer lines up. Predict the structure of an unknown protein, then use traditional methods (x-ray crystallography, maybe STEM, etc) to verify.
"Verify" is almost correct. The crystallography data is taken to be "ground truth" and the predicted protein structure from AlphaFold is taken to be a good guess starting point. Then other software can produce a model that is a best fit to the ground truth data starting from the good guess. So even if the guess is wrong in detail it's still useful to reduce the search space.
As we gain visibility into the complex coding of proteins, we need to be right. Next, hopefully, comes causal effect identification, then construction ability.
If medicine can use broad capacity to create bespoke proteins, our world becomes both weird and wonderful.
They don't, but they are more correct than what others have predicted. Some of their predictions can be compared with structures determined with X-ray crystallography.
They won a decades-old standing challenge by predicting the protein structures of a much smaller (yet still quite large) set of proteins using a model (AlphaFold).
Then they use the model to predict more.
Although we don't know if they are correct, these structures are the best (or the least bad) we have for now.
We know the structure of some proteins. It's not that it's impossible to measure, it's just very expensive. This is why having a model that can "predict" it is so useful.
They compare the predicted structure (computed) to a known structure (physical X-ray crystallography). There's a biennial competition, CASP (Critical Assessment of protein Structure Prediction), organized around proteins whose structures have been solved experimentally (e.g. by X-ray crystallography) but not yet released. The experimental structures are held secret by the organizers. Then research teams across the world present their models and attempt to predict, without advance knowledge, the structure of each protein from its amino acid sequence. Think of CASP as a validation data set used to evaluate a machine learning model.
DeepMind crushes everyone else at this competition.
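For the curious, the simplest version of that comparison is RMSD after optimal superposition (the Kabsch algorithm). CASP's headline metric is GDT_TS, which is more forgiving of a few badly placed regions, but the flavor is the same. The coordinates below are synthetic:

    import numpy as np

    def kabsch_rmsd(P, Q):
        """P, Q: (n_atoms, 3) coordinates (e.g. alpha carbons), row-matched."""
        P = P - P.mean(axis=0)                        # remove translation
        Q = Q - Q.mean(axis=0)
        U, S, Vt = np.linalg.svd(P.T @ Q)
        d = np.sign(np.linalg.det(U @ Vt))            # avoid improper rotation
        R = U @ np.diag([1.0, 1.0, d]) @ Vt           # optimal rotation
        return np.sqrt(((P @ R - Q) ** 2).sum(axis=1).mean())

    rng = np.random.default_rng(1)
    experimental = rng.normal(size=(50, 3))                          # "ground truth"
    predicted = experimental + rng.normal(scale=0.5, size=(50, 3))   # noisy guess
    print(kabsch_rmsd(predicted, experimental))       # small = good prediction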
The worry is about dataset shift. Previously, the data were collected for a few hundred thousand structures; now it is 200M. I think there could be doubts about the distributions and how that could play a role in prediction accuracy.
>we’re now releasing predicted structures for nearly all catalogued proteins known to science
Is the result that researchers will now much more quickly 'manually' validate or invalidate the predicted structures for proteins they are working with? I understand it is traditionally a long and complex process, but I imagine it is expedited by having a predicted structure to test as the baseline?
Some people are using "AI wins a Nobel Prize" as the new Turing test. Maybe that is going to happen sooner than they expect. Or maybe the owners of the AI will always claim it on its behalf.
There's no AI here. This is just ML. All DeepMind did here was use multiple excellent resources (large numbers of protein sequences, and small numbers of protein structures) to create an approximation function of protein structure, without any of the deep understanding of "why". Interestingly, the technology they used to do this didn't exist 5 years ago!
Today I learned that there are bacteria that have a protein helping to form ice on plants [1] to destroy them and extract nutrients (however I didn't understand how bacteria themselves survive this).
Machine learning typically uses existing data to predict new data. Please explain: does it mean that AlphaFold can only use known types of interactions between atoms and will mispredict the structure of proteins that use not-yet-known interactions?
And why we cannot just simulate protein behaviour and interactions using quantum mechanics?
>And why we cannot just simulate protein behaviour and interactions using quantum mechanics?
If you wanted to simulate the behaviour of an entire protein using quantum mechanics, the sheer number of calculations required would be infeasible.
For what it's worth, I have a background in computational physics and am doing a PhD in structural biology. For any system (of any size) that you want to simulate, you have to consider how much information you're willing to 'ignore' in order to focus on the information you would like to 'get out' of a set of simulations. Being aware of the approximations you make and how this impacts your results is crucial.
For example, if I am interested in how the electrons of a group of Carbon atoms (radius ~ 170 picometres) behave, I may want to use Density Functional Theory (DFT), a quantum mechanical method.
For a single, small protein (e.g. ubiquitin, radius ~ 2 nanometres), I may want to use atomistic molecular dynamics (AMD), which models the motion of every single atom in response to thermal motion, electrostatic interactions, etc using Newton's 2nd law. Electron/proton detail has been approximated away to focus on overall atomic motion.
In my line of work, we are interested in how big proteins (e.g. the dynein motor protein, ~ 40 nanometres in length) move around and interact with other proteins at longer time (micro- to millisecond) and length (nano- to micrometre) scales than DFT or AMD. We 'coarse-grain' protein structures by representing groups of atoms as tetrahedra in a continuous mesh (continuum mechanics). We approximate away atomic detail to focus on long-term motion of the whole protein.
Clearly, it's not feasible to calculate the movement of dynein for hundreds of nanoseconds using DFT! The motor domain alone in dynein contains roughly one million atoms (and it has several more 'subunits' attached to it). Assuming these are mostly carbon, oxygen or nitrogen, then you're looking at around ten million electrons in your DFT calculations, for a single step in time (rounding up). If you're dealing with the level of atomic bonds, you're probably going to use time steps between a femtosecond (10^-15 s) and a picosecond (10^-12 s). The numbers get a bit ridiculous. There are techniques that combine QM and AMD, although I am not too knowledgeable in this area.
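The back-of-envelope numbers for why (rough, order-of-magnitude only):

    # Scale of an all-atom (let alone DFT) treatment of dynein's motor domain:
    # ~1e6 atoms, femtosecond timesteps, hundreds of nanoseconds of biology.
    atoms = 1_000_000
    timestep_s = 1e-15              # 1 fs
    biological_time_s = 100e-9      # 100 ns

    steps = biological_time_s / timestep_s
    print(f"{steps:.0e} timesteps")             # 1e+08
    print(f"{steps * atoms:.0e} atom-updates")  # 1e+14, before any electron detail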
Some further reading, if you're interested (I find Wikipedia articles on these topics to generally be quite good):
To add to this comment (from someone who used to engineer proteins, and long ago DFT as well): DFT is only really decent at ground state predictions, computational chemists often have to resort to even more expensive methods to capture "chemistry", i.e. correlated electron-pair physics and higher-state details. Simulating catalysis is extremely challenging!
> And why we cannot just simulate protein behaviour and interactions using quantum mechanics?
QM calculations have been done in proteins, but they’re computationally very expensive. IIRC, there are hybrid approaches where only a small portion of interest in the protein structure is modelled by QM and the rest by classical molecular mechanics.
The press release is a bit difficult to place into historical context. I believe that the first AlphaFold release was mostly human and mouse proteins, and this press release marks the release of structures for additional species.
> I believe that the first AlphaFold release was mostly human and mouse proteins,
More than that. The press release actually contains an infographic comparing the amount of published protein models for different clades of organisms. The infographic shows that the previous release (~1 million proteins) contained proteins of some animal, plant, bacterial, and fungal species.
A fun way I've been thinking about all this is what nanotech/nanobots are actually going to look like. Tiny little protein machines doing what they've been doing since the dawn of life. We now have a library of components, and as we start figuring out what they can do, and how to stack them, we can start building truly complex machinery for whatever crazy tasks we can imagine. The impact goes so far beyond drugs and treatments.
I had a dream about this a few days ago. About complexly wrinkled/crumpled/convolved things.
Like a fresh crepe stuffed into the toe of a boot. Bewilderingly complex.
But I have a question. Does such contortion work for 3d "membranes" in a 4d space? It's something I'm chewing on. Hard to casually visualize, obviously.
Of course! The term you might wanna start off googling is "curvature of manifolds". What's even neater than "3d thing curving in 4d space" is that these notions can be made precise also without the "in [whatever] space" part (see "intrinsic curvature" and "Riemannian manifold").
I haven’t had a chance to look through some of the new predictions, but I know there were some issues with predicting the structure for membrane bound proteins previously. PDB hardly contains any.
Does the new set of predictions contain a bunch of membrane-bound proteins?
Folding@home answers a related but different question. While AlphaFold returns the picture of a folded protein in its most energetically stable conformation, Folding@home returns a video of the protein undergoing folding, traversing its energy landscape.
How do you know that the predicted structure will be correct? I presume researchers will need to validate the structure empirically. Do we know how good the model has been at predicting so far?
Just imagine if the tech world puts all programatic advertising development on hold for a year and the collective brain power is channeled to science instead…
To answer my own question: it looks like, for folks who don’t want to wait 21 months for 21 terabytes, it might cost approximately 1600 USD to download the full approx. 20TB dataset, assuming egress costs of 0.08 USD per GB as mentioned here: https://cloud.google.com/storage/pricing#network-egress
It’s a pity it’s so expensive to download
> Today, I’m incredibly excited to share the next stage of this journey. In partnership with EMBL’s European Bioinformatics Institute (EMBL-EBI), we’re now releasing predicted structures for nearly all catalogued proteins known to science, which will expand the AlphaFold DB by over 200x - from nearly 1 million structures to over 200 million structures - with the potential to dramatically increase our understanding of biology.
And later:
> Today’s update means that most pages on the main protein database UniProt will come with a predicted structure. All 200+ million structures will also be available for bulk download via Google Cloud Public Datasets, making AlphaFold even more accessible to scientists around the world.
This is the actual announcement.
UniProt is a large database of protein structure and function. The inclusion of the predicted structures alongside the experimental data makes it easier to include the predictions in workflows already set up to work with the other experimental and computed properties.
It's not completely clear from the article whether any of the 200+ million predicted structures deposited to UniProt have not been previously released.
Protein structure determines function. Before AlphaFold, experimental structure determination was the only option, and that's very costly. AlphaFold's predictions appears to be good enough to jumpstart investigations without an experimental structure determination. That has the potential to accelerate many areas of science and could percolate up to therapeutics.
One area that doesn't get much discussion in the press is the difference between solid state structure and solution state structure. It's possible to obtain a solid state structure determination (X-ray) that has nothing to do with actual behavior in solution. Given that AlphaFold was trained to a large extent on solid state structures, it could be propagating that bias into its predicted structures.
This paper talks about that:
> In the recent Critical Assessment of Structure Prediction (CASP) competition, AlphaFold2 performed outstandingly. Its worst predictions were for nuclear magnetic resonance (NMR) structures, which has two alternative explanations: either the NMR structures were poor, implying that Alpha-Fold may be more accurate than NMR, or there is a genuine difference between crystal and solution structures. Here, we use the program Accuracy of NMR Structures Using RCI and Rigidity (ANSURR), which measures the accuracy of solution structures, and show that one of the NMR structures was indeed poor. We then compare Alpha-Fold predictions to NMR structures and show that Alpha-Fold tends to be more accurate than NMR ensembles. There are, however, some cases where the NMR ensembles are more accurate. These tend to be dynamic structures, where Alpha-Fold had low confidence. We suggest that Alpha-Fold could be used as the model for NMR-structure refinements and that Alpha-Fold structures validated by ANSURR may require no further refinement.
>Not sure what else to say. Structural biology has always been the weirdest field I’ve seen, the way students are abused (crystallize and publish in nature or go bust), and how every nature issue will have three structure papers as if that cures cancer every day. I suppose it warps one’s perception of outsiders after being in such a bubble?
What on earth are you even talking about? The vast, VAST majority of structures go unpublished ENTIRELY, let alone published in nature. There are almost 200,000 structures on deposit in the PDB.
> Comparing alphafold to conventional homology modeling is disingenuous at its most charitable interpretation.
It's really not - have you played around with AF at all? Made mutations to protein structures and asked it to model them? Go look up the predicted structures for important proteins like FOXA1 [1], AR [2], EWSR1 [3], etc. (i.e. pretty much any protein target we really care about and haven't previously solved) and tell me with a straight face that AF has "solved" protein folding - it's just a fancy language model that's pattern matching to things it's already seen solved before.
signed, someone with a PhD in biochemistry.
[1] https://alphafold.ebi.ac.uk/entry/P55317 [2] https://alphafold.ebi.ac.uk/entry/P10275 [3] https://alphafold.ebi.ac.uk/entry/Q01844
This isn’t a good use of the term gaslighting. Accusing someone of gaslighting takes what we used to call a ‘difference of opinion’ and mutates it into deliberate and wicked psychological warfare.
Incidentally, accusing someone of gaslighting is itself a form of gaslighting.
Not only is CASP not "unwinnable," it's not even a contest. The criteria involved are rated as "moderately difficult." Alphafold is a significant achievement but it sure as hell hasn't "revealed the structure of the protein universe," whatever that means.
Which top labs have changed direction? Because Alphafold can't predict folds, just identify ones it's seen.
I've directly communicated with the leaders of CASP and at DM that they should stop representing this as a form of protein folding and just call it "crystal/cryoEM structure prediction" (they filter out all the NMR structures from PDB since they aren't good for prediction). They know it's disingenuous and they do it on purpose to give it more impact than it really deserves.
I would like to correct something here - it does predict structures de novo and predict folds that haven't been seen before. That's because of the design of the NN - it uses sequence information to create structural constraints. If those constraints push the modeller in the direction of a novel fold, it will predict that.
To me what's important about this is that it demonstrated the obvious (I predicted this would happen eventually, shortly after losing CASP in 2000).
>I would like to correct something here - it does predict structures de novo and predict folds that haven't been seen before. That's because of the design of the NN - it uses sequence information to create structural constraints. If those constraints push the modeller in the direction of a novel fold, it will predict that.
Could you expand on this? Basically it looks at the data, and figures out what's an acceptable position in 3D space for residues to occupy, based on what's known about other structure?
I will update my original post to point out that I may not be entirely correct there.
The distinction I'm trying to make is that there's a difference between looking at pre-existing data and modeling (ultimately homology modeling, but maybe slightly different) and understanding how protein folding works, being able to predict de novo how an amino acid sequence will become a 3D structure.
Also thank you for contacting CASP about this.
> There are 3-4 other deep learning projects I think have had a much greater impact on my field.
Don't leave us hanging... which projects?
1) IsoNet - takes low-SNR cryo-electron tomography images (which are extremely dose-limited, so incredibly blurry and frequently useless) and does two things:
* Deconvolutes some image aberrations and "de-noises" the images
* Compensates for missing wedge artifacts (the missing wedge is the fact that the tomography isn't done from -90° to +90°, but usually from -60° to +60°, leaving a 30° wedge of basically no information at the top and bottom), which usually show up as directional smearing of image density. So if you have a sphere, the top and bottom will be extremely noisy and stretched up and down (in Z); see the sketch after this list.
https://www.biorxiv.org/content/10.1101/2021.07.17.452128v1
2) Topaz, though Topaz really counts as 2 or 3 different algorithms. Topaz has denoising of tomograms and of flat micrographs (i.e. images taken with a microscope, as opposed to 3D tomogram volumes). That denoising is helpful because it increases contrast (which is the fundamental problem in cryo-EM for looking at biomolecules). Topaz also has a deep learning particle picker which is good at finding views of your protein that are under-represented or otherwise missing, which again normally results in artifacts when you build your 3D structure.
https://emgweb.nysbc.org/topaz.html
3) EMAN2 convolutional neural network for tomogram segmentation/Amira CNN for segmentation/flavor of the week CNN for tomogram segmentation. Basically, we can get a 3D volume of a cell or virus or whatever, but then they are noisy. To do anything worthwhile with it, even after denoising, we have to say "this is cell membrane, this is virus, this is nucleic acid" etc. CNNs have proven to be substantially better at doing this (provided you have an adequate "ground truth") than most users.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5623144/
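To make the missing wedge from point 1 concrete: here is a toy numpy sketch (mine, not from IsoNet; the 2D slice and grid size are arbitrary assumptions) of how much of Fourier space a -60° to +60° tilt series actually samples:

    import numpy as np

    def missing_wedge_mask(shape, tilt_range_deg=60.0):
        # Boolean mask over a 2D (kz, kx) slice of Fourier space: True where
        # a +/-tilt_range tilt series measures information, False inside the
        # missing wedge around the kz (beam) axis.
        nz, nx = shape
        kz = np.fft.fftfreq(nz)[:, None]
        kx = np.fft.fftfreq(nx)[None, :]
        theta = np.degrees(np.arctan2(np.abs(kz), np.abs(kx)))  # 0 = in-plane
        return theta <= tilt_range_deg

    mask = missing_wedge_mask((256, 256))
    print(f"sampled fraction of Fourier space: {mask.mean():.2f}")  # ~0.7

Everything where the mask is False is simply never measured, which is why the reconstruction smears in Z until something like IsoNet learns to fill it in.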
I asked a structural biologist friend of mine (world-class lab) about the impact of AlphaFold.
They said it's minimal.
In most cases, having a "probably" isn't good enough. They use AlphaFold to get early insights, but then they still use crystallography to confirm the structure. Because at the end of the day, you need to know for sure.
I'm not a biologist, but that doesn't sound minimal if crystallography is expensive.
It sounds like how we model airplanes in computers but still test the real thing - I wouldn't call the impact of computer modelling on airplane design minimal.
> it can't predict folds that haven't been seen
This seems strange to me. The entire point of these types of models is to predict things on unseen data. Are you saying DeepMind is completely lying about their model?
DeepMind won CASP; isn't the entire point of that competition to predict unseen structures?
If AlphaFold doesn't predict anything then what are you using it to do?
AlphaFold figures out that my input sequence (which has no structural data) is similar to this other protein that has structural data. Or maybe different parts of different proteins. It does this extremely well.
Disclaimer: I'm a professional (computational) structural biologist. My opinion is slightly different.
The problem with structure prediction is not the loss/energy function: even if we had an accurate model of all the forces involved, we'd still not have an accurate protein structure prediction algorithm.
Protein folding is a chaotic process (similar to the three-body problem). There's an enormous number of interactions involved - between different amino acids, the solvent, and more. Numerical computation can't solve chaotic systems exactly, because floating point numbers have a finite representation, which leads to rounding errors and loss of accuracy.
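A standard toy illustration of that sensitivity (the logistic map, nothing to do with proteins; the parameters are arbitrary) shows how a perturbation on the scale of floating-point rounding error swamps a chaotic trajectory:

    import numpy as np

    def logistic(x0, r=4.0, steps=60):
        # Iterate the chaotic logistic map x <- r * x * (1 - x).
        xs = [x0]
        for _ in range(steps):
            xs.append(r * xs[-1] * (1.0 - xs[-1]))
        return np.array(xs)

    a = logistic(0.3)
    b = logistic(0.3 + 1e-12)  # perturbation near float rounding error
    for step in (0, 20, 40, 60):
        print(step, abs(a[step] - b[step]))
    # The difference grows from 1e-12 to order 1: trajectories decorrelate.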
Besides, short-range electrostatic and van der Waals interactions are pretty well understood, and before AlphaFold many algorithms (like Rosetta) were already pretty successful at a lot of protein modeling tasks.
Therefore, we need a *practical* way to approach protein structure determination, akin to AlphaFold2.
As an outsider learning more about protein folding, could you elaborate on the assertion that the sequence is (mostly) all you need (transformer/ML reference intended)?
Doesn't this assume the final fold is static and invariant to environmental and protein interactions?
Put another way, how do we know that a protein does not fold differently under different environmental conditions or with different molecular interactions?
I realize this is a long-held assumption, but after studying scientific research for the past year, I realize many long-held assumptions aren't supported by convincing evidence.
> These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on it's head, etc.
I don't think that's necessarily so - there is a lot of justified scepticism about the wilder claims of ML in this forum; as an outsider to the field in question, it is in fact quite difficult at times to know how kneejerk that scepticism is.
Additionally, folding doesn't focus on what matters. Generally you want to understand the active site; you already know the context (globular, membrane, embedded, conjugated) of the protein. It would be interesting if the folding could help identify active sites for further analysis. But I don't think AlphaFold is identifying new active sites or improving our understanding of their nuances.
Right, but even a speed up / quality increase can flip workflows on their head. Take ray tracing for example, when you speed it up by an order of magnitude, you can suddenly go from taking a break every time you want to render a scene, vs being able to iteratively work on a scene and preview it as you work.
I got a lot of shit (still do) when the news first broke for pushing back against the notion that AlphaFold "solved" protein folding. People really wanted to attach that word to the achievement. Thank you for providing a nuanced take on exactly why that doesn't make any sense.
I'm curious to read more on the 3-4 other deep learning projects you mentioned that have had a larger impact on your fields. Can you share some links to those works?
Yup. It’s great, but there are still many aspects to unpack and work on. Hence why Rosetta is a thing.
Rosetta methods are also moving towards ML. Here’s an article from last week: https://www.science.org/doi/10.1126/science.abn2100
> AlphaFold is amazing homology modeling
If it is homology modelling, then how can it work without input template structures?
It has template structures. AlphaFold uses several databases: genetic sequence databases (e.g. UniRef90, BFD, MGnify) to build multiple sequence alignments, and PDB-derived databases (e.g. PDB70) for structural templates.
“Disclaim” stopped me.
Disclaim means to deny or renounce.
Can we just chill on the whole “using this single word incorrectly breaks your whole argument” thing?
A lot of folks on HN end posts about a company with a sentence like “Disclaimer: I used to work for X”. This language (probably taken from contract law or something) is meant as an admission of possible bias, but in practice it is also a signal that this person may know what they’re talking about more so than the average person. After reading a lot of posts like this, it might feel reasonable for someone to flip the word around and say something like “I need to disclaim…” when beginning a post, in order to signal their proximity to a topic or field as well as any insider bias they may possess.
So sure, “I need to disclose” would’ve been the better word choice, but we all knew what GP was saying. It seems pedantic to imply otherwise.
Or to make a disclaimer.. like the OP post did?
Merriam-Webster[1]: "Definition of disclaim: intransitive verb 1 : to make a disclaimer ..."
[1]: https://www.merriam-webster.com/dictionary/disclaim
I mean like what's this about AlphaFold is gone
I have a 5th-grader question about how proteins are used/represented graphically that I've never been able to find a satisfying answer to.
Basically, you see these 3D representations of specific proteins as a crumple of ribbons -- literally like someone ran multi-colored ribbons through scissors to make curls and dumped them on the floor (like a grade school craft project).
So... I understand that proteins are huge organic molecules composed of thousands of atoms, right? Their special capabilities arise from their structure/shape. So basically the molecule contorts itself into a low-energy state, which could be very complex but which enables it to "bind?" to other molecules expressly because of this special shape and do the special things that proteins do -- that form the basis of living things. Hence the efforts, like AlphaFold, to compute what these shapes are for any given protein molecule.
But what does one "do" with such 3D shapes?
They seem intractably complex. Are people just browsing these shapes and seeing patterns in them? What do the "ribbons" signify? Are they just some specific arrangement of C,H,O? Why are some ribbons different colors? Why are there also thread-like things instead of all ribbons?
Also, is that what proteins would really look like if you could see at sub-optical wavelength resolutions? Are they really like that? I recall from school the equipartition theorem: (1/2)kT of kinetic energy for each degree of freedom. These things obviously have many degrees of freedom. So wouldn't they be "thrashing around" like a rag doll in a blender at room temperature? It seems strange to me that something like that could be so central to life, but it is.
Just trying to get myself a cartoonish mental model of how these shapes are used! Anyone?
The ribbons and helices you see in those pictures are abstract representations of the underlying positions of specific arrangements of carbon atoms along the backbone.
There are tools such as DSSP https://en.wikipedia.org/wiki/DSSP_(hydrogen_bond_estimation... which will take the 3D structure determined by crystallography and spit out the ribbons and helices. For example, for helices, you can see a specific arrangement of carbons along the protein's backbone in 3D space (each carbon interacts with a carbon 4 amino acids down the chain).
Protein motion at room temperature varies depending on the protein- some proteins are rocks that stay pretty much in the same single conformation forever once they fold, while others do thrash around wildly and others undergo complex, whole-structure rearrangements that almost seem magical if you try to think about them using normal physics/mechanical rules.
Having a magical machine that could output the full manifold of a protein during the folding process at subatomic resolution would be really nice! but there would be a lot of data to process.
Thanks, awesome! So what do molecular biologists do with these 3D representations once they have them? Do they literally just see how they fit to other proteins?
All of the loops and swirls are summary representations of known atomic positions: really, knowing a protein structure means knowing the position of every atomic nucleus relative to the other nuclei, down to some small resolution, and assuming a low temperature.
The atoms do wiggle around a bit at room temperature (and even more at body temperature), which means that simulating them usefully typically requires sampling from a probability distribution defined by the protein structure and some prior knowledge about how atoms move (often a potential energy surface fitted to match quantum mechanics).
There are many applications of these simulations. One of the most important is drug design: knowing the structure of the protein, you can zoom in on a binding pocket and design a set of drug molecules which might disable it. Within the computer simulation, you can mutate a known molecule into each of your test molecules and measure the change in binding affinity, which tells you pretty accurately which ones will work. Each of these simulations requires tens of millions of samples from the atomic probability distribution, which typically takes a few hours on a GPU given a good molecular dynamics program.
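If it helps, here is a minimal sketch of that kind of sampling: Metropolis Monte Carlo on a made-up 1D double-well energy, standing in for the vastly higher-dimensional protein case (all names and numbers here are illustrative, not any production MD code):

    import numpy as np

    rng = np.random.default_rng(0)

    def energy(x):
        # Toy 1D double-well standing in for a potential energy surface.
        return (x**2 - 1.0)**2

    def metropolis(n_samples, kT=0.3, step=0.5):
        # Draw samples with probability proportional to exp(-E(x)/kT).
        x, samples = 0.0, []
        for _ in range(n_samples):
            x_new = x + rng.normal(scale=step)
            if rng.random() < np.exp(-(energy(x_new) - energy(x)) / kT):
                x = x_new  # accept the move; otherwise keep the old state
            samples.append(x)
        return np.array(samples)

    s = metropolis(100_000)
    # Samples pile up near the two energy minima at x = -1 and x = +1.
    print(np.mean(np.abs(np.abs(s) - 1.0) < 0.3))

Real engines use molecular dynamics rather than this toy sampler, but the principle is the same: low-energy conformations get visited in proportion to their Boltzmann weight.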
If you want something that leaves a little less to the imagination, check out https://en.wikipedia.org/wiki/Staphylococcus_aureus_alpha_to... . It looks just like what it does: drill a giant hole in cell membranes.
Some proteins have 3D structures that look like abstract art only because we don't have an intuitive understanding of what shape and amino acids are necessary to convert chemical A to chemical B, which is the main purpose of many enzymes in the body. If you look at structural proteins or motor proteins, on the other hand, their function is clear from their shape.
There are a lot of other things you can do with the shape. If it has a pore, you can estimate the size and type of small molecule that could travel through it. You can estimate whether a binding site is accessible to the environment around it. You can determine if it forms a multimer or exists as a single unit. You can see if protein A and protein B have drastically different shapes given similar sequences, which might have implications for its druggability or understanding its function.
https://alphafold.ebi.ac.uk/entry/W6KDG8
The ribbon shape for GFP is a very cool barrel thing
> Are people just browsing these shapes and seeing patterns in them
That's one approach.
The thing to understand is that proteins form "binding sites": areas that are more likely to attract other particular regions of proteins or other molecules, or even atoms. Think about hemoglobin. The reason it holds onto oxygen is that it has binding sites.
Binding sites are great because they represent more freedom to do things than molecules typically have. Normal chemistry consists of forming strong electronic bonds between atoms, or forming rigid lattices/crystals.
Binding sites allow molecules to do things like temporarily attach to each other and let each other go under certain circumstances, for instance when another binding site is active/inactive. This can happen through "conformation change", where a molecule bound/unbound on some binding site makes the protein change shape slightly. This is how proteins can act like machines.
> What do the "ribbons" signify
Different regions of the protein have different sequences of amino acids. Amino acids have somewhat different shapes from each other. The ribbons are actually broader than the spindles (or threads), and less flexible. Not sure about the different colors, maybe someone else can fill in.
> Also, is that what proteins would really look like if you could see at sub-optical wavelength resolutions?
Not really, it's an abstraction. They're big molecules, so if you look closely they're made of atoms, which are (kinda, sorta not really, quantum stuff) spherical.
> So wouldn't they be "thrashing around" like rag doll in blender at room temperature?
Yes, but the attractions between the different parts of the molecule keep it somewhat under control. So more like an undulating little creature, a jellyfish perhaps.
> It seems strange to me that something like that could be so central to life
Yep, gotta remember that it's all statistical. These things are getting made, doing their job, breaking, and getting degraded some insane number of times per second. Swarm behavior, sort of.
Short answer is that the ribbon representation is a visual simplification based on known structures -- they are actually composed of atoms.
They certainly do "thrash around", but that thrashing is constrained by the bonds that are formed, which greatly limits the degrees of freedom. Here's a short video of a simulation to demonstrate:
https://www.youtube.com/watch?v=fggqPtaZj8g
I've been going through MIT's online Introduction to Biology course[0] that answers some of your questions here with regards to the shapes and what they signify - specifically the "Proteins and Protein Structure" lessons in the second unit, although some of the previous lectures are helpful setup as well - really interesting and engaging stuff, taught by Eric Lander (who ended up being one of the CRISPR pioneers featured in Isaacson's latest book)
[0]https://learning.edx.org/course/course-v1:MITx+7.00x+2T2022/...
That's cool, I just happened to have picked up a used copy of the text on which the course is based... "Molecular Biology of the Cell" -- the huge grey book. Geez, there's a lot of material in there!
Back in the day, I had steered away from chemistry in college because I didn't like to memorize stuff. Now I realize I missed out on some amazing knowledge.
> I recall from school the equipartition theorem-- 1/2 KT of kinetic energy for each degree of freedom. These things obviously have many degrees of freedom. So wouldn't they be "thrashing around" like rag doll in a blender at room temperature?
It's funny you say that, because the first image on the English Wikipedia page for Equipartition Theorem[1] is an animation of the thermal motion of a peptide.
[1]: https://en.wikipedia.org/wiki/Equipartition_theorem
BTW, in terms of protein dynamics, before you even think about the thrashing around: thermal energy at room temperature (kT, roughly 0.6 kcal/mol) is comparable to the strength of a hydrogen bond (around 1-2 kcal/mol), so presumably protein H-bonds are breaking and reforming spontaneously at scale.
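The arithmetic behind that comparison, for anyone who wants to check it:

    k_B T = (1.381 \times 10^{-23}\,\mathrm{J/K})(298\,\mathrm{K})
          \approx 4.1 \times 10^{-21}\,\mathrm{J}
    \quad\Rightarrow\quad
    N_A k_B T \approx 2.5\,\mathrm{kJ/mol} \approx 0.6\,\mathrm{kcal/mol}

which is indeed the same order of magnitude as the 1-2 kcal/mol of a typical protein hydrogen bond.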
Your "now what?" question is legitimate and reminiscent of reactions after the completion of the Human Genome Project.
Just like having a human genome sequence, this is not a magic key that solves all problems of biology but a building block for use by researchers. An investigator may look up the folded structure of a protein and use that information to glean certain context-specific insights from it such as how exactly two interacting proteins interact mechanically.
The other significant benefit is that this frees up resources that were spent having to figure out the structure in other ways. It's an efficiency improvement.
Watch this video on DNA polymerase [1]. Obviously it’s an illustration, but I think it helps answer your question because cartoons are great. (MD, not PhD biologist)
[1] https://youtu.be/sKe3UgH1AKg
The ability for another molecule (probably another protein) to "react" or interact with the protein depends not only on the chemistry but also the shape. An otherwise compatible sequence of atoms might not be able to react because it and the binding site are just incompatibly shaped.
This is hugely important for developing drugs and vaccines.
To see the effect of this, look no further than prions. Prions are the exact same protein folded in a weird way. Worse, they can "transmit" this misfolded shape to otherwise normal proteins. Prions behave differently just because of the different shape and can lead to disease. This is exactly what Mad Cow Disease (BSE) is.
What we get taught in high school about chemistry is incredibly oversimplified.
One example of this I like is the geometry of a water molecule. When we first learn about atoms, we learn the "solar system" model (aka Bohr). The reality is instead that we have 3D probability distributions of where electrons might be. These clouds come in pairs. I believe this is to do with the wavefunction, but really we're getting beyond my knowledge of quantum mechanics here, so that's just a guess.
Well, those clouds additionally form valence shells. We learn about these and how atoms want to form complete valence shells. So oxygen has 8 electrons, i.e. 4 pairs of electrons. When bonding with 2 hydrogen atoms, we end up with a weird geometry of ~104.5 degrees between the two hydrogen atoms because of how these pairs interact. The naive assumption might be that the two hydrogen atoms are 180 degrees apart.
So back to proteins: you may have learned about hydrogen bonds. These affect molecular shape because when a hydrogen atom shares an electron, it often carries a positive charge. That positive charge pushes away other positive charges. This is the real difficulty in protein folding, because with a molecule of thousands of atoms and weird geometry, you may find distant parts of the molecule interacting through hydrogen bonds.
So a single cell consists of thousands (IIRC) of different proteins. Figuring out those interactions is important but incredibly difficult.
In addition to /u/dekhn 's excellent description, this phenomenon is referred to as a protein's "secondary structure" [0]
[0] https://en.m.wikipedia.org/wiki/Protein_secondary_structure
This is probably one of the best applications of AI in science in terms of impact so far. I can't think of any other problem with the same potential impact.
EDIT: grammar
AlphaFold is the best counterpoint to tech cynics.
One of the largest public tech companies in the world funded a multi-year scientific project, executed the research flawlessly and moved forward an entire scientific field. They then went on to openly release the code _and_ data, working with a publicly funded organization (EMBL-EBI) to ensure researchers across the globe can easily access the outputs.
I'm not arguing that every tech company is a net positive for humanity. Google itself isn't perfect. Google + DeepMind is setting a bloody high bar though.
Elon Musk moved EVs to the mainstream. Starlink. Has the vision to go to Mars.
Amazon basically put malls, which are hugely environmentally destructive, out of business.
Bill Gates is doing stuff too I think.
Big tech does some good things.
This is definitely one of the most exciting spaces in AI right now. Another somewhat-related startup is PostEra (medicinal chemistry for drug discovery via AI) https://postera.ai/about/
You are right, and thinking about it, I can see two problem areas which I hope will have even more impact in the future:
1. Using AI to determine the most efficient methods of computing mathematical expressions, transformations and algorithms - division, square root, maybe traveling salesman - the ones which take a relatively high number of CPU cycles and are used everywhere. If inputs and outputs can be assigned to it, AI can eventually build a transformation which can be reproduced in silicon.
2. Physical phenomena in general, not only protein folding, can be measured, and with a sufficient ability to reduce them to inputs and experimentally obtained outputs to train a network, we could in theory establish new formulas or constants and progress the understanding of the Universe.
The groundwork, at least partially, was happening as you typed this: https://www.nature.com/articles/d41586-021-01627-2
AI translation has probably had a bigger worldwide impact so far.
jarenmf said "in science" - but it is an interesting question how much automated translation has helped scientists translate papers from other languages.
Can someone put AlphaFold's problem space into perspective for me?
Why is protein folding important? Theoretical importance? Can we do something with protein folding knowledge? If so, what?
I've been hearing about AlphaFold from the CS side. There they seem to focus on protein folding primarily as an interesting space to apply their CS efforts.
If we knew:
(a) the structure of every protein (what DeepMind is doing here)
(b) how different protein structures interact (i.e. protein complexes - DeepMind is working on this but not there yet)
Then we could use those two building blocks to design new proteins (drugs) that do what we want. If we solve those two problems with very high accuracy, we can also reduce the time it takes to go from starting a drug discovery programme to approved medicine.
Obtaining all protein structures and determining how they interact is a key step towards making biology more predictable. Previously, solving the structure of a protein was very time consuming. As a result, we didn’t know the structure for a majority of proteins. Now that it’s much faster, downstream research can move faster.
Caveat: we should remember that these are all computational predictions. AlphaFold’s predictions can be wrong and protein structures will still need to be validated. Having said that, lots of validation has already occurred and confidence in the predictions grows with every new iteration of AlphaFold.
> Then we could use those two building blocks to design new proteins (drugs) that do what we want. If we solve those two problems with very high accuracy, we can also reduce the time it takes to go from starting a drug discovery programme to approved medicine.
Drugs are usually not proteins, but rather small molecules designed to help or interfere with the operation of proteins.
How are the predictions validated? Waiting for the old fashioned way for... very difficult crystal structure experiments? Or something else?
You are basically made of proteins, which are folded sequences of amino acids; proteins are molecular machines that are the fundamental building blocks of animals, plants, bacteria, fungi, viruses, etc.
So yeah the applications are enormous, from medicine to better industrial chemical processes, from warfare to food manufacturing.
> proteins are molecular machines
Does that imply proteins have some dynamics that need to be predicted too? I remember seeing animations of molecular machines that appeared to be "walking" inside the body - are those proteins or more complex structures?
As others have already mentioned, proteins are the machinery of the cell. They perform an immense array of functions and they must fold in a certain way to perform these functions. This is part of what's known as the structure-function relationship.
Misfolded proteins are contributors to numerous pathological conditions and the more we can understand about how and why this folding happens, the better we can treat these conditions.
Another aspect is that while we can at least partially determine the primary structure (the amino acid sequence) of proteins from DNA and RNA, we don't necessarily know their secondary or tertiary structures (3 dimensional conformation). This is a key piece of the puzzle for figuring out how these proteins do their proteiny things and how they interact with other proteins and even how they form quaternary structures with other proteins (an assembly of multiple proteins that perform some function, many pores are assemblies like this). Once we know these structures and understand how they work on a structural and chemical level, we can manipulate them far more easily.
In order to do rational drug design, which is designing a drug for a specific target or active site on a protein, we need to understand these structures. Working to solve protein folding is a key step in treating disease states and understanding how cells work on a fundamental level. The impact is hard to overstate.
My understanding is that protein folding is a major cost bottleneck in drug design.
Researchers can come up with candidate molecule formulas that might work as good drugs, but the problem is that these proteins organize/fold themselves physically in a hard-to-predict way. And how they fold directly affects their properties as drugs.
If AlphaFold can accurately predict folding, it’ll allow researchers to prioritize drug candidates more accurately which will reduce research time and costs. Supposedly the major pharmaceutical companies can spend up to billions when designing a single drug. Optimistically, predicting protein folding better will allow for much more rapid and cheaper drug development
I love AlphaFold, but this is a big misconception. The biggest cost bottleneck in drug development and design, by orders of magnitude, is associated with assaying (and potentially reducing) off-target binding or toxicity and assaying (and potentially increasing) efficacy. Determining a protein structure empirically with cryoEM, NMR, or crystallography will generally cost less than $1M (sometimes far less), which is tiny compared to the many millions or billions of dollars that get poured into clinical trials for a single drug. AF2 is useful in some basic research cases but isn't really that useful for traditional drug design and development.
A machine learning approach for predicting toxicity would have a far greater impact on public health than AF2 does.
My understanding is that protein folding is not a bottleneck in drug design.
Yes, once you identified a target protein, its structure is useful to selectively target it. But the main bottleneck is identifying such targets. In other words, the main difficulty is to figure out what to hit, not how to hit it, and protein folding mostly helps with how at the moment.
Proteins are what makes everything in a cell work. They are produced as a "linear" structure that must fold into a proper shape to execute its function, such as acting as a pore that only lets a specific chemical through the cell membrane.
The importance here is to figure out potential targets for treatments that take into account particularities of certain proteins. That could produce better drugs with fewer side effects.
The genome, all of our DNA combined, is just a bunch of 1D strings like "cgtattctgcttgta". Those strings encode proteins, which fold up into a 3D shape once created. This 3D shape is what determines what the protein actually does inside the cell. Without understanding protein folding we don't understand what the DNA actually does.
This might be an interesting resource for you: https://pdb101.rcsb.org/
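To make the "1D strings encode proteins" point concrete, here's a minimal Python sketch of translation (the codon table is truncated to a handful of the real 64 entries):

    # Read a DNA string three letters (one codon) at a time; each codon maps
    # to one amino acid letter. Tiny subset of the real 64-entry codon table.
    CODONS = {"atg": "M", "ttt": "F", "ggc": "G", "tgg": "W", "taa": "*"}  # * = stop

    def translate(dna):
        protein = []
        for i in range(0, len(dna) - 2, 3):
            aa = CODONS.get(dna[i:i+3], "?")
            if aa == "*":
                break
            protein.append(aa)
        return "".join(protein)

    print(translate("atgtttggctggtaa"))  # -> "MFGW"

That output string of amino acids is the "linear" protein; the folding into 3D is the part AlphaFold predicts.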
The applications and importance have been discussed, but let me explain why what we are doing right now does not work, which will also emphasize the importance of this.
At this time, we create drugs, test them on animals, and see what the side effects and results actually are. We are very limited in our capabilities and basically throw mud at the wall and see what sticks. This would allow us to try potential drug candidates without so much randomness.
There are a million articles and podcasts explaining exactly your question. Those will be better than HN responses. I suggest you take 15 seconds to Google it.
Yes there are a million articles. That is why asking a question here on HN is useful. The HN community more often than not offers intelligent insight as well as curated recommended links for learning more about a topic. Yes, the signal-to-noise ratio isn't perfect on HN, but it is a lot better than random Google searches. If nothing else, it often leads to enough basic understanding so that someone can then perform more refined and therefore more productive Google searches. I appreciate the basic questions and the generous nature of many HN commenters who offer thoughtful responses.
The endgame of protein folding is nanotechnology. Life is also nanotechnology. So basically the end game is to take full control of life: change existing life forms, design new life forms, new ecosystems. Eat the earth first, then the universe. Not yet there, but getting much closer.
Proteins are the molecular machines of all living beings on this planet. They do almost everything. We need to understand their structure and how that structure changes when they perform some function. Essentially we have to reverse engineer nature's machines.
Biology has traditionally studied these things from the outside in, top down. Advances in biology are correlated with advances in microscope technology. We went from seeing with the naked eye to being able to magnify tissue to see cells and even what's inside of them.
Studying biology from the inside out, bottom up, is significantly harder. You have chemistry, which is hard science. On top of that you have biochemistry, which is equally rigorous. Above that level of abstraction, things start to get a lot less well defined. We go from organic chemistry (carbon, hydrogen, oxygen, etc.) to more complex things found in life (amino acids, carbohydrates, lipids) to polymers of those things (proteins, polysaccharides, triglycerides) to a point where they get so big and complex that their three-dimensional structure starts influencing how they interact with other molecules, and then we start seeing aggregates (quaternary proteins) of these complex things!
Important proteins like ATP synthase are well understood (it's a friggin' turbine like in a hydroelectric power plant!) but there are so many others that haven't received the same attention. Bioinformatics technology is changing that. The human genome's been mapped. Now it's time to map the proteome.
https://en.m.wikipedia.org/wiki/Protein_folding
AlphaFold is a phenomenal tool that demonstrates how AI can already outclass humans for certain tasks. It is a prime example of a problem space where conventional approaches are simply inferior, and that AI is not just a fancy name but can be extraordinarily powerful.
This reads like something created by an AI. And the mere fact that I can't be sure whether it is or not is testament to the statement it's making.
Just for shits & giggles.
Prompt:
Write a short abstract for an article on why AlphaFold isn't just a tremendous technological achievement, but an example of how Artificial Intelligence will allow humanity to continue to achieve technological progress in the future:
Response:
When DeepMind’s AlphaFold was announced as the winner of the CASP13 protein folding prediction competition, it was hailed as a major breakthrough in both artificial intelligence (AI) and biology.
While AlphaFold is undoubtedly a remarkable achievement, it is also a clear example of how AI can be used to solve complex problems that have traditionally been seen as beyond the reach of machines.
In particular, AlphaFold represents a significant advance in the ability of AI systems to learn from limited data. Unlike previous AI approaches to protein folding, which relied on large datasets of known protein structures, AlphaFold was able to produce accurate predictions with only a small number of known protein structures to work from.
This ability to learn from limited data is likely to be increasingly important as we face problems that are too complex for us to fully understand and collect all the data we would need to solve them using traditional methods.
In the future, AI systems like AlphaFold will become increasingly important in helping us to make progress on a wide range of pressing challenges, from developing new medicines to tackling climate change.
It's a new tool, AlphaFool
It's the superstition and the number of filler words.
> demonstrates how AI can already outclass humans for certain tasks
I'm not sure how clear the edge over humans in this case is. There were some attempts at machine assisted human solving like Foldit that did produce results: https://en.wikipedia.org/wiki/Foldit#Accomplishments
Many thanks to DeepMind for releasing predicted structures of all known protein monomers. What I'd like next is for AlphaFold (or some other software) to be able to show us multimeric structures based on the single monomer/subunit predictions and protein-protein interactions (i.e. docking). For example, the one I helped work on back in my structural biology days was the circadian clock protein KaiC: https://www.rcsb.org/structure/2GBL - that's the "complete" hexameric structure that shows how each of the subunits pack. The prediction for the single monomer that forms a hexamer is very close to the experimental one (https://alphafold.ebi.ac.uk/entry/Q79PF4) and in fact shows the correct structure of AA residues 500-519, which we were never able to validate until 12 years later (https://www.rcsb.org/structure/5C5E), when we expressed those residues along with another protein called KaiA, which we knew binds to the "top" CII terminal (AAs 497-519) of KaiC. If we had had this data then, it would have allowed us to not only make better predictions about biological function and protein-protein interactions but would have helped better guide future experiments.
What we can do with this data now is use methods such as cryo-em to see the "big picture", i.e. multi-subunit protein-protein interactions where we can plug in the Alphafold predicted structure into the cryo-em 3d density map and get predicted angstrom level views of what's happening without necessarily having to resort to slower methods such as NMR or x-ray crystallography to elucidate macromolecular interactions.
A small gripe about the AlphaFold EBI website: it doesn't seem to show the known experimental structure; it just shows "Experimental structures: None available in PDB". For example, the link to the AlphaFold structure above should link to 2GBL, 1TF7, or any of the other KaiC structures from organism PCC7942 at RCSB. This would require merging/mapping data from RCSB with EBI and at least doing some string matching; hopefully they're working on it!
You might be interested in https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2
Obtaining this dataset prior to alphafold would have cost on the order of $200 trillion. https://twitter.com/wintonARK/status/1552653527670857729
Anyone knowledgeable know if this estimate is accurate? Insane if true
It's impossible to really put a number on it, because the task itself was impossible. PhDs and the field's top scientists simply couldn't figure out many complicated protein structures after years of attempts, and the fact that there are so many (200M+) means that the problem space is vast.
It doesn't make any sense on multiple levels. This is a computational prediction and there was no computational alternative: many of these proteins would never have had a structure solved even if you had spent the money. They are just taking $cost_per_structure_solved * number_of_remaining_structures and assuming that things scale linearly.
Note that crystallographers are now using these predictions to bootstrap models of proteins they've struggled to work with, which indicates the level of trust in the structural community for these predictions is pretty high.
Even if that's exaggerated, it might have taken significant time to reach this stage. Probably on the order of >50 years.
Off the top of my head:
(200 trillion cost) / (200 million structures predicted) = 1 million per structure.
That reflects the personnel cost (5-year PhD scholarship, postdoc/professor mentorship; investment + depreciation for the lab equipment). All this to crystallize one structure and characterize its folding behavior.
I don't know if this calculation is too simplistic, just coming up with something.
How do they know their structures are correct?
Disclaimer: I work in Google, organizationally far away from Deep Mind and my PhD is in something very unrelated.
They can't possibly know that. What they know is that their guesses are very significantly better than the previous best, and that they could do this for the widest range in history. Now, verifying the guess for a single protein (of the hundreds of millions in the DB) is up to two years of expensive project work. Inevitably some will show discrepancies. These will be fed back into training, giving us a new generation of even better guesses at some point in the future. That's what I believe to be standard operating practice.
A more important question is: is today's db good enough to be a breakthrough for something useful, e.g. pharma or agriculture? I have no intuition here, but the reporting claims it will be.
The press release reads like an absurdity. It's not the "protein universe", it's the "list of presumed globular proteins Google found and some inferences about their structure as given by their AI platform".
Proteins don't exist as crystals in a vacuum, that's just how humans solved the structure. Many of the non-globular proteins were solved using sequence manipulation or other tricks to get them to crystallize. Virtually all proteins exist to have their structures interact dynamically with the environment.
Google is simply supplying a list of what it presumes to be low RMSD models based on their tooling, for some sequences they found, and the tooling is based itself on data mostly from X-ray studies that may or may not have errors. Heck, we've barely even sequenced most of the DNA on this planet, and with methods like alternative splicing the transcriptome and hence proteome has to be many orders of magnitude larger than what we have knowledge of.
But sure, Google has solved the structure of the "protein universe", whatever that is.
Same as any other prediction I'd presume. Run it against a known protein and see how the answer lines up. Predict the structure of an unknown protein, then use traditional methods (x-ray crystallography, maybe STEM, etc) to verify.
As a simple example, one measure used to compare a predicted structure against a reference is the RMSD (root mean square deviation).
https://en.m.wikipedia.org/wiki/Root-mean-square_deviation_o...
The lower the RMSD between two structures, the better (up to some limit).
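A minimal sketch of the computation, assuming the two structures have already been superimposed (real comparisons usually do an optimal alignment first, e.g. with the Kabsch algorithm):

    import numpy as np

    def rmsd(coords_a, coords_b):
        # Root mean square deviation between two (N, 3) coordinate arrays,
        # assumed to be already superimposed on each other.
        diff = coords_a - coords_b
        return np.sqrt((diff ** 2).sum(axis=1).mean())

    # Toy example: a 4-atom fragment uniformly displaced by 0.5 along x.
    a = np.array([[0.0, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [4.5, 0, 0]])
    b = a + np.array([0.5, 0.0, 0.0])
    print(rmsd(a, b))  # -> 0.5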
"Verify" is almost correct. The crystallography data is taken to be "ground truth" and the predicted protein structure from AlphaFold is taken to be a good guess starting point. Then other software can produce a model that is a best fit to the ground truth data starting from the good guess. So even if the guess is wrong in detail it's still useful to reduce the search space.
This is exactly right.
This video goes some way to explaining how they know the structures are correct: https://www.youtube.com/watch?v=vXZzftX03VY
This is the right line of questioning.
As we gain visibility into the complex coding of proteins, we need to be right. Next, hopefully, comes causal effect identification, then the ability to construct.
If medicine can use broad capacity to create bespoke proteins, our world becomes both weird and wonderful.
They don't, but they are more correct than what others have predicted. Some of their predictions can be compared with structures determined with x-ray crystallography.
Did they come up with their structures independently of the x-ray crystallography, or was that part of an ML dataset for predicting structure?
They won a decades-old standing challenge by predicting the protein structures of a much smaller (yet still significantly large) set of proteins using a model (AlphaFold).
Then they use the model to predict more.
Although we don't know if they are correct, these structures are the best (or the least bad) we have for now.
We know the structure of some proteins. It's not that it's impossible to measure, it's just very expensive. This is why having a model that can "predict" it is so useful.
They compare the predicted structure (computed) to a known structure (physical x-ray crystallography). There's a biennial competition, CASP (Critical Assessment of protein Structure Prediction), in which the structure of a protein is determined by x-ray crystallography but held secret by the organizers. Then research teams across the world present their models and attempt to predict, without advance knowledge, the structure of the protein from its amino acid sequence. Think of CASP as a validation data set used to evaluate a machine learning model.
DeepMind crushes everyone else at this competition.
The worry is about dataset shift. Previously, the data were collected for a few hundred thousand structures; now it is 200M. I think there could be doubts about the distributions and how that could play a role in prediction accuracy.
>we’re now releasing predicted structures for nearly all catalogued proteins known to science
Is the result that researchers will now much more quickly 'manually' validate or invalidate the predicted structures for proteins they are working with? I understand it is traditionally a long and complex process, but I imagine it is expedited by having a predicted structure to test as the baseline?
Demis and John will probably win either the Chemistry or Physics Nobel Prize in the next couple of years.
Some people are using "AI wins a Nobel price" as the new Turing test. Maybe that is going to happen sooner than they expect. Or maybe the owners of the AI will always claim it on its behalf.
There's no AI here. This is just ML. All DeepMind did here was use multiple excellent resources - large numbers of protein sequences, and small numbers of protein structures - to create an approximation function of protein structure, without any deep understanding of the "why". Interestingly, the technology they used to do this didn't exist 5 years ago!
Today I learned that there are bacteria that have a protein helping to form ice on plants [1] to destroy them and extract nutrients (however I didn't understand how bacteria themselves survive this).
Machine learning typically uses existing data to predict new data. Please explain: Does it mean that AlphaFold can only use known types of interactions between atoms and will mispredict the structure of proteins that use not yet known interactions?
And why we cannot just simulate protein behaviour and interactions using quantum mechanics?
[1] https://pubs.acs.org/doi/10.1021/acs.jpcb.1c09342
>And why we cannot just simulate protein behaviour and interactions using quantum mechanics?
If you wanted to simulate the behaviour of an entire protein using quantum mechanics, the sheer number of calculations required would be infeasible.
For what it's worth, I have a background in computational physics and am studying a PhD in structural biology. For any system (of any size) that you want to simulate, you have to consider how much information you're willing to 'ignore' in order to focus on the information you would like to 'get out' of a set of simulations. Being aware of the approximations you make and how this impacts your results is crucial.
For example, if I am interested in how the electrons of a group of Carbon atoms (radius ~ 170 picometres) behave, I may want to use Density Functional Theory (DFT), a quantum mechanical method.
For a single, small protein (e.g. ubiquitin, radius ~ 2 nanometres), I may want to use atomistic molecular dynamics (AMD), which models the motion of every single atom in response to thermal motion, electrostatic interactions, etc using Newton's 2nd law. Electron/proton detail has been approximated away to focus on overall atomic motion.
In my line of work, we are interested in how big proteins (e.g. the dynein motor protein, ~ 40 nanometres in length) move around and interact with other proteins at longer time (micro- to millisecond) and length (nano- to micrometre) scales than DFT or AMD. We 'coarse-grain' protein structures by representing groups of atoms as tetrahedra in a continuous mesh (continuum mechanics). We approximate away atomic detail to focus on long-term motion of the whole protein.
Clearly, it's not feasible to calculate the movement of dynein for hundreds of nanoseconds using DFT! The motor domain alone in dynein contains roughly one million atoms (and it has several more 'subunits' attached to it). Assuming these are mostly carbon, oxygen or nitrogen, you're looking at around ten million electrons in your DFT calculations, for a single step in time (rounding up). If you're dealing with the level of atomic bonds, you're probably going to use time steps between a femtosecond (10^-15 s) and a picosecond (10^-12 s). The numbers get a bit ridiculous. There are techniques that combine QM and AMD, although I am not too knowledgeable in this area.
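A back-of-envelope version of "the numbers get a bit ridiculous" (all inputs are the rough assumptions from the paragraph above, not measurements):

    n_atoms = 1_000_000     # dynein motor domain alone, per the estimate above
    timestep_s = 2e-15      # a typical 2 fs all-atom MD timestep
    sim_time_s = 100e-9     # a "short" 100 ns trajectory

    n_steps = sim_time_s / timestep_s
    print(f"{n_steps:.0e} timesteps")               # 5e+07
    print(f"{n_steps * n_atoms:.1e} atom-updates")  # 5.0e+13, before any QM

And that's classical force evaluations only; putting DFT-level electronic structure inside each of those steps is what makes it truly intractable.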
Some further reading, if you're interested (I find Wikipedia articles on these topics to generally be quite good):
DFT: https://en.wikipedia.org/wiki/Density_functional_theory
Biological continuum mechanics: https://doi.org/10.1371/journal.pcbi.1005897
Length scales in biological simulations: https://doi.org/10.1107/S1399004714026777
Electronic time scales: https://www.pnas.org/doi/10.1073/pnas.0601855103
To add to this comment (from someone who used to engineer proteins, and long ago DFT as well): DFT is only really decent at ground state predictions, computational chemists often have to resort to even more expensive methods to capture "chemistry", i.e. correlated electron-pair physics and higher-state details. Simulating catalysis is extremely challenging!
> And why we cannot just simulate protein behaviour and interactions using quantum mechanics?
QM calculations have been done in proteins, but they’re computationally very expensive. IIRC, there are hybrid approaches where only a small portion of interest in the protein structure is modelled by QM and the rest by classical molecular mechanics.
This is an incredible gift to humanity. A huge positive impact. The team should be proud
The press release is a bit difficult to place into historical context. I believe that the first AlphaFold release was mostly human and mouse proteins, and this press release marks the release of structures for additional species.
> I believe that the first AlphaFold release was mostly human and mouse proteins,
More than that. The press release actually contains an infographic comparing the number of published protein models for different clades of organisms. The infographic shows that the previous release (~1 million proteins) contained proteins of some animal, plant, bacterial, and fungal species.
A fun way I've been thinking about all this is what nanotech/nanobots are actually going to look like. Tiny little protein machines doing what they've been doing since the dawn of life. We now have a library of components, and as we start figuring out what they can do, and how to stack them, we can start building truly complex machinery for whatever crazy tasks we can imagine. The impact goes so far beyond drugs and treatments.
I had a dream about this a few days ago. About complexly wrinkled/crumpled/convolved things.
Like a fresh crepe stuffed into the toe of a boot. Bewilderingly complex.
But I have a question. Does such contortion work for 3d "membranes" in a 4d space? It's something I'm chewing on. Hard to casually visualize, obviously.
Of course! The term you might wanna start off googling is "curvature of manifolds". For a concrete example, the 3-sphere x^2 + y^2 + z^2 + w^2 = r^2 is precisely a 3d "membrane" curving in 4d space. What's even neater is that these notions can be made precise without the "in [whatever] space" part at all (see "intrinsic curvature" and "Riemannian manifold").
Thank you very much.
As an aside, the protein structure visualizations in the article are pretty. Is there a good source for more?
* https://pdb101.rcsb.org/motm/
* https://ccsb.scripps.edu/goodsell/
* https://pdb101.rcsb.org/sci-art/geis-archive/irving-geis
* https://www.digizyme.com/portfolio.html
* https://www.drewberry.com/
* https://biochem.web.utah.edu/iwasa/projects.html
* http://onemicron.com/
* The art of Jane Richardson, of which I couldn’t find a link
* This blog has plenty of good links: https://blogs.oregonstate.edu/psquared/
https://alphafold.ebi.ac.uk/
I haven’t had a chance to look through the new predictions yet, but I know there were some issues with predicting the structure of membrane-bound proteins previously. The PDB hardly contains any.
Does the new set of predictions contain a bunch of membrane-bound proteins?
Come play biotech with us and let's figure out EVERYTHING and not just protein folding, yay! https://epicquest.bio
Now we can start guessing which futures they are betting on: the ones in which open-sourcing the whole thing commoditises critical complements.
---
https://www.gwern.net/Complement
The many-body problem remains unsolved. So the question is: is this approach useful?
Is folding@home obsolete now?
Folding@home answers a related but different question. While AlphaFold returns the picture of a folded protein in its most energetically stable conformation, Folding@home returns a video of the protein undergoing folding, traversing its energy landscape.
It's not, but the question is (and has long been) whether the energy expended by folding@home is worth the scientific result. IMHO, probably not.
I would say no; the two approaches can be used to validate each other.
Good question… I’d imagine that other methods of determining folds are still valuable, because AlphaFold's predictions need to be checked.
How do you know that the predicted structure will be correct? I presume researchers will need to validate the structure empirically. Do we know how good the model has been at predicting so far?
Just imagine if the tech world put all programmatic advertising development on hold for a year and channelled the collective brainpower to science instead…
Does anyone know what it would cost to download this whole dataset? Google Cloud Datasets only allows 1 TB/month of free downloads, I believe.
To answer my own question: for folks who don’t want to wait ~21 months for ~21 terabytes at the free rate, it looks like it would cost approximately 1,600 USD to download the full ~20 TB dataset, assuming egress costs of 0.08 USD per GB (20,000 GB × 0.08 USD/GB ≈ 1,600 USD), as mentioned here: https://cloud.google.com/storage/pricing#network-egress It’s a pity it’s so expensive to download.
> Today, I’m incredibly excited to share the next stage of this journey. In partnership with EMBL’s European Bioinformatics Institute (EMBL-EBI), we’re now releasing predicted structures for nearly all catalogued proteins known to science, which will expand the AlphaFold DB by over 200x - from nearly 1 million structures to over 200 million structures - with the potential to dramatically increase our understanding of biology.
And later:
> Today’s update means that most pages on the main protein database UniProt will come with a predicted structure. All 200+ million structures will also be available for bulk download via Google Cloud Public Datasets, making AlphaFold even more accessible to scientists around the world.
This is the actual announcement.
UniProt is a large database of protein structure and function. The inclusion of the predicted structures alongside the experimental data makes it easier to include the predictions in workflows already set up to work with the other experimental and computed properties.
It's not completely clear from the article whether any of the 200+ million predicted structures deposited to UniProt have not been previously released.
Protein structure determines function. Before AlphaFold, experimental structure determination was the only option, and that's very costly. AlphaFold's predictions appear to be good enough to jumpstart investigations without an experimental structure determination. That has the potential to accelerate many areas of science and could percolate up to therapeutics.
One area that doesn't get much discussion in the press is the difference between solid-state structure and solution-state structure. It's possible to obtain a solid-state structure determination (x-ray) that has nothing to do with actual behavior in solution. Given that AlphaFold was trained to a large extent on solid-state structures, it could be propagating that bias into its predicted structures.
This paper talks about that:
> In the recent Critical Assessment of Structure Prediction (CASP) competition, AlphaFold2 performed outstandingly. Its worst predictions were for nuclear magnetic resonance (NMR) structures, which has two alternative explanations: either the NMR structures were poor, implying that AlphaFold may be more accurate than NMR, or there is a genuine difference between crystal and solution structures. Here, we use the program Accuracy of NMR Structures Using RCI and Rigidity (ANSURR), which measures the accuracy of solution structures, and show that one of the NMR structures was indeed poor. We then compare AlphaFold predictions to NMR structures and show that AlphaFold tends to be more accurate than NMR ensembles. There are, however, some cases where the NMR ensembles are more accurate. These tend to be dynamic structures, where AlphaFold had low confidence. We suggest that AlphaFold could be used as the model for NMR-structure refinements and that AlphaFold structures validated by ANSURR may require no further refinement.
https://pubmed.ncbi.nlm.nih.gov/35537451/
> Before AlphaFold, experimental structure determination was the only option
Other computational methods have existed for a long time. Folding@home was founded 22 years ago.
folding@home doesn't predict structures; it simulates protein folding. Different area, with some overlap.