Comment by bifftastic

3 years ago

How do they know their structures are correct?


bifftastic

Disclaimer: I work at Google, organizationally far away from DeepMind, and my PhD is in something entirely unrelated.

They can't possibly know that. What they know is that their guesses are significantly better than the previous best, and that they could make them for the widest range of proteins in history. Verifying the guess for a single protein (out of the hundreds of millions in the db) can be an expensive project of up to two years. Inevitably some will show discrepancies. These will be fed back into the learning process, giving us a new generation of even better guesses at some point in the future. That's what I believe to be standard operating practice.

A more important question is: is today's db good enough to be a breakthrough for something useful, e.g. pharma or agriculture? I have no intuition here, but the reporting claims it will be.

f38zf5vdt 3 years ago
The press release reads like an absurdity. It's not the "protein universe", it's the "list of presumed globular proteins Google found and some inferences about their structure as given by their AI platform".
Proteins don't exist as crystals in a vacuum, that's just how humans solved the structure. Many of the non-globular proteins were solved using sequence manipulation or other tricks to get them to crystallize. Virtually all proteins exist to have their structures interact dynamically with the environment.
Google is simply supplying a list of what it presumes to be low-RMSD models based on their tooling, for some sequences they found, and the tooling is itself based on data mostly from X-ray studies that may or may not have errors. Heck, we've barely even sequenced most of the DNA on this planet, and with mechanisms like alternative splicing the transcriptome, and hence the proteome, has to be many orders of magnitude larger than what we have knowledge of.
But sure, Google has solved the structure of the "protein universe", whatever that is.
- dekhn 3 years ago
  
  People have been making grand statements about the structure of the protein universe for quite some time (I've seen a fair number of papers on this, such as https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... and https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... from a previous collaborator of mine).
  Google didn't solve the structure of the protein universe (thank you for saying that). But the idea of the protein structure universe is fairly simple: it's a latent space that allows for direct movement, along orthogonal directions, over what are presumably the rules of protein structure. It would encompass all the "rules" in a fairly compact and elegant way. Presumably, superfamilies would automagically cluster in this space, and proteins in different superfamilies would not.
- lrem 3 years ago
  
  I recognize your superior knowledge in the topic and assume you're right.
  But you also ignore where we're at in the standard cycle:
  https://phdcomics.com/comics/archive_print.php?comicid=1174
  ;)
  
- gilleain 3 years ago
  
  edit: I should have read the post first! What do you mean 'only globular proteins'? They say they have predictions for all of UniProt...
  ---------------
  Yes, the idea of a 'protein universe' seems like it should at least encompass 'fold space'.
  For example, WR Taylor : https://pubmed.ncbi.nlm.nih.gov/11948354/
  I think the rough estimate was that there were around 1000 folds - depending on how fine-grained you want to go.
  Absolutely agree, though, that a lot of proteins are hard to crystallise (as I understand it), either because they are trans-membrane or simply because of the difficulty of getting the right parameters for the experiment.
  

luma 3 years ago

Same as any other prediction, I'd presume. Run it against a known protein and see how the answer lines up. Predict the structure of an unknown protein, then use traditional methods (X-ray crystallography, maybe STEM, etc.) to verify.

gilleain 3 years ago

As a simple example, one measure used to compare a predicted structure against a reference is the RMSD (root mean square deviation).
https://en.m.wikipedia.org/wiki/Root-mean-square_deviation_o...
The lower the RMSD between two structures, the better (up to some limit).
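To make the RMSD measure above concrete, here is a minimal sketch in plain Python. It assumes the two structures are already superimposed and that atoms are listed in matching order; real comparisons normally do an optimal superposition first, which is left out here. The function name `rmsd` and the toy coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: both structures are already superimposed, and atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    if not coords_a:
        raise ValueError("structures must not be empty")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates (invented, not from any real protein).
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 0.0, 2.0)]

print(rmsd(reference, predicted))  # sqrt(5/3), roughly 1.29
print(rmsd(reference, reference))  # 0.0 for identical structures
```

An RMSD of 0 means the paired atoms coincide exactly; larger values mean the prediction sits further from the reference on average.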
iandanforth 3 years ago

"Verify" is almost correct. The crystallography data is taken to be "ground truth" and the predicted protein structure from AlphaFold is taken to be a good guess starting point. Then other software can produce a model that is a best fit to the ground truth data starting from the good guess. So even if the guess is wrong in detail it's still useful to reduce the search space.
christudor 3 years ago

This is exactly right.

christudor 3 years ago

This video goes some way to explaining how they know the structures are correct: https://www.youtube.com/watch?v=vXZzftX03VY

tomrod 3 years ago

This is the right line of questioning.

As we gain visibility into the complex coding of proteins, we need to be right. Next, hopefully, comes causal effect identification, then construction ability.

If medicine can use broad capacity to create bespoke proteins, our world becomes both weird and wonderful.

seydor 3 years ago

They don't, but their predictions are more accurate than what others have predicted. Some of their predictions can be compared with structures determined by X-ray crystallography.

cupofpython 3 years ago
Did they come up with their structures independently of the X-ray crystallography, or was that part of an ML dataset for predicting structure?
- unlikelymordant 3 years ago
  
  The CASP competition that they won consists of a bunch of new proteins, the structures of which haven't been published. So the test set is made up of brand new proteins in that case.
  

__rito__ 3 years ago

They won a decades-old challenge by predicting the structures of a much smaller (yet still quite large) set of proteins using a model (AlphaFold).

Then they used the model to predict more.

Although we don't know if they are correct, these structures are the best (or the least bad) we have for now.

ArnoVW 3 years ago

We know the structure of some proteins. It's not that it's impossible to measure, it's just very expensive. This is why having a model that can "predict" it is so useful.

DevX101 3 years ago

They compare the predicted structure (computed) to a known structure (physical X-ray crystallography). There's a biennial competition, CASP (Critical Assessment of protein Structure Prediction), built around proteins whose structures have been determined experimentally but not yet published. The organizers keep these structures secret. Research teams across the world then attempt to predict, without advance knowledge, the structure of each protein from its amino acid sequence. Think of CASP as a validation data set used to evaluate a machine learning model.

DeepMind crushes everyone else at this competition.
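The "validation data set" analogy above can be sketched as a tiny blind evaluation: the organizers hold reference structures that no team has seen, each team submits predictions, and teams are ranked by how close they got. This is only an illustration under loud assumptions: it scores with plain RMSD on pre-aligned coordinates (CASP itself relies on other measures such as GDT_TS), and the team names, target names, and coordinates are all invented.

```python
import math


def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) points (pre-aligned)."""
    squared = sum(
        (p - q) ** 2 for point_a, point_b in zip(a, b) for p, q in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(a))


def rank_teams(hidden_references, submissions):
    """Return (team, mean RMSD) pairs, best (lowest) first.

    Assumption: every team submitted a prediction for every target.
    """
    scores = []
    for team, predictions in submissions.items():
        per_target = [
            rmsd(reference, predictions[target])
            for target, reference in hidden_references.items()
        ]
        scores.append((team, sum(per_target) / len(per_target)))
    return sorted(scores, key=lambda pair: pair[1])


# Hypothetical held-out structures, known only to the organizers.
hidden_references = {
    "target_1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "target_2": [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
}

# Hypothetical submissions from two made-up teams.
submissions = {
    "team_a": {
        "target_1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        "target_2": [(0.0, 0.0, 0.0), (0.0, 2.0, 1.0)],
    },
    "team_b": {
        "target_1": [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
        "target_2": [(0.0, 0.0, 0.0), (0.0, 2.0, 2.0)],
    },
}

ranking = rank_teams(hidden_references, submissions)
for team, mean_rmsd in ranking:
    print(team, round(mean_rmsd, 3))
```

The key property is that the references never enter training, so the ranking reflects how well a model generalizes rather than how well it memorized.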

liuliu 3 years ago

The worry is about dataset shift. Previously, data were collected for a few hundred thousand structures; now it is 200M. I think there could be doubts about how the distributions differ and how that could affect prediction accuracy.