Comment by f38zf5vdt

2 years ago

The press release reads like an absurdity. It's not the "protein universe", it's the "list of presumed globular proteins Google found and some inferences about their structure as given by their AI platform".

Proteins don't exist as crystals in a vacuum; that's just how humans solved the structures. Many of the non-globular proteins were solved using sequence manipulation or other tricks to get them to crystallize. Virtually all proteins exist to have their structures interact dynamically with the environment.

Google is simply supplying a list of what it presumes to be low RMSD models based on their tooling, for some sequences they found, and the tooling is itself based on data mostly from X-ray studies that may or may not have errors. Heck, we've barely even sequenced most of the DNA on this planet, and with mechanisms like alternative splicing the transcriptome and hence proteome has to be many orders of magnitude larger than what we have knowledge of.

But sure, Google has solved the structure of the "protein universe", whatever that is.
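
For anyone unfamiliar with the term: RMSD here is the root-mean-square deviation between matched atom positions of a model and a reference structure. A minimal sketch, assuming the two coordinate sets are already superposed and listed in the same atom order (real comparisons first fit one structure onto the other, e.g. with the Kabsch algorithm). The function name `rmsd` and the toy coordinates are illustrative, not taken from any particular tool.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) tuples.

    Assumes the structures are already superposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))

# Toy example: two "atoms", each displaced by 1 unit along x.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(rmsd(model, reference))
```

A low value only says the model is close to one particular reference conformation; it says nothing about the dynamics mentioned above.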

People have been making grand statements about the structure of the protein universe for quite some time (I've seen a fair number of papers on this, such as https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... and https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... from a previous collaborator of mine).

Google didn't solve the structure of the protein universe (thank you for saying that). But the idea of the protein structure universe is fairly simple: it's a latent space that allows for direct movement over what are presumably the rules of protein structures along orthogonal directions. It would encompass all the "rules" in a fairly compact and elegant way. Presumably, superfamilies would automagically cluster in this space, and proteins in different superfamilies would not.

I recognize your superior knowledge in the topic and assume you're right.

But you also ignore where we're at in the standard cycle:

https://phdcomics.com/comics/archive_print.php?comicid=1174

;)

  • That's exactly what this is, but it's embarrassing that it's coming from somewhere purported to be a lab. Any of the hundreds or more labs working in protein structure prediction for the past 50 years could have made this press release at any time and said, "look, we used a computer and it told us these are the structures, we solved the protein universe!"

    It's not to diminish the monumental accomplishment that was the application of modern machine learning techniques to outpace structure prediction in labs, but other famous labs have already moved to ML predictions and are competitive with DeepMind now.

    • > but other famous labs have already moved to ML predictions and are competitive with DeepMind now.

      That's great! AlphaFold DB has made 200 million structure predictions available for everyone. How many structure predictions have other famous labs made available for everyone?

edit: I should have read the post first! What do you mean 'only globular proteins'? They say they have predictions for all of UniProt...

---------------

Yes, the idea of a 'protein universe' seems like it should at least encompass 'fold space'.

For example, WR Taylor: https://pubmed.ncbi.nlm.nih.gov/11948354/

I think the rough estimate was that there were around 1000 folds - depending on how fine-grained you want to go.

Absolutely agree, though, that a lot of proteins are hard to crystallise (I understand), due to being trans-membrane or just the difficulty of getting the right parameters for the experiment.

  • I don't think non-globular proteins are well represented by the predictions. All our predictions for proteins are based on proteins we were able to crystallize. So my guess is that even if many of the targets aren't globular proteins, the predictions are built on the foundations of the structures we do have, which are predominantly globular proteins. Presumably the inference treats folding as if the proteins were globular and crystallized (non-dynamic). X-ray crystallography and fitting to electron density maps is itself a bit of an art form.

    For example, for transmembrane proteins there is a gross under-representation of structures derived from experimental evidence, so we would expect that whatever your algorithm is "solving" is going to have a much higher degree of error than for globular proteins, and likely artifacts associated with learning from the much more abundant globular proteins.

    edit: As an example, "Sampling the conformational landscapes of transporters and receptors with AlphaFold2". AF2 was able to reproduce the alternative conformations of GPCRs, but only with non-default settings. With default settings there is clear evidence of overfitting.

    > Overall, these results demonstrate that highly accurate models adopting both conformations of all eight protein targets could be predicted with AF2 by using MSAs that are far shallower than the default. However, because the optimal MSA depth and choice of templates varied for each protein, they also argue against a one-size-fits-all approach for conformational sampling.
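
To make "MSAs that are far shallower than the default" concrete: the idea is to hand the predictor only a small random subset of the aligned sequences, always keeping the query. A minimal sketch of that subsampling step, assuming the MSA is a plain list of aligned strings with the query first. The function `subsample_msa`, its parameters, and the toy alignment are hypothetical; this is not AF2's own interface or the paper's exact procedure.

```python
import random

def subsample_msa(msa, depth, seed=0):
    """Return a shallower MSA with at most `depth` sequences.

    msa   : list of aligned sequences; msa[0] is assumed to be the query.
    depth : number of sequences to keep, including the query.
    seed  : seed for reproducible random sampling.
    """
    if not msa:
        raise ValueError("MSA is empty")
    if depth < 1:
        raise ValueError("depth must be at least 1 (the query itself)")
    query, others = msa[0], msa[1:]
    rng = random.Random(seed)
    # Sample without replacement; cap at the number of sequences available.
    kept = rng.sample(others, min(depth - 1, len(others)))
    return [query] + kept

# Toy alignment: one query plus nine homologs (made-up sequences).
toy_msa = ["MKTAYIAK"] + ["MKT-YIA" + c for c in "ACDEFGHIK"]
shallow = subsample_msa(toy_msa, depth=4, seed=1)
print(len(shallow), shallow[0])
```

Running this with several seeds and depths gives a set of different shallow MSAs to predict from, which fits the quote's point that the best depth varied per protein.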

    • Fair point. I guess if their training data is biased towards existing known structures (via X-ray or NMR or whatever) then there is the risk of incorrect predictions.

      At a guess, the core packing in non-globular proteins might be different? The distribution of secondary structure might also vary between classes. Might be worth someone studying how much structural constraints depend on fold (if they have not already).