Comment by photochemsyn

1 day ago

This does reveal the weakness of AlphaFold approaches for answering questions like “what is possible in the protein folding space if you use the 20 canonical amino acids” since the data used to train AlphaFold is limited to existing experimentally determined protein structures.

We don’t even know whether this is like body plans (four legs for mammals, why not six?). That is, does the common set of folds reflect physical limitations of the folding space, with evolution having explored most of that space and held onto the most useful folds, or is it one of those accident-of-history results? Then there’s the issue that folding takes place as the protein chain exits the ribosomal tunnel, so that’s a whole other constraint on what kinds of folds might be selected. For that matter, why not other genetically encoded complex amino acids instead of just the canonical set?

Also, a common evolutionary process in eukaryotes is duplication of protein sequences and shuffling of code blocks that might represent folding domains. That might tend to lock in the existing collection of folds rather than generate novel ones, though that’s not so clear.

This weakness of AlphaFold has some modern practical relevance, since non-canonical amino acids and modified proteins are increasingly used medically, and their structures mostly seem to be determined using direct experimental methods, e.g.:

https://pmc.ncbi.nlm.nih.gov/articles/PMC10296201/

“Non-Canonical Amino Acids as Building Blocks for Peptidomimetics: Structure, Function, and Applications” (2023)

> since the data used to train AlphaFold is limited to existing experimentally determined protein structures

Protein sequences, but the point still stands.