How can computer predictions of protein folding be verified computationally?


Currently, there is a lot of research focused on solving the folding patterns of proteins using computers.

The question that I have is: How do you know when you get it right? Is there some way of verifying, in silico, that you have found a legitimate/correct structure for a protein?


Modelling has come on in leaps and bounds over the last decade or so, and in many cases it can serve as a viable, inexpensive substitute for experimental structures.

How do you know when you get it right?

Ultimately, one still needs experimental evidence to know when a model generated in silico is right. But there are ways of scoring a model for how likely it is to be right.

Is there some way of verifying, in silico, that you have found a legitimate/correct structure for a protein?

There are lots of ways to score and verify your models. Each method tells you something slightly different about the merits, or lack thereof, of your structural model. Some are designed to weed out the obviously awful models and some allow you to detect exactly where your model looks to be accurate or inaccurate.

MODELLER Homology modelling output verification on the fly.

I am most familiar with MODELLER for homology modelling. Other software packages are available, and prediction methods have been evaluated by CASP every two years since 1994.

In homology modelling there are 3 common scoring systems that can be used to assess the biochemical viability of a model. This email covers when to use each one. My answer expands and explains a bit more.

molpdf is the MODELLER objective function. GA341, discussed here, is derived from a Z-score (calculated with a statistical potential function), the target-template sequence identity, and a measure of structural compactness. DOPE is a more up-to-date method, first published in 2006, and is more true to "biological viability". From the publication:

DOPE is based on an improved reference state that corresponds to noninteracting atoms in a homogeneous sphere with the radius dependent on a sample native structure; it thus accounts for the finite and spherical shape of the native structures.

Which to use depends on what you want to do with the model, but of those three scores, DOPE is the most reliable at separating native-like models from "decoys". DOPE is usually the starting place for figuring out which models might be right and which models are just plain rubbish.
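As a rough illustration of using DOPE as a first filter, the sketch below ranks a handful of candidate models by their overall DOPE score, treating a lower (more negative) score as better. The file names and score values are made up for the example; in practice they would come from the modelling program's own output.

```python
# Hypothetical model names and DOPE scores, invented purely for illustration.
models = [
    {"name": "model_01.pdb", "dope": -41250.3},
    {"name": "model_02.pdb", "dope": -42875.9},
    {"name": "model_03.pdb", "dope": -39980.1},
]


def rank_by_dope(candidates):
    """Return the candidates sorted best-first (lowest DOPE score first)."""
    return sorted(candidates, key=lambda model: model["dope"])


ranked = rank_by_dope(models)
best = ranked[0]
print("Best-scoring model:", best["name"])
```

A ranking like this only narrows the field; the best overall score still says nothing about whether the particular region you care about is modelled well.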

Note: If you use Rosetta then there will be equivalents to these, or you can run your generated models through these techniques. If you are using SWISS-MODEL, that comes with its own somewhat black-box verification techniques, but you can still export the model for further verification.

General model check against experimental data.

A further validation tool for homology models or other structural models is ProSA. ProSA provides a great visual representation of where the model's z-score lies amongst actual crystal and NMR structures. There are probably others that perform similar functions, but this is my personal go-to to get an idea of where my structure lies among experimentally determined structures.

Sensitive residue by residue verification.

Although the aforementioned methods examine each residue, they usually output an overall score. Residue-by-residue scores are also available and require a lot of careful interpretation. For example, if you are analysing catalytic activity, a surface loop region that scores poorly might not be an issue, but a core catalytic residue that scores poorly renders the model useless. This means that just because your model has a better (lower) overall DOPE score than another model, it is not necessarily a more accurate model for what you are interested in.

There are plenty of sensitive model-scoring systems, including XdVal, MTZdump, the famous albeit old-school Ramachandran plot, pdbU, pdbSNAFU, PROCHECK, Verify3D, and ERRAT, to name a few. Each has a place when checking how correct your model is.
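To make the Ramachandran check a little more concrete: the plot is a scatter of the two backbone dihedral angles of each residue, phi (defined by the atoms C of the previous residue, N, CA, C) and psi (N, CA, C, N of the next residue). Below is a minimal sketch of the dihedral calculation in plain Python. The coordinates in the usage lines are simple made-up points chosen so the geometry is easy to check by eye, not real protein atoms.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points.

    For phi pass (C of previous residue, N, CA, C);
    for psi pass (N, CA, C, N of next residue).
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Made-up points: the last point is rotated a quarter turn or a half turn
# around the central bond relative to the first.
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
half_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(quarter_turn, half_turn)
```

Residues whose (phi, psi) pair falls far from the regions populated by experimental structures are the ones worth inspecting first.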

At this point, it must be verified experimentally.

In this Foldit research paper, they use software and user input to design essentially an enhanced version of a naturally occurring protein, but they then physically make their new protein and determine its structure experimentally, using X-ray crystallography. Overall, they use a lot of trial and error.

Projects like this are geared towards the goal of being able to determine a protein's structure from its amino acid sequence in silico. Once we achieve this ability, it will be revolutionary. However, it is very difficult, because making such predictions accurately would require the use of quantum mechanics in a way that is extremely difficult to model computationally. These projects use shortcuts to get around this problem, so their results aren't very accurate, but they can be accurate enough to be useful, as shown in that paper.

Computer-based redesign of a protein folding pathway

A fundamental test of our current understanding of protein folding is to rationally redesign protein folding pathways. We use a computer-based design strategy to switch the folding pathway of protein G, which normally involves formation of the second, but not the first, β-turn at the rate-limiting step in folding. Backbone conformations and amino acid sequences that maximize the interaction density in the first β-hairpin were identified, and two variants containing 11 amino acid replacements were found to be ∼4 kcal mol⁻¹ more stable than wild type protein G. Kinetic studies show that the redesigned proteins fold ∼100× faster than wild type protein and that the first β-turn is formed and the second disrupted at the rate-limiting step in folding.


Prediction of protein folding rates from amino acid sequences is one of the most important challenges in molecular biology. In this work, I have related the protein folding rates to physical-chemical, energetic and conformational properties of amino acid residues. I found that the classification of proteins into different structural classes shows an excellent correlation between amino acid properties and folding rates of two- and three-state proteins, indicating the importance of native state topology in determining the protein folding rates. I have formulated a simple linear regression model for predicting the protein folding rates from amino acid sequences along with structural class information and obtained an excellent agreement between predicted and experimentally observed folding rates of proteins: the correlation coefficients are 0.99, 0.96 and 0.95, respectively, for all-α, all-β and mixed class proteins. This is the first available method that is capable of predicting the protein folding rates just from the amino acid sequence with the aid of generic amino acid properties and structural class information.



Anfinsen’s Thermodynamic Hypothesis

A major milestone in protein science was the thermodynamic hypothesis of Christian Anfinsen and colleagues (3, 92). From his now-famous experiments on ribonuclease, Anfinsen postulated that the native structure of a protein is the thermodynamically stable structure; it depends only on the amino acid sequence and on the conditions of solution, and not on the kinetic folding route. It became widely appreciated that the native structure does not depend on whether the protein was synthesized biologically on a ribosome or with the help of chaperone molecules, or if, instead, the protein was simply refolded as an isolated molecule in a test tube. [There are rare exceptions, however, such as insulin, α-lytic protease (203), and the serpins (227), in which the biologically active form is kinetically trapped.] Two powerful conclusions followed from Anfinsen’s work. First, it enabled the large research enterprise of in vitro protein folding that has come to understand native structures by experiments inside test tubes rather than inside cells. Second, the Anfinsen principle implies a sort of division of labor: Evolution can act to change an amino acid sequence, but the folding equilibrium and kinetics of a given sequence are then matters of physical chemistry.

One Dominant Driving Force or Many Small Ones?

Prior to the mid-1980s, the protein folding code was seen as a sum of many different small interactions, such as hydrogen bonds, ion pairs, van der Waals attractions, and water-mediated hydrophobic interactions. A key idea was that the primary sequence encoded secondary structures, which then encoded tertiary structures (4). However, through statistical mechanical modeling, a different view emerged in the 1980s, namely, that there is a dominant component to the folding code, that it is the hydrophobic interaction, that the folding code is distributed both locally and nonlocally in the sequence, and that a protein’s secondary structure is as much a consequence of the tertiary structure as a cause of it (48, 49).

Because native proteins are only 5–10 kcal/mol more stable than their denatured states, it is clear that no type of intermolecular force can be neglected in folding and structure prediction (238). Although it remains challenging to separate in a clean and rigorous way some types of interactions from others, here are some of the main observations. Folding is not likely to be dominated by electrostatic interactions among charged side chains because most proteins have relatively few charged residues; they are concentrated in high-dielectric regions on the protein surface. Protein stabilities tend to be independent of pH (near neutral) and salt concentration, and charge mutations typically lead to small effects on structure and stability. Hydrogen-bonding interactions are important, because essentially all possible hydrogen-bonding interactions are generally satisfied in native structures. Hydrogen bonds among backbone amide and carbonyl groups are key components of all secondary structures, and studies of mutations in different solvents estimate their strengths to be around 1–4 kcal/mol (21, 72) or stronger (5, 46). Similarly, tight packing in proteins implies that van der Waals interactions are important (28).

However, the question of the folding code is whether there is a dominant factor that explains why any two proteins, for example, lysozyme and ribonuclease, have different native structures. This code must be written in the side chains, not in the backbone hydrogen bonding, because it is through the side chains that one protein differs from another. There is considerable evidence that hydrophobic interactions must play a major role in protein folding. (a) Proteins have hydrophobic cores, implying nonpolar amino acids are driven to be sequestered from water. (b) Model compound studies show 1–2 kcal/mol for transferring a hydrophobic side chain from water into oil-like media (234), and there are many of them. (c) Proteins are readily denatured in nonpolar solvents. (d) Sequences that are jumbled and retain only their correct hydrophobic and polar patterning fold to their expected native states (39, 98, 112, 118), in the absence of efforts to design packing, charges, or hydrogen bonding. Hydrophobic and polar patterning also appears to be a key to encoding of amyloid-like fibril structures (236).

What stabilizes secondary structures? Before any protein structure was known, Linus Pauling and colleagues (180, 181) inferred from hydrogen-bonding models that proteins might have α-helices. However, secondary structures are seldom stable on their own in solution. Although different amino acids have different energetic propensities to be in secondary structures (6, 41, 55, 100), there are also many “chameleon” sequences in natural proteins, which are peptide segments that can assume either helical or β conformations depending on their tertiary context (158, 162). Studies of lattice models (25, 29, 51) and tube models (11, 12, 159) have shown that secondary structures in proteins are substantially stabilized by the chain compactness, an indirect consequence of the hydrophobic force to collapse (Figure 1). Like airport security lines, helical and sheet configurations are the only regular ways to pack a linear chain (of people or monomers) into a tight space.

(a) Binary code. Experiments show that a primarily binary hydrophobic-polar code is sufficient to fold helix-bundle proteins (112). Reprinted from Reference 112 with permission from AAAS.

(b) Compactness stabilizes secondary structure, in proteins, from lattice models. (c) Experiments supporting panel b, showing that compactness correlates with secondary structure content in nonnative states of many different proteins (218). Reprinted from Reference 218 with permission.

Designing New Proteins and Nonbiological Foldamers

Although our knowledge of the forces of folding remains incomplete, it has not hindered the emergence of successful practical protein design. Novel proteins are now designed as variants of existing proteins (43, 94, 99, 145, 173, 243), or from broadened alphabets of nonnatural amino acids (226), or de novo (129) ( Figure 2 ). Moreover, folding codes are used to design new polymeric materials called foldamers (76, 86, 120). Folded helix bundles have now been designed using nonbiological backbones (134). Foldamers are finding applications in biomedicine as antimicrobials (179, 185), lung surfactant replacements (235), cytomegalovirus inhibitors (62), and siRNA delivery agents (217). Hence, questions of deep principle are no longer bottlenecks to designing foldable polymers for practical applications and new materials.

(a) A novel protein fold, called Top7, designed by Kuhlman et al. (129). Designed molecule (blue) and the experimental structure determined subsequently (red). From Reference 129 reprinted with permission from AAAS. (b) Three-helix bundle foldamers have been made using nonbiological backbones (peptoids, i.e., N-substituted glycines).

(c) Their denaturation by alcohols indicates they have hydrophobic cores characteristic of a folded molecule (134).

No, DeepMind has not solved protein folding

This week DeepMind announced that, using artificial intelligence (AI), it has solved the 50-year-old problem of ‘protein folding’. The announcement was made as the results were released from the 14th and latest competition on the Critical Assessment of Techniques for Protein Structure Prediction (CASP14). The competition pits teams of computational scientists against one another to see whose method is the best at predicting the structures of protein molecules – and DeepMind’s solution, ‘AlphaFold 2’, emerged as the clear winner.

Don’t believe everything you read in the media

There followed much breathless reporting in the media that AI can now be used to accurately predict the structures of proteins – the molecular machinery of every living thing. Previously the laborious experimental work of solving protein structures was the domain of protein crystallographers, NMR spectroscopists and cryo-electron microscopists, who worked for months and sometimes years to work out each new structure.

Should the experimentalists now all quit the lab and leave the field to DeepMind?

No, they shouldn’t, for several reasons.

Firstly, there is no doubt that DeepMind have made a big step forward. Of all the teams competing against one another they are so far ahead of the pack that the other computational modellers may be thinking about giving up. But we are not yet at the point where we can say that protein folding is ‘solved’. For one thing, only two-thirds of DeepMind’s solutions were comparable to the experimentally determined structure of the protein. This is impressive but you have to bear in mind that they didn’t know exactly which two-thirds of their predictions were closest to correct until the comparison with experimental solutions was made.* Would you buy a satnav that was only 67% accurate?

So a dose of realism is required. It is also difficult to see right now, despite DeepMind’s impressive performance, that this will immediately transform biology.

Impressive predictions – but how do you know they’re correct?

AlphaFold 2 will certainly help to advance biology. For example, as already reported, it can generate folded structure predictions that can then be used to solve experimental structures by crystallography (and probably other techniques). So this will help the science of structure determination go a bit faster in some cases.

However, despite some of the claims being made, we are not at the point where this AI tool can be used for drug discovery. For DeepMind’s structure predictions (111 in all), the average root-mean-square deviation (RMSD) in atomic positions between the prediction and the actual structure is 1.6 Å (0.16 nm). That’s about the size of a bond length.

That sounds pretty good but it’s not clear from DeepMind’s announcement how that number is calculated. It might be calculated only by comparing the positions of the alpha-Carbon atoms in the protein backbone – a reasonable way to estimate the accuracy of the overall fold of the protein. Or, it might be calculated over all the atomic positions, a much more rigorous test. If it is the latter, then an RMSD of 1.6 Å is an even more impressive result.
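The difference between the two ways of computing RMSD can be seen in a small sketch. It assumes the predicted and experimental structures are already superposed and list the same atoms in the same order (a real comparison would first find the optimal superposition), and the coordinates are made-up toy values rather than data from any CASP target.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes both structures are already superposed and atom order matches.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def select(structure, names=None):
    """Pick coordinates by atom name; names=None keeps every atom."""
    return [xyz for name, xyz in structure if names is None or name in names]


# Toy one-residue "structures": the backbone agrees, the side-chain atom does not.
experimental = [
    ("N", (0.0, 0.0, 0.0)),
    ("CA", (1.5, 0.0, 0.0)),
    ("C", (2.0, 1.4, 0.0)),
    ("CB", (2.1, -1.2, 0.8)),
]
predicted = [
    ("N", (0.0, 0.0, 0.0)),
    ("CA", (1.5, 0.0, 0.0)),
    ("C", (2.0, 1.4, 0.0)),
    ("CB", (2.1, -1.2, 2.8)),
]

ca_rmsd = rmsd(select(predicted, {"CA"}), select(experimental, {"CA"}))
all_atom_rmsd = rmsd(select(predicted), select(experimental))
print(f"C-alpha RMSD: {ca_rmsd:.2f} Å, all-atom RMSD: {all_atom_rmsd:.2f} Å")
```

In this toy case the alpha-carbon RMSD is zero even though a side-chain atom is misplaced by 2 Å, which is why the all-atom figure is the more demanding test.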

But it’s still not nearly good enough for delivering reliable insights into protein chemistry or drug design. To do that, we want to be confident of atomic positions to within a margin of around 0.3 Å. AlphaFold 2’s best prediction has an RMSD for all atoms of 0.9 Å. Many of the predictions contributing to their average of 1.6 Å will have deviations in atomic positions even greater than that. So, despite the claims, we’re not yet ready to use AlphaFold 2 to create new drugs.

There are other reasons not to believe that the protein folding problem is ‘solved’. AI methods rely on learning the rules of protein folding from existing protein structures. This means that it may find it more difficult to predict the structures of proteins with folds that are not well represented in the database of solved structures.

Also, as reported in Nature, the method cannot yet reliably tackle predictions of proteins that are components of multi-protein complexes. These are among the most interesting biological entities in living things (e.g. ribosomes, ion channels, polymerases). So there is quite a large territory remaining where AlphaFold 2 cannot take us. The experimentalists, who have been successful in mapping out the structures of complexes of growing complexity, still have a lot of valuable work to do.

While all of the above is supposed to sound a note of caution to counter some of the more hyperbolic claims that have been heard in the media in recent days, I still want to emphasise my admiration for the achievements of the AlphaFold team. They have clearly made a very significant advance.

That advance will be much clearer once their peer-reviewed paper is published (we should not judge science by press releases), and once the tool is openly available to the academic community – or indeed anyone who wants to study protein structure.

Update (02 Dec, 18:43): This post was updated to provide a clearer explanation of the RMSD measures used to compare predicted and experimentally determined protein structures. I am very grateful to Prof Leonid Sazanov who pointed out some necessary corrections and additions on Twitter.

*Update (12 Dec, 15:35): Strictly this is true, but it misses the more important point that the score given to each structure prediction (GDT_TS) broadly correlates with the closeness of its match to the experimental structure. As a result, I have deleted my SatNav crack.

For a deeply informed and very measured assessment of what DeepMind has actually achieved in CASP14, please read this blogpost by Prof. Mohammed AlQuraishi, who knows this territory much better than I do. His post is pretty long but you can skip the technical bits explaining how AlphaFold 2 works. He gives a very good account of the nature of DeepMind’s advance. In AlQuraishi’s view, AlphaFold 2 does represent a solution to the protein structure prediction problem, though he is careful to define what he means by a solution. He also acknowledges that there are still some significant improvements to be made to the programme, but regards these as more of an engineering challenge than a scientific one. He agrees that AlphaFold 2 won’t be used any time soon for drug design work. AlQuraishi also gives an excellent overview of the implications of this work for protein folders, structural biologists and biotechnologists in general, and offers some very interesting thoughts on the differences between DeepMind’s approach to research and that of more traditional academic groups.

Villin Headpiece

One of the best-studied examples of fast-folding proteins, wild type villin headpiece is known to fold in 4-5 microseconds; in addition, a fast-folding mutant exists which folds in under a microsecond. The villin headpiece has been the target of a wide variety of experimental and computational efforts to characterize its folding; however, at present, no atomic-scale predictions regarding the folding mechanism of villin headpiece have survived experimental scrutiny, and thus the details of the folding of this apparently simple model system remain unknown. Part of the challenge in computationally studying villin folding is undoubtedly a matter of resources: even for this fairly small system, until recently no full-length folding trajectories had been obtained.

We have performed a series of MD simulations of villin headpiece folding in explicit solvent in order to study the folding mechanism of villin and understand how folding is accelerated in the fast-folding mutant. In three separate trajectories, wild type villin was found to fold after 5-8 microseconds; these trajectories represent the first all-atom, explicit solvent MD simulations of villin folding on realistic timescales. The early stages of folding were very different between the trajectories, exploring a variety of different non-native conformations in each case. Near the end, however, all trajectories come to a common path: all of the secondary structure elements of the protein form, but arrive at a conformation where one of the helices is flipped relative to the rest of the protein (key steps in the transition are shown below). Folding can only occur after the helices fully dissociate from each other and then come back together in correct (i.e., folded) orientations. An example trajectory illustrates folding to a native state in 5.5 microseconds. The consistent folding path followed by the villin trajectories late in folding agrees with experimental findings that a single rate-limiting transition dominates the folding of the protein, and provides information on the nature of this transition that is impossible to obtain through other means. Based on the simulations we were able to identify a set of mutations on the flipped helix that would destabilize the trapped folding intermediate and thus are expected to accelerate folding.

Key steps in the transition from the flipped to folded structure in a WT villin folding simulation.

A fast-folding villin mutant

AI makes stunning protein folding breakthrough — but not all researchers are convinced

Within every biological body, there are thousands of proteins, each twisted and folded into a unique shape. The formation of these shapes is crucial to their function, and researchers have struggled for decades to predict exactly how this folding will take place.

Now, AlphaFold (from DeepMind, the lab whose AI mastered the games of chess and Go) seems to have solved this problem, essentially paving the way for a new revolution in biology. But not everyone’s buying it.

An AlphaFold prediction against the real thing.

What the big deal is

Proteins are essential to life, supporting practically all its functions, a DeepMind blog post reads. The Google-owned British artificial intelligence (AI) research lab became famous in recent years as its algorithms became the best chess player on the planet and even surpassed humans in Go, a feat once thought impossible. After toying with a few more games, the DeepMind team set its eyes on a real-life task: protein folding.

In 2018, the team announced that AlphaFold had become quite good at predicting the 3D shapes of proteins, surpassing all other algorithms. Now, two years later, AlphaFold 2 (the second version of the protein folding algorithm) seems to have been perfected even more.

In a global competition called Critical Assessment of protein Structure Prediction, or CASP, AlphaFold 2 and other systems are given the amino acid strings for proteins and asked to predict their shape. The competition organizers already know the actual shape of the protein, but of course, they keep it secret. Then, the prediction is compared to real-world results. DeepMind CEO Demis Hassabis calls this the “Olympics of protein folding” in a video.

AlphaFold nailed it. Not all its predictions were spot on, but all were very close — it was the closest thing to perfection ever seen since CASP kicked off.

“AlphaFold’s astonishingly accurate models have allowed us to solve a protein structure we were stuck on for close to a decade,” Andrei Lupas, Director of the Max Planck Institute for Developmental Biology and a CASP assessor, said in the DeepMind blog.

CASP uses the “Global Distance Test (GDT)” metric, assessing accuracy from 0 to 100. AlphaFold 2 achieved a median score of 92.4 across all targets, which translates to an average error of approximately 1.6 Angstroms, or about the width of an atom.
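For readers curious how a 0 to 100 score can come out of a structure comparison, here is a minimal sketch of the commonly quoted GDT_TS variant: the average, over cutoffs of 1, 2, 4 and 8 Å, of the percentage of alpha-carbon atoms that lie within that cutoff of their experimental positions. It takes per-residue distances from a single, already chosen superposition, whereas the real assessment searches over many superpositions, so treat it as an approximation. The distances are invented for the example.

```python
def gdt_ts(ca_distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS from per-residue C-alpha distances (in Å).

    Assumes one fixed superposition; the full method optimises the
    superposition separately for each cutoff.
    """
    if not ca_distances:
        raise ValueError("need at least one distance")
    n = len(ca_distances)
    percentages = [
        100.0 * sum(d <= cutoff for d in ca_distances) / n for cutoff in cutoffs
    ]
    return sum(percentages) / len(percentages)


# Hypothetical distances between predicted and experimental C-alpha atoms.
distances = [0.5, 0.9, 1.5, 3.0, 5.0, 9.0, 0.2, 0.7]
score = gdt_ts(distances)
print(f"GDT_TS (simplified): {score:.2f}")
```

Because the score counts residues within generous cutoffs rather than averaging squared errors, one badly placed loop lowers it less than it would raise an RMSD.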

Improvements have been slow in the protein folding competition. Image credits: DeepMind.

It’s not perfect. Even one Angstrom can be too big of an error and render the protein useless, or even worse. But the fact that it’s so close suggests that a solution is in sight. The problem has seemed unsolvable for so long that researchers were understandably excited.

“We have been stuck on this one problem – how do proteins fold up – for nearly 50 years. To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we’d ever get there, is a very special moment.”

Why protein folding is so important

It can take years for a research team to identify the shape of individual proteins — and these shapes are crucial for biological research and drug development.

A protein’s shape is closely linked to the way it works. If you understand its shape, you also have a pretty good idea of how it works.

Having a method to predict this rapidly and without hard and extensive work could usher in a revolution in biology. It’s not just the development of new drugs and treatments, though that would be motivation enough. Development of enzymes that could break down plastic, biofuel production, even vaccine development could all be dramatically sped up by protein folding prediction algorithms.

Essentially, protein folding has become a bottleneck for biological research, and it’s exactly the kind of field where AI could make a big difference, unlocking new possibilities that seemed impossible even a few years ago.

At a more foundational level, mastering protein folding can even get us closer to understanding the biological building blocks that make up the world. Professor Andrei Lupas, Director of the Max Planck Institute for Developmental Biology and a CASP assessor, commented that:

“AlphaFold’s astonishingly accurate models have allowed us to solve a protein structure we were stuck on for close to a decade, relaunching our effort to understand how signals are transmitted across cell membranes.”

Why not everyone is convinced

Frankly, the hype serves no one. DeepMind can now never live up to the promise that's been made and have dragged experimentalists through the mud in the process. Also, until DeepMind shares their code, nobody in the field cares and it's just them patting themselves on the back

Mike Thompson (@mctucsf), December 1, 2020

The announcement of DeepMind’s achievements sent ripples through the science world, but not everyone was thrilled. A handful of researchers raised the point that just because it works in the CASP setting, doesn’t really mean it will work in real life, where the possibilities are far more varied.

Speaking to Business Insider, Max Little, an associate professor and senior lecturer in computer science at the University of Birmingham, expressed skepticism about the real-world applications. Professor Michael Thompson, an expert in structural biology at the University of California, took to Twitter to express what he sees as unwarranted hype (see above), making the important point that the team at DeepMind hasn’t shared its code and hasn’t even published a scientific paper with the results. Thompson did say “the advance in prediction is impressive.” He added: “However, making a big step forward is not the same as ‘solving’ a decades-old problem in biology and chemical physics.”

Lior Pachter, a professor of computational biology at the California Institute of Technology, echoed these feelings. It’s an important step, he argued, but protein folding is not solved by any means.

A friend (who does not work in science) asked me today whether it is true that "protein folding has been solved". My short answer:

The AlphaFold method produced very impressive results on CASP14. Protein folding is not a solved problem.

Lior Pachter (@lpachter), December 1, 2020

Just how big this achievement is remains to be seen, but it’s an important one no matter how you look at it. Whether it’s a stepping stone or a true breakthrough is not entirely clear at this moment, but researchers will surely help clear this up as quickly as possible.


Computer simulation explains folding in cellular protein

Athens, Ga. – Most parts of living organisms come packaged with ribbons. The ribbons are proteins, chains of amino acids that must fold into three-dimensional structures to work properly. But when for any reason the ribbons fold incorrectly, bad things can happen, and in humans misfolded-protein disorders include Alzheimer’s and Parkinson’s diseases.

Scientists have for the past three decades tried to understand what makes proteins fold into functional units and why it happens, and several breakthroughs have occurred through computer modeling, a field that dramatically increases analytical speed.

Now, scientists at the University of Georgia have created a two-step computer simulation (using an important process called the Wang-Landau algorithm) that sheds light on how a crucial protein, glycophorin A, becomes an active part of living cells. The new use of Wang-Landau could lead to a better understanding of the controlling mechanisms behind protein folding.

“Our goal is to present the methodology in a clear, self-consistent way, accessible to any scientist with knowledge of Monte Carlo simulations,” said David Landau, distinguished research professor of physics at the University of Georgia and director of the Center for Simulational Physics.

The research was just published in The Journal of Chemical Physics. Authors of the paper are Clare Gervais and Thomas Wüst, formerly of UGA and now employed in Switzerland; Landau; and Ying Xu, Regents-Georgia Research Alliance Eminent Scholar and professor of bioinformatics and computational biology, also at UGA. The research was supported by grants from the National Institutes of Health and the National Science Foundation. Landau and Xu are in UGA’s Franklin College of Arts and Sciences.

“This work demonstrates the power and potential of combining expertise from computational physics and computational biology in solving challenging biological problems,” said Xu.

Monte Carlo simulations, which use algorithms with repeated random sampling to produce reliable predictions, have been around for some decades but have been steadily refined. These simulations are useful for extremely complex problems with multiple variables, and though they often require considerable computer “brain power,” they are able to give scientists startlingly accurate predictions of how biological processes work.
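To make "repeated random sampling" concrete, here is a minimal Metropolis Monte Carlo sketch in Python. It is not the UGA team's code: the toy system (independent sites that each carry energy 0 or 1), the function names, and all parameter values are assumptions chosen so that the sampled average can be checked against a known exact answer.

```python
import math
import random


def metropolis_mean_energy(n_sites=20, beta=1.0, n_steps=200_000,
                           burn_in=2_000, seed=0):
    """Estimate the average total energy of a toy system by Metropolis sampling.

    Toy model (an assumption for illustration): n_sites independent sites,
    each either in state 0 (energy 0) or state 1 (energy 1).
    """
    rng = random.Random(seed)
    state = [0] * n_sites
    energy = 0
    total = 0.0
    samples = 0
    for step in range(n_steps):
        i = rng.randrange(n_sites)          # pick a random site
        delta = 1 - 2 * state[i]            # +1 for 0->1, -1 for 1->0
        # Metropolis rule: always accept downhill moves, accept uphill
        # moves with probability exp(-beta * delta).
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            state[i] ^= 1
            energy += delta
        if step >= burn_in:                 # skip the initial transient
            total += energy
            samples += 1
    return total / samples


def exact_mean_energy(n_sites=20, beta=1.0):
    """Exact Boltzmann average for the same toy model."""
    return n_sites / (1.0 + math.exp(beta))


if __name__ == "__main__":
    print("sampled:", metropolis_mean_energy())
    print("exact:  ", exact_mean_energy())
```

The point of the comparison is the same one raised by the original question: a sampled estimate is only trustworthy once it has been checked against something known independently.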

In the current paper, the research team developed a two-step Monte Carlo procedure to investigate, for glycophorin A (GpA), an important biochemical process called dimerization. (A dimer in biology or chemistry consists of two structurally similar units that are held together by intra- or intermolecular forces.)

“One particularly promising approach is to investigate the thermodynamics of protein folding through examining the energy landscape,” Landau explained. “By doing this, we can learn about the characteristics of proteins including possible folding pathways and folding intermediates. Thus, it allows us to bridge the gap between statistical and experimental results.”

Unfortunately, so much is happening physically and biochemically as proteins fold into their functional shapes (called the native state) that the problems must be broken down one by one and studied. That led the team to a question: Could they use a Monte Carlo simulation along with the Wang-Landau algorithm to discover an efficient simulation method capable of sampling the energy density states that allow such folding?

Perhaps remarkably, they did. The first step in studying the dimerization process was to estimate those states in GpA using Wang-Landau. The second step was to sample various energy and structural “observables” of the system to provide insights into the thermodynamics of the entire system.
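The two-step idea can be sketched on a toy system. The code below is not the authors' GpA procedure: it assumes a system of independent two-state sites whose density of states is known exactly (a binomial coefficient), and the function names, the 0.8 flatness criterion, and the halving schedule are common illustrative choices rather than details from the paper. Step one runs Wang-Landau to estimate the logarithm of the density of states; step two, simplified here to direct reweighting instead of a second sampling run, turns that estimate into a thermodynamic observable.

```python
import math
import random


def wang_landau(n_sites=10, lnf_final=1e-4, flatness=0.8, seed=1):
    """Step 1: estimate ln g(E) for a toy system with E = number of excited sites."""
    rng = random.Random(seed)
    state = [0] * n_sites
    energy = 0
    ln_g = [0.0] * (n_sites + 1)   # running estimate of ln g(E)
    hist = [0] * (n_sites + 1)     # visit histogram for the flatness check
    lnf = 1.0                      # ln of the modification factor
    steps = 0
    while lnf > lnf_final:
        i = rng.randrange(n_sites)
        new_energy = energy + (1 - 2 * state[i])
        # Accept with probability min(1, g(E_old) / g(E_new)).
        log_ratio = ln_g[energy] - ln_g[new_energy]
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state[i] ^= 1
            energy = new_energy
        ln_g[energy] += lnf
        hist[energy] += 1
        steps += 1
        if steps % 1000 == 0:
            mean_h = sum(hist) / len(hist)
            if min(hist) > flatness * mean_h:   # histogram is "flat enough"
                lnf /= 2.0
                hist = [0] * (n_sites + 1)
    base = ln_g[0]                 # normalise so that g(0) = 1
    return [x - base for x in ln_g]


def canonical_mean_energy(ln_g, beta):
    """Step 2 (simplified): reweight the estimated g(E) to get <E> at inverse temperature beta."""
    log_w = [lg - beta * e for e, lg in enumerate(ln_g)]
    shift = max(log_w)             # subtract the maximum for numerical stability
    weights = [math.exp(w - shift) for w in log_w]
    return sum(e * w for e, w in enumerate(weights)) / sum(weights)


if __name__ == "__main__":
    ln_g = wang_landau()
    for e, lg in enumerate(ln_g):
        print(e, round(lg, 3), round(math.log(math.comb(10, e)), 3))
    print("<E> at beta=1:", canonical_mean_energy(ln_g, 1.0))
```

Because the density of states does not depend on temperature, a single Wang-Landau run can be reused for any value of `beta`, which is one reason this family of methods is attractive for thermodynamic studies.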

The results could be broadly applied to many fields of protein-folding studies that are important to understanding and treating certain diseases. (Wang-Landau, named for David Landau and Fugao Wang, is a Monte Carlo algorithm that has proved to be useful in studying a variety of physical systems. Wang was a doctoral student at UGA and now works for Intel Corp.)

GpA is a 131-amino acid protein that spans the human red blood cell membrane and is crucial in cell processes. Because it has been studied in depth for many years, it also serves as an important model system for how similar systems work. That’s why the new simulation may open doors in many other areas of inquiry.

“The main advantage of this two-step approach lies in its flexibility as well as its generality,” said Landau. “This method is widely applicable to any study of biological systems, such as the folding process of soluble proteins, polymers, DNA or protein complexes. Therefore, it is an excellent alternative to other simulation methods used traditionally in the field of protein-folding thermodynamics.”

In the current study, the team also discovered something important about membrane proteins in general. They found that unlike some proteins for which folding is mainly governed by their attraction to or repulsion by water, the process in GpA is driven by a subtle interplay between multiple types of interactions.

Part B: How to (almost) Fold (almost) Anything

In this part you will be folding protein sequences into 3D structures. The goal is to get an understanding of how computational protein modeling works, as well as to see firsthand the great computing power needed for molecular simulations in biology.

For questions 1 and 2 you will be using the Python version of the Rosetta protein structure prediction software, while for question 3 (extra credit) you can use any of the available software listed in the resources.

The files for this exercise are available to clone or download from the following GitHub repository:


1. Folding a small (30 aa) peptide. Follow the "Setting up PyRosetta" instructions below and make sure you have a working PyRosetta installation.

a. Open the "Protein Folding with Pyrosetta" Jupyter notebook. Run the code in the notebook interactively and answer the questions therein. When you are done, save the notebook (with the answers and all outputs) to an HTML file, and link it to your class page.

b. Pick the lowest energy model and structurally (visually) compare it to the native. How close is it to the native? If it's different, what parts did the computer program get wrong? Note: To compare the structures you first have to align them to the native. You can do that very easily in PyMOL. Here is a short video tutorial on aligning structures with PyMOL.

c. Pick the lowest RMSD model and structurally compare it to the native. How close is it to the native? If it's different from the lowest energy model, how is it different? Remember that in a blind case, we will not have the benefit of an RMSD column.
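For readers who want to see what "align, then measure RMSD" involves, here is a minimal NumPy sketch of the Kabsch superposition. It is an illustration, not what the notebook or PyMOL runs internally: the function name `kabsch_rmsd` is hypothetical, and it assumes the two structures are already given as matched N x 3 coordinate arrays (for example, one C-alpha atom per residue in the same order). PyMOL's own alignment does more, including a sequence alignment and outlier rejection.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two matched coordinate sets after optimal superposition.

    P, Q: arrays of shape (N, 3) with atoms in the same order (assumed).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centring both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Optimal rotation from the SVD of the 3x3 covariance matrix.
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection).
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    P_rot = Pc @ R.T                       # rotate P onto Q
    diff = P_rot - Qc
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    native = rng.normal(size=(30, 3))      # stand-in for native coordinates
    angle = 0.7
    rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                      [np.sin(angle),  np.cos(angle), 0.0],
                      [0.0, 0.0, 1.0]])
    moved = native @ rot_z.T + np.array([1.0, -2.0, 3.0])
    print(kabsch_rmsd(native, moved))      # close to zero: same shape, different pose
```

The reflection guard matters for proteins: without it, a mirror-image model could superimpose perfectly onto the native even though it has the wrong handedness.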

2. Fold your own sequence! In question 1 we used the sequence from a human protein as input to the folding algorithm. Yet, in principle, you can give any arbitrary sequence of amino acids as an input.

a. Use any process to create a sequence of 30-50 amino acids, and predict its 3D structure using the notebook from Q1. You can try to run the script with multiple parameter combinations and compare the results. Log the parameters that had the best outcome.

b. Compare the resulting structures of 2(a) with those from question 1. Do the structures in both cases look protein-like? If not, can you think of an explanation?

c. Try folding multiple sequences to come up with the most protein-looking structure!
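One possible "any process" for question 2 is to draw residues at random. The sketch below is only an illustration: the function name `random_peptide` is hypothetical, and sampling the 20 standard amino acids uniformly is an assumption, not something the exercise prescribes.

```python
import random

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptide(min_len=30, max_len=50, seed=None):
    """Return a random amino acid sequence with length in [min_len, max_len].

    Residues are drawn uniformly (an illustrative assumption); passing a
    seed makes the result reproducible so the run can be logged.
    """
    rng = random.Random(seed)
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


if __name__ == "__main__":
    for seed in range(3):
        print(seed, random_peptide(seed=seed))
```

Recording the seed alongside the folding parameters makes it possible to regenerate exactly the sequence that gave the best result.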

3. Folding protein homologs (extra credit). For this exercise you will be running multiple protein folding simulations. If you don't have access to a powerful machine, use any of the folding servers listed in the resources.

a. Take the protein sequence from question 1 and randomly change 5 letters to any other amino acid. Predict the protein structure of the unedited (probably done already in Q.1) and edited protein and compare the results. Did the changes you introduced change the structure significantly?

b. Take the original sequence from Q.1 again and now change 5 letters to favorable alternatives according to the BLOSUM matrix. Predict the protein structure for the new sequence and compare with the results of 3(a). Did the new changes have the same effect on the structure?

c. By using the BLOSUM matrix as a guide, try to introduce as many changes as possible to the protein sequence, without significantly changing its structure.
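The two kinds of edits in question 3 can be sketched as follows. This is an illustration under stated assumptions: the function name `mutate` is hypothetical, the example sequence is made up, and the `SIMILAR` table is a small hand-picked set of chemically similar residues standing in for a real BLOSUM lookup. For the actual exercise, the favorable substitutions should be read from the BLOSUM matrix itself.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hand-picked groups of chemically similar residues. This is an assumption
# for illustration only, NOT scores taken from a BLOSUM matrix.
SIMILAR = {
    "I": "LVM", "L": "IVM", "V": "ILM", "M": "ILV",
    "F": "YW", "Y": "FW", "W": "FY",
    "K": "R", "R": "K",
    "D": "E", "E": "D",
    "S": "T", "T": "S",
}


def mutate(seq, n=5, conservative=False, seed=None):
    """Return a copy of seq with exactly n positions changed.

    conservative=False: each chosen residue becomes any other amino acid.
    conservative=True:  each chosen residue becomes a similar one from SIMILAR.
    """
    rng = random.Random(seed)
    if conservative:
        candidates = [i for i, aa in enumerate(seq) if aa in SIMILAR]
    else:
        candidates = list(range(len(seq)))
    if len(candidates) < n:
        raise ValueError("not enough mutable positions in the sequence")

    residues = list(seq)
    for i in rng.sample(candidates, n):    # n distinct positions
        if conservative:
            options = SIMILAR[residues[i]]
        else:
            options = AMINO_ACIDS.replace(residues[i], "")
        residues[i] = rng.choice(options)
    return "".join(residues)


if __name__ == "__main__":
    original = "MKTLLVAGSDEIRYWFNQ"          # made-up example sequence
    print(original)
    print(mutate(original, seed=3))
    print(mutate(original, conservative=True, seed=3))
```

Keeping the number of changed positions fixed at five in both modes isolates the effect of which substitution is made from the effect of how many are made.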


Interplay between accurate protein structure prediction and successful de novo protein design.

Reviews current state-of-the-art structural protein prediction methods and challenges.

Reviews features of successful de novo protein designs.

Biotechnology applications in therapeutics, biocatalysts, and nanomaterials are summarized.

In the postgenomic era, the medical/biological fields are advancing faster than ever. However, before the power of full-genome sequencing can be fully realized, the connection between amino acid sequence and protein structure, known as the protein folding problem, needs to be elucidated. The protein folding problem remains elusive, with significant difficulties still arising when modeling amino acid sequences lacking an identifiable template. Understanding protein folding will allow for unforeseen advances in protein design, often referred to as the inverse protein folding problem. Despite challenges in protein folding, de novo protein design has recently demonstrated significant success via computational techniques. We review advances and challenges in protein structure prediction and de novo protein design, and highlight their interplay in successful biotechnological applications.