The impact of new protein AI tools on real-world biologics discovery

Protein AI tools are useful for aggressive protein design, while physics-based tools continue to be required for most re-design of natural proteins

By Steven Lewis, Dan Farrell, Brandon Frenz, Ragul Gowthaman, Ryan Pavlovicz, David Thieker, Indigo King, Sam DeLuca, Yifan Song, and Lucas Nivon

As new tools applying deep learning architectures to new problems come online daily, the field of artificial intelligence is replete with claims, ranging from the fairly pedestrian to assertions of having “solved” entire fields of human endeavor. Some of this hype is the result of academic creators’ excitement about their progeny, some is commercial marketing…and some is of course highly valid and well supported by the data! Take ChatGPT as an example: a savvy user will quickly conclude that while it is excellent at producing well-structured English text, its grasp on reality is much weaker – stories of grammatically correct but factually nonsensical interactions abound, for example in health advice and plagiarized writing. The media and PR claim that “this tool writes English” is valid, but it is an open question whether one should trust it to write something that is factually correct and well supported by other publications and data.

ChatGPT is wrong – Lucas Nivon is the CEO, Cyrus has never had a COO, and as far as we are aware John Atwater is a fictitious person in this context.  (ChatGPT is an evolving tool and your responses may vary if you try this experiment yourself.)

Written English is a great area in which to evaluate AI fallibility because even a layperson can easily judge quality – but these problems presumably recur in other domains of AI tooling, including the powerful tools for protein folding and sequence design like AlphaFold2, OpenFold, and ProteinMPNN. Repeated experiments and benchmarking have shown that they are good at their most direct intended use cases. In the case of AlphaFold2/OpenFold, that is prediction of the structures of naturally occurring sequences, using “blinded” 3D protein structures released as part of the CAMEO project to test the algorithms on input sequences they have never seen before. For ProteinMPNN, that is de novo or nearly de novo design problems – the creation of new protein sequences, followed by validation to determine whether those proteins have the intended structure. Most scientists agree that these tools are great for some use cases, but it is necessary to ask where things might go wrong – to understand when these tools produce a beautifully written essay, and when they produce the protein equivalent of our fictitious “John M. Atwater”. AlphaFold2/OpenFold will take any protein sequence as input – is it wise to query them with any sequence you might have, or are there some wrong “questions” to ask? For ProteinMPNN, are all design problems equally valid, or are only certain types of proteins designable, or only large sets of mutations computable?

One way to interpret the collective decades of computational biophysics experience at Cyrus is that we have learned to recognize when a computer is lying to us in the form of a nonsensical, non-physical, or otherwise unrealistic protein structure. With Rosetta specifically, we have world-class experience in knowing when to take the “computational” out of “computational protein design”. For example, for many years Rosetta had a tendency to place too many Trp residues on the surface of proteins, or to make mistakes with the backbone angles of certain aromatic residues – failures that only a handful of human experts could spot by eye, and for which Nobu Koga designed a specific set of flags to minimize the latter. We have been expanding this type of “protein design quality control” to the realm of AI tooling to understand the failure modes and limitations of popular tools – to know where we can trust the AI output as is (like a reasonable rhyming poem written by ChatGPT), when we should be dubious or filter the output (like the plagiarism problem), and when we should simply not trust the AI output whatsoever (“John M. Atwater is the CEO of Cyrus”).

AI structure prediction

Before AlphaFold1 was released in 2018, Rosetta was reliably a strong bet for protein structure prediction in all contexts, especially homology modeling problems via Hybridize (structure prediction where relatively high sequence identity homologs, over 30% sequence identity, exist in the PDB). For folding of soluble native monomers, classic Rosetta algorithms and their competitors lost their crown to AlphaFold1; the crown is currently shared between AlphaFold2 and OpenFold. These are sibling algorithms, independently trained, with similar performance; the chief differences are that OpenFold has been tuned to run faster on commercially available NVIDIA chips, and that its training code is openly released, allowing anyone to fine-tune the algorithm for their protein problems of interest. Because these algorithms are siblings and perform similarly in the experiments mentioned here, we will abbreviate the pair AF2/OF.

These tools were trained on the wealth of structurally determined proteins in the PDB, most of which are wild type (not mutated from the natural sequence cloned out of an organism) and all of which were sufficiently well behaved to be expressed and isolated in quantity for structure determination. As any bench scientist will tell you, not all proteins are so neatly behaved, so these proteins are by construction a biased sample of the actual proteome or structureome: poorly behaved proteins, or even proteins that are only well structured in their physiological context, are simply not included.

In the context of native proteins, AlphaFold2 already does valuable if unexpected things for inherently disordered regions – it leaves those regions as “spaghetti” looping all over the model, to use the technical term. These types of structures are extremely rare in the PDB, as disorder is not compatible with the ordered nature of a crystal. This loop-heavy output turns out to be rather a bonus, in that AlphaFold2 is useful as a disorder predictor (for example, here’s an early review). However, producing such wild output for a sequence that probably has no structure raises the question: how SHOULD the tool signal that the input sequence does not produce a well-ordered structure? To the untrained eye, these looped outputs might seem sensible, but to Cyrus scientists it is immediately apparent when a disordered region is being predicted, even though AF2 does not alert the user to this. It is also obvious to the biophysicist, but not necessarily the layperson, that the spaghetti regions are useful as space-filling envelopes but that the coordinates themselves are invalid. The lesson here is, first, that the AI tool gives reliably incorrect results (the atomic coordinates are wrong), and second, that the results can be useful anyway in a different, less obvious context.
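For readers who want to screen their own models this way, here is a minimal sketch of that kind of filter. It assumes the common convention that AlphaFold2/OpenFold write per-residue pLDDT into the B-factor column of the output PDB, and the cutoff of 50 is an illustrative choice rather than a calibrated threshold:

```python
# Minimal sketch: flag putative disordered residues in an AlphaFold2/OpenFold
# model by reading per-residue pLDDT, which these tools store in the B-factor
# column. The 50.0 cutoff is an illustrative assumption, not a calibrated one.

def low_plddt_residues(pdb_path: str, cutoff: float = 50.0) -> list[int]:
    """Return residue numbers whose CA-atom pLDDT falls below `cutoff`."""
    flagged = []
    with open(pdb_path) as handle:
        for line in handle:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                resnum = int(line[22:26])     # residue sequence number
                plddt = float(line[60:66])    # B-factor column holds pLDDT
                if plddt < cutoff:
                    flagged.append(resnum)
    return flagged

if __name__ == "__main__":
    print(low_plddt_residues("predicted_model.pdb"))
```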

Protein structure prediction: Small changes to natural proteins

The next step past disordered structure prediction, and one with which Cyrus is deeply concerned, is how these tools ought to respond to narrowly non-native sequence inputs. Small mutations to natural proteins that are intended to improve the efficacy or safety of a protein biologic candidate are the bread and butter of much of the $300B market for biologics, and at the core of Cyrus’s drug discovery efforts. 

In Cyrus’s view, after having worked with over 100 BioPharma firms, including 13 of the top 20 by annual sales, this type of optimization of natural proteins by introducing small numbers of mutations or creating novel fusion proteins with other natural proteins represents 99% of the therapeutic use of protein design tools in 2023, and is therefore where AI tools will deliver the most near-term impact on patient health.

The future will bring huge growth in that other 1%, de novo designed proteins unrelated to natural proteins, and Cyrus’s view is that in twenty years a large majority of biologics in clinical trials are likely to be de novo designed optimal proteins. For patients looking for cures in the next few years, however, the modification of natural proteins is of paramount importance. Nature Biotechnology recently covered this broader de novo space in a news article, including an interview with Cyrus’s CEO, Lucas Nivon. Next-next-generation therapeutics built on de novo protein tech are coming, but the next generation will be more nativelike, which forces us to ask how well AI tooling works on nearly native proteins.

The difference between a well-folded protein and misfolded, aggregated crud is often a single mutation. How do AlphaFold2 and OpenFold respond when challenged with nearly native proteins known to be poorly behaved in solution?

The answer turns out to be “not well”. Cyrus has a deep mutational scan data set for a certain human signaling protein, from which we selected deleterious mutations – because this is the subject of Cyrus IP, we’ll call this protein FOO. An ideal response from a structure prediction tool, when presented with a sequence that does not form a stable structure, would be either a non-structure output (essentially an error message – “this does not fold”), or at worst an output that is visibly malformed relative to the parent sequence’s structure, so a human user could immediately recognize “something has gone wrong, this is very likely a bad mutation”. A usable, but not ideal, response would be a glaring change to the confidence statistics, even if the predicted fold were maintained.

The actual response from AlphaFold2 and OpenFold to known destabilizing mutations is quite tepid. If you load 50 proven deleterious mutations simultaneously onto the <150 residue FOO sequence, AlphaFold2 and OpenFold show only mild deficits in pLDDT and some extra noise and variability in the predicted model population, even though this is an extremely poor protein that has no hope of expressing in vitro. When presented with a single deleterious mutation at a time there is basically no response, even though we know that each of those results in a protein that will not express in vitro. We interpret this insensitivity to deleterious mutations as template bias: WT FOO protein structure would have been present in the training data sets, and of course is present in the MSAs used for structure prediction. The structure prediction algorithm appears to place all its faith in the MSA and ignores the sequence deviations introduced by our known deleterious mutations. This makes sense for predicting native proteins known to be well-behaved enough to be useful to a cell, but unfortunately means these tools are not useful for judging the structural impact of mutations.
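For context, building the mutant inputs for this kind of experiment is straightforward: apply the point mutations (in the usual “A45P” notation) to the parent sequence before handing it to the predictor. A minimal sketch is below; the sequence and mutation names are hypothetical placeholders, not the actual FOO data:

```python
# Minimal sketch of building mutant inputs: apply point mutations written in
# "A45P" style (WT residue, 1-based position, new residue) to a parent
# sequence. Sequence and mutations below are placeholders, not real FOO data.

def apply_mutations(sequence: str, mutations: list[str]) -> str:
    seq = list(sequence)
    for mut in mutations:
        wt, pos, new = mut[0], int(mut[1:-1]), mut[-1]
        assert seq[pos - 1] == wt, f"{mut}: expected {wt} at position {pos}"
        seq[pos - 1] = new
    return "".join(seq)

parent = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder, not FOO
deleterious = ["K2P", "T3G"]                  # placeholder mutations
mutant = apply_mutations(parent, deleterious)
print(mutant)
```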

In theory this behavior makes sense, given the training corpus. AF2/OF are trained on PDB structures which, as mentioned above, are a pre-selected set of proteins well behaved enough to have their structures solved in the first place. AF2/OF are not structurally trained on the “bad” proteins that did not express, expressed poorly, or did not crystallize because they are too dynamic or disordered. Their MSA inputs and related training do not “know” much about protein sequences that are not functional in nature. The models do not “know” anything about badly behaved proteins, and therefore they cannot label proteins as poorly behaved. This suggests a path to improvement: include these “bad” protein sequences in the training data.

This image represents simulated data showing approximately the number and distribution of deleterious mutations (spheres) to which AF2/OF did not respond, either in combination or individually.

How can we rescue the performance of these tools for predicting the behavior of deleterious mutations?

The quality and quantity of the MSA data given to an AI folding tool is often directly correlated with the tool’s confidence and accuracy. In this context, we can return to our previous example and ask: why do these tools fail to recognize the structural changes that should be induced by deleterious mutations? Our findings suggest that this behavior is partially due to how tolerant MSA-construction tools are to mismatches, and partially due to the lack of structural training data from proteins that fail to express or function.

To reassess AF2’s ability to predict changes induced by deleterious mutations, in the following figure we asked AF2 about the BAR enzyme (again, this is internal Cyrus IP, so we will anonymize it for the purposes of this blog post), with and without 2 proline mutations in the middle of 2 helices. We found that the unmodified AlphaFold2 tool happily predicted (with relatively high confidence!) that a double proline mutation would have no effect on the local helical structure, and also that the MSAs created in the two AlphaFold2 experiments were nearly identical. If you think of the MSA as the real input to this experiment, we have tasked the tool with predicting two wildly different results given inputs that are >99% similar. Although all your biophysics alarm bells should be going off at full volume after seeing such a result – two Pro mutations will break almost any local structure – we should not place all the blame on AlphaFold2. AlphaFold2 was misled into thinking this was ‘just another protein family MSA’ by the MSA tool, which ignored the mismatch in order to retain its gapless alignment. Since we can’t force AlphaFold2 to read Lehninger’s Principles of Biochemistry, to give it a fighting chance at predicting deleterious mutations we can mask out regions of the MSA to coerce the tool to create predictions based on its internal biophysical intuition alone.
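As a rough illustration of what “masking out regions of the MSA” can look like in practice, here is a sketch that blanks a window of alignment columns around the mutated positions in every sequence except the query. It assumes an aligned FASTA where all sequences share the query length (a3m insertion states are ignored), and the window size is an arbitrary illustrative choice rather than our production protocol:

```python
# Minimal sketch of MSA column masking. We blank out a window of columns
# around the mutated positions in every sequence except the query, so the
# predictor cannot simply copy the homologs at those positions. Assumes an
# aligned FASTA (all sequences the same length as the query); the window
# size is an illustrative choice.

def mask_msa(msa: dict[str, str], center_positions: list[int],
             window: int = 5, query_name: str = "query") -> dict[str, str]:
    columns = set()
    for pos in center_positions:               # 1-based mutation positions
        columns.update(range(pos - 1 - window, pos + window))
    masked = {}
    for name, seq in msa.items():
        if name == query_name:
            masked[name] = seq                 # leave the query untouched
        else:
            masked[name] = "".join(
                "-" if i in columns else aa for i, aa in enumerate(seq))
    return masked

# Tiny usage example with placeholder sequences:
msa = {"query": "MKTAYIAKQR", "homolog1": "MKSAYLAKHR"}
print(mask_msa(msa, center_positions=[5], window=2))
```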

To test this we asked AF2 to again fold the BAR enzyme with and without the 2 proline mutations in the middle of 2 helices, but this time we masked a small region of the MSA near the position of these mutations. We find that in this setting AF2 still produces an accurate model for the native sequence, but with diminished confidence. When AF2 predicts the double proline mutant, however, it completely blows apart the helix and instead predicts an unstructured loop – the result we would expect, short of totally unfolding the protein. In other words, with the MSA masked, AF2/OF now predict a structural shift as a result of these highly damaging Pro mutations, which fits with experimental reality. Both masked-MSA cases do lead to lower-confidence predictions, which suggests that the tool’s confidence may be too much a function of MSA coverage/quality and too little a function of physical plausibility.

Protein language models and protein structure prediction

Meta’s (Facebook) ESMFold tool is also a major player in the AI structure prediction field, with the advantage that it does not require an MSA input, but the disadvantage that its outputs are on average less accurate than the MSA-based outputs of AF2/OF. ESMFold uses a different first step, with a protein language model replacing the MSA step that AF2/OF use; the structure prediction step then uses OpenFold with embeddings from the language model.

We asked two questions about ESMFold. First, given its lack of MSA dependence, does it do any better at the deleterious point mutation experiment? Unfortunately the answer is no, although we are less sure why. Second, given that it is faster than AlphaFold2 or a full OpenFold run (because the language model step is faster than generating MSAs), are the results of lower quality? In our hands, ESMFold generally produces somewhat lower-quality results, as measured by RMSD to known structures – but that lower accuracy tracks neatly with lower reported confidence. ESMFold is thus a tool that is faster, correct (under 2 Ångstroms RMSD) less often, but aware of when it is incorrect (on native sequences) – suggesting it is a fine first-pass tool if you want to look at many native proteins at once.
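For readers who want to try this comparison themselves, ESMFold can be run in a few lines via the fair-esm package; the sketch below follows the usage example in the ESM repository at the time of writing, with a placeholder sequence:

```python
# Sketch of a single ESMFold prediction, following the fair-esm usage example
# (install with: pip install "fair-esm[esmfold]"). The sequence is a
# placeholder; move the model to GPU if one is available.

import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval()
# model = model.cuda()  # optional: run on GPU

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder sequence
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

with open("esmfold_prediction.pdb", "w") as handle:
    handle.write(pdb_string)
```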

Plotted above is pLDDT vs. RMSD for a subset of 67 sequences/structures from CAMEO that are modeled using all three methods. AF2 and OF tend to produce higher pLDDT “confidence” scores even if the structure is rather incorrect (over 5 Ångstroms RMSD). In contrast, ESMFold produces slightly more incorrect structures but those structures have lower pLDDT/“low confidence”, so the model is correctly flagging when the answer is likely wrong.

Protein modifications, glycosylation, and other chemistries

A final limitation to consider with AI structure prediction tools, especially for real-world drug development where post-translational modifications such as PEG or glycosylation are added to produce a clinically useful drug, is the narrowness of their chemical universe. The tools discussed here are unable to deal with modified or noncanonical amino acids, small molecule ligands, nucleic acids, or carbohydrates (glycans, in the protein context). Physics-based algorithms like Rosetta’s Hybridize for homology modeling are no longer cutting edge for simple native proteins, but they are still the best game in town for more complex problems involving more diverse biomolecules. For example, Cyrus has an automated, integrated tool that combines OpenEye software for small molecule physical parameters with Rosetta Hybridize to produce high-quality models of protein/ligand complexes, which we call “Ligand HM”. We are confident that over time the AI tools available in the open-source or publicly disclosed world will fill in some of these gaps, either with AI or with hybrid AI/physics solutions. Baker and Baek described some of these gaps in a short Comment in Nature Methods in 2022, and multiple academic groups are working on a variety of these problems as we speak.

By the same token, classical mutational DDG calculations (the change in folding free energy, delta-G, caused by a point mutation, hence delta-delta-G) are still better at evaluating the effect of single or small numbers of mutations than the AI tools. The AI folding tools tend to get lost in the big picture and ignore the details of a mutation.
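For clarity on the bookkeeping, here is the convention in a few lines of code; the energies are placeholders, and the sign convention (positive ddG means destabilizing) is the common one, though individual tools vary:

```python
# Sketch of the delta-delta-G bookkeeping: both dG values come from the same
# scoring method applied to the folded wild-type and mutant models. Positive
# ddG means the mutation is predicted to destabilize the fold. Numbers are
# placeholders.

def ddg(dG_mutant: float, dG_wildtype: float) -> float:
    """delta-delta-G = dG(mutant) - dG(wild type)."""
    return dG_mutant - dG_wildtype

print(ddg(dG_mutant=-310.0, dG_wildtype=-312.5))  # 2.5 => predicted destabilizing
```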

AI and protein design at the large and small scale

ProteinMPNN has demonstrated extremely impressive results for de novo (no relation to natural proteins) and nearly de novo (some relation to natural proteins) design problems. The commercial application of protein design is currently more grounded in the biochemistry of natural proteins: usually you already have a protein that is nearly right, and you want to modify it minimally to improve some specific behavior, such as making a minimal number of mutations to increase stability, binding affinity to the disease target, or serum half-life. Many problems are easy to cast in terms of mutational DDG on an existing protein, but much more difficult to cast as a fully de novo sequence in a novel scaffold.

We were curious whether the ProteinMPNN “scorefunction” (its confidence) tracked with the same deep mutational scan data as above – can MPNN predict which individual mutations are beneficial and which are harmful? The correlation goes in the right direction, but it is not impressive in comparison with pre-existing Rosetta/physics-based techniques. This experiment is hedged with caveats – for example, the native backbone is probably not accurate for deleterious mutations, so a simple backbone-threaded input may set ProteinMPNN up to perform poorly. ProteinMPNN is not a tool built for judging point mutations, and indeed it does only a mediocre job at that task when applied directly without modification, as we have done here.
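To be transparent about what “the correlation goes in the right direction” means operationally, here is a sketch of the analysis: rank-correlate per-mutation ProteinMPNN scores (for example, per-mutation negative log-likelihoods, however you choose to extract them) against deep-mutational-scan fitness values. The arrays are placeholders; the real FOO data are not shown:

```python
# Minimal sketch of the correlation analysis: compare per-mutation scores from
# ProteinMPNN against deep mutational scanning fitness values with a rank
# correlation. Arrays below are placeholders, not the real FOO data.

import numpy as np
from scipy.stats import spearmanr

mpnn_scores = np.array([1.8, 2.4, 1.1, 3.0, 2.2])  # placeholder per-mutation scores
dms_fitness = np.array([0.9, 0.3, 1.0, 0.1, 0.5])  # placeholder DMS fitness values

rho, pval = spearmanr(mpnn_scores, dms_fitness)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2g})")
```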

The correlation between deep mutational scan scores for protein FOO and MPNN scores for the same sequences is unimpressive, although with some trend in the correct direction.

We have also aimed to test ProteinMPNN on tasks more aggressive than single point mutations but less aggressive than full protein design – for example, redesign of a protein core while leaving the surface intact. The results concerned us in several ways. In this example, we were designing 41 out of 173 residues in the BAZ helical bundle. ProteinMPNN was trained with atom-positional fuzzing in effect, which should make it insensitive to small changes in input backbones. We nevertheless ran this experiment in quintuplicate, with 5 AlphaFold2-generated backbones (BAZ has a crystal structure, but the loops are missing, and AlphaFold2 is a cheap way to fix that particular problem) regularized through Rosetta Relax.
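To make the setup concrete, a partial redesign like this boils down to choosing the positions to design and holding everything else fixed. The sketch below builds that kind of fixed-positions map; the dictionary layout is a generic illustration rather than the exact ProteinMPNN input schema, and the positions are placeholders, not the real BAZ design set:

```python
# Illustrative sketch of specifying a partial redesign: given the chain length
# and the subset of positions chosen for design (e.g. core positions), every
# other position is held fixed. The {chain: [positions]} layout is a generic
# illustration, not necessarily the exact ProteinMPNN input schema; positions
# are placeholders, not the real BAZ design set.

def fixed_positions(chain: str, n_residues: int,
                    design_positions: set[int]) -> dict[str, list[int]]:
    """Return {chain: [1-based positions to keep fixed]}."""
    return {chain: [i for i in range(1, n_residues + 1)
                    if i not in design_positions]}

core_to_design = {10, 14, 17, 21, 24, 28}                 # placeholder core positions
print(len(fixed_positions("A", 173, core_to_design)["A"]))  # 167 positions held fixed
```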

We noted a pair of related problems: first, that ProteinMPNN happily packed 4 glutamate residues facing each other in the core of the protein (again, your biophysics alarm bells ought to be screaming); and second, that it only liked to do this in 2 of the 5 input models. It is plausible that ProteinMPNN thinks some sort of metal binding site ought to go here, which would be the only explanation for charges like this in a protein core. Given that ProteinMPNN does not know what metal ions are, nor can it place them in its sequence output, it’s still worrying. Even if a metal site could go there, a protein design experiment usually has that as an explicit goal, not a random happenstance. Our best hypothesis for the backbone dependence of this effect is that ProteinMPNN is simply not performing well on partial redesigns where “partial” is not the vast majority of the protein. ProteinMPNN’s published native sequence recovery results prove it is perfectly happy to keep native sequence where appropriate, but we assert that the tool only performs well when that is a choice ProteinMPNN is making, rather than one the experimenter forces by removing design positions.  

Above, note the unrealistic quadruple E mutation in the protein core. Below, note that it is common in only 2 of the 5 input backbones, even though all five are low-RMSD to each other and to the native BAZ, and geometrically normal.

From these experiences we conclude that ProteinMPNN should be used only when it can alter nearly all of the protein sequence – in other words, if one allows some large fraction (we are guessing 80%) of a protein to vary, MPNN should perform well. In the authors’ hands, in our hands, and in many experiments from other researchers, it does a great job when nearly an entire protein is open for mutation, even when it chooses to maintain native sequence. If you are not letting it “think about” mutations at most positions, it probably will not work well, and consequently, if your intended use cannot tolerate the possibility of mutations scattered throughout the protein, it may not be the right tool. For project confidentiality reasons we cannot provide detailed data on this point (we can’t even provide a fake name, as we ran out of metasyntactic variables), but for cases of nearly full protein redesign we find results similar to what the authors found: cases where >80% of the protein is mutable still come back with designs where ~50% of the sequence is native, and most of the expressed proteins are both soluble and active.
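For reference, the “~50% of the sequence is native” number above is just position-wise sequence identity between a design and its parent, as in this sketch (sequences are placeholders):

```python
# Sketch of native sequence recovery: position-wise identity between a
# designed sequence and its parent. Sequences below are placeholders.

def native_recovery(designed: str, native: str) -> float:
    assert len(designed) == len(native)
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

print(native_recovery("MKTAYIAKQR", "MKTWYIVKQR"))  # 0.8
```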

Protein design for drug discovery using AI protein design tools

At Cyrus we are incorporating the AF2/OF tools and ProteinMPNN tools into our workflows, alongside our own Rosetta tools and combined AI/Rosetta software we have written for other protein tasks. For example, we are using AF2/OF to validate novel homologs identified by sequence bioinformatics searches for protein targets of interest prior to screening a small number in the lab for enzyme or binding activity. We no longer use ProteinMPNN for small redesign problems, but are using it for exploratory design when over 80% of the residues in a protein sequence can be left free. 

At Cyrus we design for immunogenicity reduction, solubility, lack of aggregation, protein stability, serum stability, serum half-life, binding affinity (to a target protein or substrate), and binding specificity (to a target protein or substrate). For most of those we are using a combination of Rosetta and specialized Rosetta/AI combination tools to introduce a relatively small number of mutations, thereby keeping close to a natural sequence and minimizing chances of adverse outcomes in the clinic. We are gradually increasing our use of ProteinMPNN and derived ProteinMPNN models in certain fusion protein situations.

A large fraction of biologic therapeutic development is on mAb and IgG fold proteins, which have gone unmentioned in this blog post. Cyrus does not currently work on many mAb or IgG protein folds, although in the past (prior to our pivot away from pure software) we created the nextgen Antibody tool, and we have in the past done some projects using the Rosetta antibody design tool for partners. We anticipate that in the specialized mAb/IgG design space new AI tools will make an increasing impact over time, but have not analyzed those in depth here for this blog post, since these tools don’t match our current commercial priorities. 

If you have questions or comments about this post, or would like to discuss a collaboration with Cyrus, please reach out to info@cyrusbio.com. Cyrus pursues internal programs in biologics discovery (current programs are in gout, Fabry disease, gene therapy technology, COVID, and other indications) as well as milestone and royalty based collaborations with other firms (e.g. Selecta), and we are open to new collaborations in indications where our capabilities can help create unique value. 

COI Disclosure: Cyrus Biotechnology develops its own Rosetta, Rosetta/AI, and pure AI algorithms for protein design and structure prediction, and has a financial interest in those algorithms and in the biologic assets developed with those algorithms. Cyrus is a co-founder and Executive Committee member of the OpenFold consortium openfold.io, which is part of the 501c3 OMSF, releasing all code under permissive Apache licenses, therefore neither Cyrus nor other OF members have a direct financial interest in OF software.