Foundation Models for Proteins: Cyrus, OpenFold and the future of biologics
Recent advances in the architecture and scale of AI are leading us from the era of narrowly focused AI (e.g. text auto-complete, immune epitope prediction in a protein sequence, antibody domain detection) to broader models with applications across domains from sales and marketing to medical diagnostics. In protein biochemistry these are the “AIFold” models such as AlphaFold2, RoseTTAFold, and OpenFold, which outperform previous models and physics-based systems at structure prediction.
Scientists at Stanford formalized the concept of foundation models in 2021, describing them as AI models “that are trained on broad data at scale and are adaptable to a wide range of downstream tasks”, much as GPT-3 or BERT work as language models. A key aspect of foundation models is that they can be used without further modification or training to perform novel business functions, but they can also be fine-tuned with additional training data to gain greater performance on a highly specialized task. For example, GPT-3 can be provided samples of an individual’s writing to make an AI model whose writing closely resembles that individual’s. At Cyrus, we see the AIFolds, together with protein design AI and physics-based traditional models, as the foundation models of protein engineering and design, leading to a next generation of new biologics derived either from natural proteins or fully designed from scratch.
In other areas of the software world, foundation models are rapidly becoming cores around which value can be created in individual sectors, such as search, writing assistance, marketing copy, and sales coaching, as described in Forbes. The term foundation model describes both the opportunities and the perils (such as failure in critical use cases or social bias) of the widespread use of tools such as GPT-3. These models are possible because of increases in the scale and speed of GPUs (graphics processing units), and innovations in the practice of AI, chief among them the Transformer model. Transformers solve the problem of context, or “attention” in AI terms. In written language, attention is what allows me to write not just the next word or two in this sentence, but to craft a full sentence, situate it in a paragraph, and plan out the flow of that paragraph in an overall piece of writing. It is a major element in what allows AI to appear intelligent. Without attention an AI model can complete a short phrase, but with a Transformer it can write a convincing paragraph.
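For readers who want to see what attention looks like in practice, here is a minimal sketch of scaled dot-product attention, the core operation inside a Transformer, written in plain Python. The toy token vectors are made-up numbers, and the function names are ours for illustration. Real models use learned projections for the queries, keys and values and run many attention "heads" in parallel on GPUs, so this is a simplified illustration and not how any particular model is implemented.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])  # dimension of each key vector
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy embeddings for a three-token sequence (hypothetical numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` says how much one token "attends" to every other token in the sequence. That is how context from anywhere in a sentence, or anywhere in a protein sequence, can influence the representation of a single position.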
Rosetta, developed at David Baker’s lab at UW Seattle, combines statistics and atomic-scale physics to model proteins and other molecules, and has been the leading model for protein engineering at the atomic scale for two decades (molecular dynamics has been the other dominant model for protein behavior, but not design). Rosetta is not a trained AI system, but instead combines various human-tuned parameters to achieve the highest performance against many benchmarks. Rosetta also achieved many of the firsts in computational protein design, including the first designed protein, Top7, and the first designed protein drug, SKYCovione. Many groups have built on the core of Rosetta to add new functions for certain types of molecules. In this sense Rosetta has served as a foundation model in a pre-Transformer world. Baker’s work was honored with the 2021 Breakthrough Prize in Life Sciences, often considered a precursor to a Nobel Prize for important new science.
In late 2020, Google DeepMind took a big step forward with AlphaFold2, its model for protein structure prediction, borrowing the Transformer concept that had been applied to machine vision and large-scale language models and applying it brilliantly to protein systems. RoseTTAFold, OpenFold and others soon followed with similar or identical architecture and scale, each reaching about 1/1000th the size (number of parameters) of well-known Large Language Models (LLMs) such as GPT-3 and BERT. Scientists at Meta AI separately trained pure protein language models, without structural information, to even larger scale, with ESM2 reaching about 1/10th the size of the LLMs. In language and image modeling the performance of deep learning models has tended to increase with the number of parameters or the number of input data points, and for protein models such as ESM2 it has also been true that new behavior emerges at larger scale.
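To make the size comparison concrete, the short sketch below computes the ratios from approximate, publicly reported parameter counts. These figures are our assumptions and do not come from this article: roughly 175 billion for GPT-3, 15 billion for the largest ESM2 model, and 93 million for AlphaFold2. They should be read as order-of-magnitude estimates only. BERT is much smaller than GPT-3, so the comparison here uses GPT-3 as the reference.

```python
# Approximate, publicly reported parameter counts.
# These are assumptions for illustration, not figures from this article.
PARAMS = {
    "GPT-3": 175e9,
    "ESM2 (largest)": 15e9,
    "AlphaFold2": 93e6,
}

def fraction_of(model, reference="GPT-3"):
    """Return the size of `model` as a fraction of `reference`."""
    return PARAMS[model] / PARAMS[reference]

for name in ("AlphaFold2", "ESM2 (largest)"):
    frac = fraction_of(name)
    print(f"{name}: about 1/{round(1 / frac):,} the size of GPT-3")
```

With these assumed counts the AIFold-scale model lands within a factor of about two of the "1/1000th" figure, and the largest protein language model lands close to "1/10th".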
These AIFold models have exploded in popularity across biology. Now extensions to these models, such as Baek and Baker’s RoseTTAFoldNA for nucleic acids, are focusing on individual types of molecules, such as monoclonal antibodies. The Baker lab has also developed another large-scale or “deep learning” AI system for protein design, ProteinMPNN, which is opening up powerful new approaches to engineering novel proteins and improving existing biologics. Unfortunately, although AlphaFold2 was the first of the new AIFolds, for a variety of technical reasons it cannot be used to train derivative models with new data. For this reason, systems such as RoseTTAFold and OpenFold are more accessible alternatives for the academic and drug discovery communities.
Now, through the creation of OpenFold with Mohammed AlQuraishi’s lab, along with Genentech, Amazon, Arzeda, Cyrus, Outpace, and some AI non-profit organizations, the biopharma and tech communities are creating a trainable foundation model for proteins with state-of-the-art performance that surpasses AlphaFold2 in both speed and memory usage on commercially available NVIDIA chips. AlphaFold2 represented a huge advance, and was honored with the 2023 Breakthrough Prize, but it was trained using a proprietary system on Google’s proprietary TPU chips, which are not available to others. OpenFold, by contrast, is a truly extensible foundation model, tuned for both training and inference on hardware that anyone can buy.
The OpenFold consortium is already working on an extension to the original model, OpenFold-singlesequence, that will incorporate larger sets of protein sequences as input. In the future, the consortium will expand to other types of molecules. The overarching goal of the group is to enable anyone to create derived models from the OpenFold foundation using their own biological data: sequences, structures, or whatever data might be developed in the future.
In many cases there won’t be enough data to train useful models, and as Baker and Baek have pointed out, those cases will benefit from a complementary system such as Rosetta. For example, to model small molecules Rosetta requires molecular modeling software such as RDKit to calculate the physics of the non-protein atoms.
At Cyrus we are embracing all of these future possibilities. We have been working on hybrid Rosetta/AI systems, such as our T-cell epitope deimmunizer, for years. We developed a system to automatically incorporate small-molecule calculations into Rosetta, integrating tools from OpenEye, to create a hybrid system capable of modeling proteins with small molecule drugs. Now with our unique Deep Mutational Scan capabilities in mammalian cells, and collaborations with over 90 biotech companies under our belts, we can begin to extend OpenFold and other foundation models such as ProteinMPNN into future applications. By doing so we’ll improve our ability to create new biologics from natural proteins, and bring new therapeutics to patients across a wide range of indications from infectious disease vaccines and therapeutics, to rheumatology, to rare disease, and to autoimmune disorders.
About Cyrus Biotechnology
Cyrus Biotechnology is a pre-clinical-stage biotech company combining computational and experimental protein design and screening to create novel biologics for serious unmet medical needs. Using this approach, Cyrus is developing an early pipeline of innovative programs in multiple indications. The Cyrus platform improves both the efficacy (binding affinity, aggregation propensity, solubility, and stability) and safety (binding specificity and immunogenicity) of natural proteins. Cyrus is also partnering with leading biotech and pharma companies and research institutes to bring collaborative programs forward from discovery to the clinic. Cyrus is based on core software from the lab of David Baker at the University of Washington. Cyrus has worked with over 90 industry partners. We are based in Seattle, WA and financed by leading US and Asian Biotech and Tech investors including Orbimed, Trinity Ventures, Springrock, Agent Capital, iSelect, Yard Ventures, WRF, and Alexandria.