A Biophysics-based Protein Language Model for Protein Engineering
Summary from Cross Labs AI:
Just as words combine to form sentences that convey meaning in human languages, the specific arrangement of amino acids in proteins can be viewed as an information-rich language describing molecular structure and behavior.
Protein language models harness advances in natural language processing to decode intricate patterns and relationships within protein sequences. These models learn meaningful, low-dimensional representations that capture the semantic organization of protein space and have broad utility in protein engineering. However, while protein language models are powerful, they do not take advantage of the extensive knowledge of protein biophysics and molecular mechanisms acquired over the last century. Thus, they are largely unaware of the underlying physical principles governing protein function.
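To make the representation-learning idea concrete, the sketch below embeds a protein sequence with ESM-2, one of the evolutionary protein language models discussed later in the talk. It is a minimal example, assuming the fair-esm package (`pip install fair-esm`); the specific model, the demo sequence, and the mean-pooling step are illustrative choices, not part of the talk.

```python
# Sketch: extracting a low-dimensional representation of a protein
# sequence from a general-purpose protein language model (ESM-2).
# Assumes `pip install fair-esm`; model choice and pooling are illustrative.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()  # small 6-layer demo model
batch_converter = alphabet.get_batch_converter()
model.eval()

# A short demo sequence; any amino acid string works here.
data = [("demo_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])

# Mean-pool the per-residue embeddings (dropping BOS/EOS tokens)
# into a single fixed-length vector for the whole sequence.
embedding = out["representations"][6][0, 1:-1].mean(dim=0)
print(embedding.shape)  # torch.Size([320]) for this model
```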
We introduce Mutational Effect Transfer Learning (METL), a specialized protein language model that bridges the gap between traditional biophysics-based and machine learning approaches by incorporating synthetic data from molecular simulations. We pretrain a transformer on millions of molecular simulations to capture the relationship between protein sequence, structure, energetics, and stability. We then finetune the neural network to harness these fundamental biophysical signals and apply them when predicting protein functional scores from experimental assays. METL excels in protein engineering tasks like generalizing from small training sets and extrapolating to new sequence positions. We demonstrate METL’s ability to design functional green fluorescent protein variants when trained on only 64 experimental examples.
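The two-stage recipe above, pretraining on simulated biophysical scores and then finetuning on experimental fitness measurements, can be sketched in PyTorch. The sketch below is a minimal illustration with synthetic stand-in data and assumed dimensions, not the authors' implementation; the actual METL code (linked below) uses a full transformer architecture trained on Rosetta-derived energy terms.

```python
# Minimal sketch of METL-style transfer learning in PyTorch.
# Stage 1: pretrain an encoder to regress simulated Rosetta scores.
# Stage 2: swap the head and finetune on experimental fitness data.
# All sizes and datasets below are illustrative placeholders.
import torch
import torch.nn as nn

VOCAB = 21             # 20 amino acids + padding (placeholder)
SEQ_LEN = 237          # e.g. a GFP-length protein (placeholder)
N_ROSETTA_TERMS = 55   # number of simulated energy terms (placeholder)
D_MODEL = 256

class Encoder(nn.Module):
    """Transformer encoder mapping a tokenized sequence to one embedding."""
    def __init__(self, d_model=D_MODEL, n_heads=4, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):                 # tokens: (batch, SEQ_LEN)
        h = self.encoder(self.embed(tokens))   # (batch, SEQ_LEN, d_model)
        return h.mean(dim=1)                   # mean-pool over positions

encoder = Encoder()

# Synthetic stand-ins for the real datasets (illustrative only):
pretrain_data = [(torch.randint(0, VOCAB, (8, SEQ_LEN)),
                  torch.randn(8, N_ROSETTA_TERMS)) for _ in range(4)]
experimental_data = [(torch.randint(0, VOCAB, (8, SEQ_LEN)),
                      torch.randn(8)) for _ in range(2)]

# ---- Stage 1: pretrain on simulated variants and their Rosetta scores ----
pretrain_head = nn.Linear(D_MODEL, N_ROSETTA_TERMS)
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(pretrain_head.parameters()))
for tokens, rosetta_scores in pretrain_data:
    loss = nn.functional.mse_loss(pretrain_head(encoder(tokens)),
                                  rosetta_scores)
    opt.zero_grad(); loss.backward(); opt.step()

# ---- Stage 2: finetune on (possibly very few) experimental scores ----
fitness_head = nn.Linear(D_MODEL, 1)  # fresh head for the assayed function
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(fitness_head.parameters()), lr=1e-4)
for tokens, fitness in experimental_data:
    pred = fitness_head(encoder(tokens)).squeeze(-1)
    loss = nn.functional.mse_loss(pred, fitness)
    opt.zero_grad(); loss.backward(); opt.step()
```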
Links & code
- About the Speaker → samgelman.com
- Check out the preprint
- All code is available under the MIT license. A collection of METL software repositories is provided to reproduce the results of this manuscript and run METL on new data:
- github.com/gitter-lab/metl for pretraining and finetuning METL PLMs (archived at doi:10.5281/zenodo.10819483)
- github.com/gitter-lab/metl-sim for generating biophysical attributes with Rosetta (archived at doi:10.5281/zenodo.10819523)
- github.com/gitter-lab/metl-pretrained for making predictions with pretrained METL PLMs (archived at doi:10.5281/zenodo.10819499); a usage sketch follows this list
- github.com/gitter-lab/metl-pub for additional code and data to reproduce these results (archived at doi:10.5281/zenodo.10819536)
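As a quick orientation to making predictions with a pretrained model, the snippet below shows roughly what using the metl-pretrained package looks like. The model identifier, the truncated wild-type sequence, and the function names (`metl.get_from_ident`, `encode_variants`) are assumptions about the package interface, so consult the metl-pretrained README for the exact, current API.

```python
# Hypothetical usage of the metl-pretrained package; the names below
# are assumptions about its interface, so verify against the repo README.
import torch
import metl  # from github.com/gitter-lab/metl-pretrained

# Load a pretrained METL model with its matching sequence encoder
# (the identifier string is a placeholder).
model, data_encoder = metl.get_from_ident("metl-g-20m-1d")

wt = "MSKGEELFTG"    # truncated placeholder; use the full wild-type sequence
variants = ["K3R"]   # variants written as comma-separated mutations

# Encode variants relative to the wild type and predict their scores.
encoded = data_encoder.encode_variants(wt, variants)
model.eval()
with torch.no_grad():
    predictions = model(torch.tensor(encoded))
print(predictions)
```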
Jump to section
- [3:01] Intro
- [5:03] Proteins as nature’s molecular machines
- [6:58] Proteins defined by a sequence of amino acids
- [10:34] Challenge: Vastness of sequence space
- [11:18] Navigating sequence space
- [11:52] Challenge: Small changes can have a large impact
- [12:51] Protein language models (PLMs)
- [15:57] Incorporating biophysics
- [17:33] Mutational Effect Transfer Learning (METL)
- [19:50] Simulating protein structures with Rosetta
- [21:10] Local and global strategies for simulations
- [24:26] Train transformer encoder to predict Rosetta scores
- [25:38] Finetune to predict experimental fitness score
- [27:10] Evaluation: comparing METL to the evolutionary model baselines ESM and EVE
- [28:26] Generalizing from small datasets
- [31:08] Extrapolating beyond train set
- [33:50] Simulating specific functions
- [35:46] How much simulated/experimental data is needed?
- [38:01] Engineering GFP variants with METL
- [41:40] Q&A