A Biophysics-based Protein Language Model for Protein Engineering
Summary from Cross Labs AI:
Just as words combine to form sentences that convey meaning in human languages, the specific arrangement of amino acids in proteins can be viewed as an information-rich language describing molecular structure and behavior.
Protein language models harness advances in natural language processing to decode intricate patterns and relationships within protein sequences. These models learn meaningful, low-dimensional representations that capture the semantic organization of protein space and have broad utility in protein engineering. However, while protein language models are powerful, they do not take advantage of the extensive knowledge of protein biophysics and molecular mechanisms acquired over the last century. Thus, they are largely unaware of the underlying physical principles governing protein function.
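To make the representation-learning idea concrete, the sketch below embeds a protein sequence with ESM-2, one of the evolutionary protein language models discussed later in the talk. It is a minimal example, assuming the fair-esm package (`pip install fair-esm`); the specific model, the demo sequence, and the mean-pooling step are illustrative choices, not part of the talk.

```python
# Sketch: extracting a low-dimensional representation of a protein
# sequence from a general-purpose protein language model (ESM-2).
# Assumes `pip install fair-esm`; model choice and pooling are illustrative.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()  # small 6-layer demo model
batch_converter = alphabet.get_batch_converter()
model.eval()

# A short demo sequence; any amino acid string works here.
data = [("demo_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[6])

# Mean-pool the per-residue embeddings (dropping BOS/EOS tokens)
# into a single fixed-length vector for the whole sequence.
embedding = out["representations"][6][0, 1:-1].mean(dim=0)
print(embedding.shape)  # torch.Size([320]) for this model
```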
We introduce Mutational Effect Transfer Learning (METL), a specialized protein language model that bridges the gap between traditional biophysics-based and machine learning approaches by incorporating synthetic data from molecular simulations. We pretrain a transformer on millions of molecular simulations to capture the relationship between protein sequence, structure, energetics, and stability. We then finetune the neural network to harness these fundamental biophysical signals and apply them when predicting protein functional scores from experimental assays. METL excels in protein engineering tasks like generalizing from small training sets and extrapolating to new sequence positions. We demonstrate METL’s ability to design functional green fluorescent protein variants when trained on only 64 experimental examples.
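The two-stage recipe above, pretraining on simulated biophysical scores and then finetuning on experimental fitness measurements, can be sketched in PyTorch. The sketch below is a minimal illustration with synthetic stand-in data and assumed dimensions, not the authors' implementation; the actual METL code (linked below) uses a full transformer architecture trained on Rosetta-derived energy terms.

```python
# Minimal sketch of METL-style transfer learning in PyTorch.
# Stage 1: pretrain an encoder to regress simulated Rosetta scores.
# Stage 2: swap the head and finetune on experimental fitness data.
# All sizes and datasets below are illustrative placeholders.
import torch
import torch.nn as nn

VOCAB = 21             # 20 amino acids + padding (placeholder)
SEQ_LEN = 237          # e.g. a GFP-length protein (placeholder)
N_ROSETTA_TERMS = 55   # number of simulated energy terms (placeholder)
D_MODEL = 256

class Encoder(nn.Module):
    """Transformer encoder mapping a tokenized sequence to one embedding."""
    def __init__(self, d_model=D_MODEL, n_heads=4, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):                 # tokens: (batch, SEQ_LEN)
        h = self.encoder(self.embed(tokens))   # (batch, SEQ_LEN, d_model)
        return h.mean(dim=1)                   # mean-pool over positions

encoder = Encoder()

# Synthetic stand-ins for the real datasets (illustrative only):
pretrain_data = [(torch.randint(0, VOCAB, (8, SEQ_LEN)),
                  torch.randn(8, N_ROSETTA_TERMS)) for _ in range(4)]
experimental_data = [(torch.randint(0, VOCAB, (8, SEQ_LEN)),
                      torch.randn(8)) for _ in range(2)]

# ---- Stage 1: pretrain on simulated variants and their Rosetta scores ----
pretrain_head = nn.Linear(D_MODEL, N_ROSETTA_TERMS)
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(pretrain_head.parameters()))
for tokens, rosetta_scores in pretrain_data:
    loss = nn.functional.mse_loss(pretrain_head(encoder(tokens)),
                                  rosetta_scores)
    opt.zero_grad(); loss.backward(); opt.step()

# ---- Stage 2: finetune on (possibly very few) experimental scores ----
fitness_head = nn.Linear(D_MODEL, 1)  # fresh head for the assayed function
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(fitness_head.parameters()), lr=1e-4)
for tokens, fitness in experimental_data:
    pred = fitness_head(encoder(tokens)).squeeze(-1)
    loss = nn.functional.mse_loss(pred, fitness)
    opt.zero_grad(); loss.backward(); opt.step()
```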
Links & code
- About the Speaker → samgelman.com
- Check out the preprint
- All code is available under the MIT license. A collection of METL software repositories is provided to reproduce the results of this manuscript and run METL on new data:
- github.com/gitter-lab/metl for pretraining and finetuning METL PLMs (archived at doi:10.5281/zenodo.10819483)
- github.com/gitter-lab/metl-sim for generating biophysical attributes with Rosetta (archived at doi:10.5281/zenodo.10819523)
- github.com/gitter-lab/metl-pretrained for making predictions with pretrained METL PLMs (archived at doi:10.5281/zenodo.10819499); a usage sketch follows this list
- github.com/gitter-lab/metl-pub for additional code and data to reproduce these results (archived at doi:10.5281/zenodo.10819536)
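As a quick orientation to making predictions with a pretrained model, the snippet below shows roughly what using the metl-pretrained package looks like. The model identifier, the truncated wild-type sequence, and the function names (`metl.get_from_ident`, `encode_variants`) are assumptions about the package interface, so consult the metl-pretrained README for the exact, current API.

```python
# Hypothetical usage of the metl-pretrained package; the names below
# are assumptions about its interface, so verify against the repo README.
import torch
import metl  # from github.com/gitter-lab/metl-pretrained

# Load a pretrained METL model with its matching sequence encoder
# (the identifier string is a placeholder).
model, data_encoder = metl.get_from_ident("metl-g-20m-1d")

wt = "MSKGEELFTG"    # truncated placeholder; use the full wild-type sequence
variants = ["K3R"]   # variants written as comma-separated mutations

# Encode variants relative to the wild type and predict their scores.
encoded = data_encoder.encode_variants(wt, variants)
model.eval()
with torch.no_grad():
    predictions = model(torch.tensor(encoded))
print(predictions)
```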
Jump to section
- [3:01] Intro
- [5:03] Proteins as nature’s molecular machines
- [6:58] Proteins defined by a sequence of amino acids
- [10:34] Challenge: Vastness of sequence space
- [11:18] Navigating sequence space
- [11:52] Challenge: Small changes can have a large impact
- [12:51] Protein language models (PLMs)
- [15:57] Incorporating biophysics
- [17:33] Mutational Effect Transfer Learning (METL)
- [19:50] Simulating protein structures with Rosetta
- [21:10] Local and global strategies for simulations
- [24:26] Train transformer encoder to predict Rosetta scores
- [25:38] Finetune to predict experimental fitness score
- [27:10] Evaluation: comparing METL to the evolutionary model baselines ESM and EVE
- [28:26] Generalizing from small datasets
- [31:08] Extrapolating beyond train set
- [33:50] Simulating specific functions
- [35:46] How much simulated/experimental data is needed?
- [38:01] Engineering GFP variants with METL
- [41:40] Q&A