Introducing Profluent-E1
We are announcing Profluent-E1, our first Retrieval-Augmented Protein Encoder Model, available under a permissive license for both research and commercial use.
November 13, 2025

Proteins are fundamental components of life’s molecular machinery, driving essential biological processes such as catalyzing chemical reactions, coordinating immune responses, and regulating gene expression. Their diverse functions power applications ranging from gene therapies and vaccines to industrial enzymes that support sectors like pharmaceuticals and agriculture. Each protein is encoded as a sequence of amino acids — a 20-letter alphabet that determines both its three-dimensional structure and its function. Protein engineers design or modify these sequences to create proteins with desired properties.

At Profluent, we build Protein Language Models (PLMs) that offer a data-driven framework for modeling the relationships between protein sequence, structure, and function. Our PLMs are trained on billions of proteins and trillions of amino acids from the Profluent Protein Atlas (PPA), the most extensive and carefully curated protein dataset available to date. Through large-scale training, PLMs implicitly learn evolutionary patterns shaped by natural selection over billions of years.

Profluent-E1 is our new family of retrieval-augmented PLMs (RA-PLMs) that enhances the PLM training paradigm by explicitly providing relevant evolutionary context, in the form of homologous protein sequences, during both training and inference. In contrast to ProGen3, which is a generative model, E1 is an encoder model that learns representations of proteins useful for downstream tasks. E1 combines the scale of the PPA with targeted architectural and training innovations to achieve state-of-the-art performance among encoder models trained exclusively on sequence data, both in single-sequence mode (without homologs at inference) and in retrieval-augmented mode.

We are releasing three E1 variants — with 150M, 300M, and 600M parameters — freely for both research and commercial use, enabling immediate application to tasks such as fitness prediction, structure prediction, and representation learning.
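
To make that concrete, here is a minimal usage sketch. It assumes, purely for illustration, that the released checkpoints can be loaded through a Hugging Face transformers-compatible interface; the repository name below is a placeholder, and the official model cards are the authoritative reference.

```python
# Minimal sketch: extracting per-residue representations from an E1 checkpoint.
# Assumes a Hugging Face transformers-compatible interface; the repository
# name is a placeholder, not a confirmed identifier.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "Profluent-Bio/Profluent-E1-600M"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example protein
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings that can feed downstream tasks such as fitness
# prediction or structure-related probes.
embeddings = outputs.last_hidden_state
```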

E1 Architecture. The E1 model can take homologous protein sequences as input in addition to a sequence of interest. The homologous sequences are prepended to the sequence of interest to construct a multi-sequence input to the model. E1 alternates between intra-sequence and block-causal attention, enabling it to build internal representations that draw not only on amino acids within the same protein sequence, but also on amino acids from other homologous sequences.
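
To make the attention pattern concrete, the sketch below (not Profluent's implementation) builds the two boolean masks for a concatenated multi-sequence input, where each token is tagged with the index of the sequence it belongs to: intra-sequence layers use a block-diagonal mask, while block-causal layers additionally let each token attend to all prepended homologs.

```python
# Sketch of the two masking patterns described above, assuming sequences are
# concatenated and each token carries the index of its source sequence.
import torch

def intra_sequence_mask(seq_ids: torch.Tensor) -> torch.Tensor:
    """True where token i may attend to token j: only within the same
    protein sequence (a block-diagonal mask)."""
    return seq_ids[:, None] == seq_ids[None, :]

def block_causal_mask(seq_ids: torch.Tensor) -> torch.Tensor:
    """True where token i may attend to token j: its own sequence plus all
    earlier (prepended homolog) sequences in the concatenation."""
    return seq_ids[:, None] >= seq_ids[None, :]

# Example: two homologs of length 3 prepended to a query sequence of length 4.
seq_ids = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
print(intra_sequence_mask(seq_ids).int())
print(block_causal_mask(seq_ids).int())
```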

Setting a New Bar for Protein Encoder Models

E1 achieves state-of-the-art zero-shot performance compared to other publicly available PLMs in both sequence-only and retrieval-augmented modes, and performance improves as model size scales. Left: Fitness prediction performance on ProteinGym Substitution DMS Assays. Right: Structure understanding as measured by unsupervised contact map prediction on CAMEO.

PLMs have been shown to be effective zero-shot fitness predictors: they can accurately assess the impact of specific mutations on a protein's overall fitness (e.g., activity, expression, stability). To evaluate E1's fitness prediction capabilities, we used the ProteinGym (v1.3) benchmark, which includes 217 experimental deep mutational scanning assays. At comparable model sizes, E1 outperforms all ESM-2 and ESM-C models in single-sequence mode, demonstrating that E1 can serve as a drop-in replacement for existing single-sequence encoder models (see left figure above). Retrieval augmentation further boosts E1's accuracy, enabling it to surpass publicly available retrieval-augmented models such as MSA Pairformer and PoET.
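
For readers unfamiliar with this evaluation, the snippet below sketches the masked-marginal recipe commonly used to turn a masked protein language model into a zero-shot mutation scorer. The `model` interface is a placeholder for illustration, not E1's actual API: it is assumed to map a tokenized sequence (with a masked position) to per-position log-probabilities over the amino acids.

```python
# Sketch of masked-marginal zero-shot mutation scoring for ProteinGym-style
# assays. The `model` callable is a placeholder interface, assumed to return
# one {amino_acid: log_prob} dict per position of the input token list.
from typing import Callable, Dict, List, Tuple

def masked_marginal_score(
    model: Callable[[List[str]], List[Dict[str, float]]],
    wild_type: str,
    mutations: List[Tuple[str, int, str]],  # e.g. [("A", 41, "G")] is A42G (0-based position 41)
) -> float:
    """Score a (multi-)mutant as sum_i [log p(mut_i | masked) - log p(wt_i | masked)]."""
    score = 0.0
    for wt_aa, pos, mut_aa in mutations:
        assert wild_type[pos] == wt_aa, "mutation does not match the wild type"
        masked = list(wild_type)
        masked[pos] = "<mask>"               # mask the mutated site
        log_probs = model(masked)[pos]       # log-probs at the masked position
        score += log_probs[mut_aa] - log_probs[wt_aa]
    return score
```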

We also assessed E1’s ability to capture protein structural information via long-range contact prediction, using sequences from CAMEO and CASP15 targets. In single-sequence mode, E1 outperforms the ESM family across model scales (see right figure above). Retrieval augmentation provides additional consistent gains, showing that E1 effectively leverages homologous sequences at inference time to better infer putative 3D contacts.
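
As a rough illustration of how contact maps are typically read out of a sequence-only model, the sketch below applies the common recipe of symmetrizing self-attention maps and applying the average product correction (APC). It is a generic recipe shown for intuition, not a description of E1's exact contact readout.

```python
# Generic attention-based contact readout: average heads, symmetrize, and
# subtract the average product correction (APC) to remove background effects.
import torch

def attention_to_contacts(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_maps, L, L) attention weights (e.g. all layers/heads stacked)
    for one sequence of length L. Returns an (L, L) matrix of contact scores."""
    a = attn.mean(dim=0)                 # average over attention maps
    a = 0.5 * (a + a.T)                  # symmetrize
    row = a.sum(dim=0, keepdim=True)     # per-column totals, shape (1, L)
    col = a.sum(dim=1, keepdim=True)     # per-row totals, shape (L, 1)
    apc = (row * col) / a.sum()          # average product correction
    return a - apc

# Long-range precision@L is then computed on residue pairs with |i - j| >= 24.
```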

Taken together, the Profluent-E1 family demonstrates the continued value of research in improving protein language models. E1 provides a new foundational tool for AI-driven protein design, improving predictive performance and delivering practical utility across a broad range of protein engineering workflows.
