Introducing Profluent-E1
We are announcing Profluent-E1, our first Retrieval-Augmented Protein Encoder Model, available under a permissive license for both research and commercial use.
November 13, 2025

Proteins are fundamental components of life’s molecular machinery, driving essential biological processes such as catalyzing chemical reactions, coordinating immune responses, and regulating gene expression. Their diverse functions power applications ranging from gene therapies and vaccines to industrial enzymes that support sectors like pharmaceuticals and agriculture. Each protein is encoded as a sequence of amino acids — a 20-letter alphabet that determines both its three-dimensional structure and its function. Protein engineers design or modify these sequences to create proteins with desired properties.

At Profluent, we build Protein Language Models (PLMs) that offer a data-driven framework for modeling the relationships between protein sequence, structure, and function. Our PLMs are trained on billions of proteins and trillions of amino acids from the Profluent Protein Atlas (PPA), the most extensive and carefully curated protein dataset available to date. Through large-scale training, PLMs implicitly learn evolutionary patterns shaped by natural selection over billions of years.

Profluent-E1 is our new family of retrieval-augmented PLMs (RA-PLMs) that enhances the PLM training paradigm by explicitly providing relevant evolutionary context, in the form of homologous protein sequences, during both training and inference. In contrast to ProGen3, which is a generative model, E1 is an encoder model that learns representations of proteins useful for downstream tasks. E1 combines the scale of the PPA with targeted architectural and training innovations to achieve state-of-the-art performance among encoder models trained exclusively on sequence data, both in single-sequence mode (without homologs at inference) and in retrieval-augmented mode.

We are releasing three E1 variants — with 150M, 300M, and 600M parameters — freely for both research and commercial use, enabling immediate application to tasks such as fitness prediction, structure prediction, and representation learning.
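
To make that concrete, here is a minimal usage sketch. It assumes, purely for illustration, that the released checkpoints can be loaded through a Hugging Face transformers-compatible interface; the repository name below is a placeholder, and the official model cards are the authoritative reference.

```python
# Minimal sketch: extracting per-residue representations from an E1 checkpoint.
# Assumes a Hugging Face transformers-compatible interface; the repository
# name is a placeholder, not a confirmed identifier.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "Profluent-Bio/Profluent-E1-600M"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example protein
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings that can feed downstream tasks such as fitness
# prediction or structure-related probes.
embeddings = outputs.last_hidden_state
```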

E1 Architecture. The E1 model can take homologous protein sequences as input in addition to a sequence of interest. The homologous sequences are prepended to the sequence of interest to construct a multi-sequence input to the model. E1 alternates between intra-sequence and block-causal attention, enabling it to build internal representations that draw not only on amino acids within the same protein sequence, but also on amino acids from other homologous sequences.
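
To make the attention pattern concrete, the sketch below (not Profluent's implementation) builds the two boolean masks for a concatenated multi-sequence input, where each token is tagged with the index of the sequence it belongs to: intra-sequence layers use a block-diagonal mask, while block-causal layers additionally let each token attend to all prepended homologs.

```python
# Sketch of the two masking patterns described above, assuming sequences are
# concatenated and each token carries the index of its source sequence.
import torch

def intra_sequence_mask(seq_ids: torch.Tensor) -> torch.Tensor:
    """True where token i may attend to token j: only within the same
    protein sequence (a block-diagonal mask)."""
    return seq_ids[:, None] == seq_ids[None, :]

def block_causal_mask(seq_ids: torch.Tensor) -> torch.Tensor:
    """True where token i may attend to token j: its own sequence plus all
    earlier (prepended homolog) sequences in the concatenation."""
    return seq_ids[:, None] >= seq_ids[None, :]

# Example: two homologs of length 3 prepended to a query sequence of length 4.
seq_ids = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
print(intra_sequence_mask(seq_ids).int())
print(block_causal_mask(seq_ids).int())
```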

Setting a New Bar for Protein Encoder Models

E1 achieves state-of-the-art zero-shot performance compared to other publicly available PLMs in both sequence-only and retrieval-augmented modes, and performance improves as model size scales. Left: Fitness prediction performance on ProteinGym Substitution DMS Assays. Right: Structure understanding as measured by unsupervised contact map prediction on CAMEO.

PLMs have been shown to be effective zero-shot fitness predictors: they can accurately assess the impact of specific mutations on a protein's overall fitness (e.g., activity, expression, stability). To evaluate E1's fitness prediction capabilities, we used the ProteinGym (v1.3) benchmark, which includes 217 experimental deep mutational scanning assays. At comparable model sizes, E1 outperforms all ESM-2 and ESM-C models in single-sequence mode, demonstrating that E1 can serve as a drop-in replacement for existing single-sequence encoder models (see left figure above). Retrieval augmentation further boosts E1's accuracy, enabling it to surpass publicly available retrieval-augmented models such as MSA Pairformer and PoET.
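
For readers unfamiliar with this evaluation, the snippet below sketches the masked-marginal recipe commonly used to turn a masked protein language model into a zero-shot mutation scorer. The `model` interface is a placeholder for illustration, not E1's actual API: it is assumed to map a tokenized sequence (with a masked position) to per-position log-probabilities over the amino acids.

```python
# Sketch of masked-marginal zero-shot mutation scoring for ProteinGym-style
# assays. The `model` callable is a placeholder interface, assumed to return
# one {amino_acid: log_prob} dict per position of the input token list.
from typing import Callable, Dict, List, Tuple

def masked_marginal_score(
    model: Callable[[List[str]], List[Dict[str, float]]],
    wild_type: str,
    mutations: List[Tuple[str, int, str]],  # e.g. [("A", 41, "G")] is A42G (0-based position 41)
) -> float:
    """Score a (multi-)mutant as sum_i [log p(mut_i | masked) - log p(wt_i | masked)]."""
    score = 0.0
    for wt_aa, pos, mut_aa in mutations:
        assert wild_type[pos] == wt_aa, "mutation does not match the wild type"
        masked = list(wild_type)
        masked[pos] = "<mask>"               # mask the mutated site
        log_probs = model(masked)[pos]       # log-probs at the masked position
        score += log_probs[mut_aa] - log_probs[wt_aa]
    return score
```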

We also assessed E1’s ability to capture protein structural information via long-range contact prediction, using sequences from CAMEO and CASP15 targets. In single-sequence mode, E1 outperforms the ESM family across model scales (see right figure above). Retrieval augmentation provides additional consistent gains, showing that E1 effectively leverages homologous sequences at inference time to better infer putative 3D contacts.
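
As a rough illustration of how contact maps are typically read out of a sequence-only model, the sketch below applies the common recipe of symmetrizing self-attention maps and applying the average product correction (APC). It is a generic recipe shown for intuition, not a description of E1's exact contact readout.

```python
# Generic attention-based contact readout: average heads, symmetrize, and
# subtract the average product correction (APC) to remove background effects.
import torch

def attention_to_contacts(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_maps, L, L) attention weights (e.g. all layers/heads stacked)
    for one sequence of length L. Returns an (L, L) matrix of contact scores."""
    a = attn.mean(dim=0)                 # average over attention maps
    a = 0.5 * (a + a.T)                  # symmetrize
    row = a.sum(dim=0, keepdim=True)     # per-column totals, shape (1, L)
    col = a.sum(dim=1, keepdim=True)     # per-row totals, shape (L, 1)
    apc = (row * col) / a.sum()          # average product correction
    return a - apc

# Long-range precision@L is then computed on residue pairs with |i - j| >= 24.
```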

Taken together, the Profluent-E1 family demonstrates the continued value of research in improving protein language models. E1 provides a new foundational tool for AI-driven protein design, improving predictive performance and delivering practical utility across a broad range of protein engineering workflows.
