\equalcont

These authors contributed equally to this work. [1]\fnmMingchen \surLi [1]\fnmHuiqun \surYu

[1]\orgdivSchool of Information Science and Engineering, \orgnameEast China University of Science and Technology, \orgaddress\streetStreet, \cityShanghai, \postcode200237, \countryChina

CPE-Pro: A Structure-Sensitive Deep Learning Model for Protein Representation and Origin Evaluation

\fnmWenrui \surGou gwr@mail.ecust.edu.cn \fnmWenhui \surGe gwh@mail.ecust.edu.cn \fnmYang \surTan tyang@mail.ecust.edu.cn \fnmGuisheng \surFan gsfan@ecust.edu.cn lmc@mail.ecust.edu.cn yhq@ecust.edu.cn *

Abstract

Protein structures are important for understanding their functions and interactions. Currently, many protein structure prediction methods are enriching the structure database. Discriminating the origin of structures is crucial for distinguishing between experimentally resolved and computationally predicted structures, evaluating the reliability of prediction methods, and guiding downstream biological studies. Building on work in structure prediction, we developed a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro), to represent and discriminate the origin of protein structures. CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes, and is expected to be extended to more. In addition, we used Foldseek to encode protein structures into “structure-sequences” and trained a protein Structural Sequence Language Model, SSLM. Preliminary experiments demonstrated that, compared to large-scale protein language models pre-trained on vast amounts of amino acid sequences, “structure-sequences” enable the language model to learn more informative protein features, enhancing and optimizing structural representations. We have provided the code, model weights, and all related materials at https://github.com/GouWenrui/CPE-Pro-main.git.

keywords:

Deep learning, Structure Representation, Structure-sequence, Origin prediction

1 Introduction

The function of a protein is determined by its folded structure; therefore, characterizing protein structure is crucial for understanding and studying its biological functions. Protein folding is a complex and highly coordinated process that determines the active sites, ligand-binding capabilities, and functional roles of the protein both inside and outside the cell.[1] Even slight alterations in structure can lead to significant changes in protein function, potentially resulting in disease. Consequently, a comprehensive understanding of protein folding is not only essential for elucidating its biological functions but also serves as a guide for research and applications in drug design, disease diagnosis, and treatment.

However, experimentally determining protein structures poses numerous challenges. Despite significant efforts by scientists over the past few decades, the number of experimentally determined structures reported in the Protein Data Bank (PDB)[2] remains far lower than the number of known protein sequences, creating a substantial data gap. The limited availability of protein structure data significantly hinders a comprehensive understanding of protein functions and interaction mechanisms. To address this limitation, researchers have increasingly turned to computational methods, such as protein folding simulations (AMBER[3], GROMACS[4]), to guide the determination of protein structures. One of the most representative methods is homology modeling[5, 6]. However, homology modeling relies on the similarity to known structures and cannot be accurately applied to proteins without known similar structures. While these computational methods somewhat alleviate the issue of insufficient experimental data, they still do not fully resolve the broader applicability of structure prediction, particularly for novel or complex proteins, where significant limitations remain.

As artificial intelligence technology advances, an increasing number of supervised[7, 8] and unsupervised[9, 10, 11] deep learning models are showing promise in the field of protein research, significantly enhancing the efficiency and accuracy of structure prediction. However, despite the proliferation of structure prediction models[12, 13, 14], a comprehensive evaluation of the differences and distinguishability between predicted and experimentally determined structures has yet to be undertaken. Significant discrepancies may still exist between prediction results and experimental findings, arising from the limitations of prediction models in handling the complex dynamics and interactions involved in protein folding. For example, models may fail to capture subtle changes in the protein folding process or perform poorly when dealing with proteins that have not been extensively studied. Ignoring these differences could lead to misunderstandings of protein function and behavior, potentially negatively impacting subsequent biological applications. Additionally, malicious injection in structural databases may further exacerbate inaccuracies in model downstream performance, leading to a decrease in the reliability of prediction results. Therefore, it is crucial to systematically evaluate the differences between predicted and experimental structures and to develop effective methods to distinguish the origin of protein structures. Our research aims to address this gap and propose a general and reliable solution.

Our contributions are as follows:

•

We introduce CPE-Pro, a model that excels in distinguishing between crystal and predicted protein structures by learning structural features.
•

Using the CATH 4.3.0[15] non-redundant dataset, we create the protein folding dataset CATH-PFD with multiple prediction models.
•

Preliminary experiments indicate that, compared to amino acid sequences, “structure-sequences” enable language models to learn more effective protein feature information, enriching and optimizing structural representations when integrated with graph embeddings.
•

We have open-sourced the code, model weights, and CATH-PFD dataset for CPE-Pro, providing valuable resources for protein structure research.

2 Related Works

2.1 Protein Representation Learning

In various biological tasks, such as predicting protein functions or the effects of mutations, it is essential to learn effective protein representations. Protein representation methods can be classified into three approaches based on modality: sequence-based, structure-based, and a combination of sequence and structure. Figure 1.a shows various protein representation methods.

Sequence-based. With the rise of deep learning and advances in high-throughput sequencing technologies, data-driven methods are gradually replacing traditional analyses based on biological or physical priors. Protein sequences can be viewed as a form of “biological text”, and convolutional neural networks[16] can directly capture local dependencies between amino acids. Techniques from natural language processing are also widely applied in protein representation learning. Models for individual protein sequence modeling include Variational Auto-Encoders (VAE)[17], long short-term memory networks (LSTM)[18], and large pre-trained protein language models (PLMs) like ESM-1b[9] based on the Transformer architecture[19]. Some studies have employed GPT[20]-based architectures for sequence modeling, using generative pre-training to represent protein sequences and make predictions, such as ProGen2[21] and ProtGPT2[22]. Compared to single sequences, multiple sequence alignment (MSA)[23] input aims to capture co-evolutionary information from a set of evolutionarily related sequences. In this context, the MSA Transformer[24] utilizes row and column attention mechanisms to model a set of protein sequences, allowing it to simultaneously consider inter-sequence relationships and conservation. Additionally, some research integrates protein sequences with other types of information. ProteinBERT[25], for example, converts gene ontology annotations into fixed-size binary vectors that are input jointly with sequences, enabling the model to leverage the connections between sequence and functional annotations.

Structure-based. While sequence-based research methods have been shown to be effective in several studies, the structural information of proteins is a critical determinant of their function. Models based on graph neural networks (GNNs) demonstrate significant advantages and broad applicability in representing protein three-dimensional structures. For example, ProteinGCN[26] constructs spatially adjacent protein graphs and is trained on tasks related to both local and global accuracy in protein modeling. Pre-trained protein representation methods[27] based on three-dimensional structures represent proteins as residue graphs, where nodes correspond to the three-dimensional coordinates of $\alpha$ carbons and edges represent relationships between residues. In modeling protein secondary structures, these are often transformed into secondary structure sequences, with $\alpha$ -helices, $\beta$ -sheets, and random coils represented as token sequences. DeepSS2GO[28] utilizes one-hot matrices to model secondary structures for predicting key protein functions. SES-Adapter[29] enhances the performance of PLMs in downstream tasks by integrating secondary structure sequence embeddings with other types of embeddings through a cross-attention mechanism.

Integrating Sequence and Structure. Combining protein sequence and structural information not only considers the sequential characteristics of amino acids but also reveals their spatial interactions and arrangements. In this regard, DeepFRI[30] combines LSTM and graph convolutional networks (GCNs)[31] to jointly learn complex structure-function relationships. LM-GVP[32] modifies the input variables of GVP[33] by using sequence embeddings generated by ProteinBERT as input. ESM-GearNet[34] designs various fusion methods between PLMs and GearNet to investigate the effectiveness of different fusion strategies. ProtSSN[35] combines ESM2[11] with equivariant graph neural networks (E-GNNs)[36] to extract geometric features of proteins, aiming to accurately predict biological activity and thermal stability. ProstT5[37] builds on the ProtT5-XL-U50 model[39], introduces the 3Di (3D interaction) alphabet used by Foldseek[38], and is trained to translate between amino acid sequences and 3Di sequences in both directions. Additionally, SaProt[40] creates a larger structure-aware vocabulary by combining amino acid and 3Di tokens and is pre-trained on a large-scale protein dataset with a masked language modeling task. Compared to previous work, the SSLM in CPE-Pro uses only the “structure-sequences” built from the 3Di alphabet for pre-training on Swiss-Prot, without relying on amino acid labels, and effectively learns protein representations by integrating existing structural information with GVP.

Figure 1: a. Protein representation methods. Proteins can be input into the model in various forms, including amino acid sequences, feature maps, three-dimensional coordinates, functional descriptions, and sequences composed of structural tokens, capturing the multi-level features of proteins. b. Pre-training of SSLM. SSLM is pre-trained on over 100,000 protein structures from the Swiss-Prot database and trained on various masked language modeling tasks, learning the relationships between “structure-sequences” and their corresponding three-dimensional structural features, thereby effectively representing protein structural information. c. CPE-Pro model architecture. The CPE-Pro model integrates a pre-trained protein structure language model with a graph embedding module, inputting the combined representation into the GVP-GNN module for computation. The pooling module aggregates structural information using attention masking, enhancing the quality of the representation. Ultimately, a multilayer perceptron serves as the source discriminator, outputting predicted probabilities.

2.2 Protein Structure Prediction

Amino acids form the linear sequence of proteins, but they acquire activity and biological function only when folded into specific spatial conformations. Early structure prediction methods[41, 42] primarily relied on sequence similarity, utilizing the homology of known protein sequences for predictions. These methods infer the structure of the target protein by aligning the sequence of interest with known homologous sequences and using the structural information from these homologs. Compared to sequence-based methods, structure-based learning approaches theoretically offer better solutions for acquiring protein information. In recent years, advancements in deep learning technologies have led to breakthrough progress in protein structure prediction.

In the study of secondary structure prediction, DeepCNF[43] integrates conditional random fields (CRFs)[44] and shallow neural networks to successfully model the interdependencies between adjacent secondary structure labels. SSREDNs[45] leverage deep and recurrent structures to simulate the complex nonlinear mapping relationships between input protein features and secondary structures, while also capturing interactions among consecutive residues in the protein chain.

More research has focused on three-dimensional structure prediction. In this context, trRosetta [12] uses co-evolution data and deep residual networks to predict inter-residue orientations and distances, which then serve as restraints in Rosetta energy minimization protocols. trRosettaX-single[46] focuses on the structure prediction of single-chain proteins. Regarded as a “milestone,” AlphaFold2 [13] significantly enhances the accuracy and breadth of protein structure prediction by embedding multiple sequence alignments and paired features through Evoformer. OmegaFold [47] employs GCNs and self-attention [19] to effectively capture both global and local features of protein sequences, excelling in handling proteins of varying lengths and complexities. ESMFold [14] adopts a large-scale Transformer architecture, trained on extensive protein sequence datasets, to extract deep evolutionary features from sequence information without relying on multiple sequence alignments, demonstrating exceptional performance in predicting new structural domains and distant homologs. Leveraging the predictions from these models will provide robust support for biomedical research, drug development, and further advancements across various scientific domains.

3 Materials & Methods

3.1 Dataset

In this study, we utilized the non-redundant dataset from version 4.3.0 of the CATH (Class, Architecture, Topology, Homologous superfamily) database as our benchmark dataset. CATH is a significant protein structure classification database that categorizes proteins based on their structural and functional features through a systematic hierarchy. The underlying experimentally determined, high-resolution three-dimensional structures are stored in the Protein Data Bank (PDB). The non-redundant dataset has a sequence identity filtering threshold of 40%, ensuring no high sequence similarity between structural domains. Because redundant protein structures are removed, each entry in the dataset represents a distinct structure.

We extracted the amino acid sequences of proteins from the benchmark dataset. Using multiple state-of-the-art protein structure prediction models, we predicted the structures corresponding to these amino acid sequences. These structures were organized and categorized based on individual proteins and prediction models to construct a Protein Folding Dataset, CATH-PFD, which will be used for training and validating CPE-Pro. Table 1 presents detailed information about the dataset CATH-PFD.

Table 1: The information on CATH-PFD

Model/Origin	Number	pLDDT(%)	Method to Obtain
CATH	31,885	–	Download from https://www.cathdb.info/
AlphaFold2	31,885	92.4	Predicted using the ColabFold[48]
OmegaFold	31,881	82.7	Predicted using the OmegaFold
ESMFold	23,912	76.2	Predicted using the ESMFold
\botrule

Note: The number of structures from different origins, their pLDDT scores, and acquisition methods. The CATH database contains 31,885 protein structures. For structure prediction, we deployed OmegaFold, ESMFold, and the efficient AlphaFold2 variant, ColabFold, using publicly available resources from previous studies. Unpredictable or erroneous protein structure results were excluded during the prediction process, leading to differences in the number of entries across various categories.

3.2 Model Architecture

Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro), designed for discriminating structural origins, integrates two distinct structure encoders corresponding to graphical and sequential representations of the structures. Figure 1.c illustrates the detailed architecture of CPE-Pro.

CPE-Pro implements Geometric Vector Perceptrons – Graph Neural Networks (GVP-GNN)[49], which learn dual relationships and geometric representations of three-dimensional macromolecular structures as part of the protein structural encoder.

When extracting the three-dimensional structure of a protein, we focus on a specific chain and ignore the parts that are not of interest. We concentrate on the coordinates of the key atoms that constitute the protein backbone (N, CA, and C), since these atoms are necessary for understanding the protein structure. We then extract the geometric information of nodes and edges from the raw data and compute features such as the distances and directions between them. For a protein graph $G=(V,E)$ , where $V$ represents the set of nodes and $E$ represents the set of edges, each node $v\in V$ represents a residue of the protein, and each edge $e\in E$ represents an interaction or spatial proximity relationship between residues. For the $i$-th residue, the feature consists of scalars and vectors, i.e., $\mathcal{R}_{v}^{i}=(r_{s}^{i},r_{vec}^{i})$ . An embedding layer computes embeddings for the protein feature $\mathcal{R}_{v}$ , i.e.,

\begin{equation}
\mathcal{R}_{v}^{\prime}=\mathrm{GVP}(r_{s},r_{vec}). \tag{1}
\end{equation}
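The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes edges are drawn to the `k` nearest residues by CA distance (the source does not state how edges are chosen), and the function name `knn_protein_graph` and the default `k` are hypothetical.

```python
import numpy as np

def knn_protein_graph(ca_coords, k=3):
    """Build a directed k-nearest-neighbour residue graph from CA coordinates.

    Returns (edge_index, distances, directions). Each edge runs from a
    neighbour j (source) to residue i (target). Assumes n > k residues
    with distinct coordinates.
    """
    ca = np.asarray(ca_coords, dtype=float)                # (n, 3)
    dist = np.linalg.norm(ca[None, :, :] - ca[:, None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                         # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]                 # k nearest j per i
    dst = np.repeat(np.arange(len(ca)), k)                 # target residue i
    src = nbrs.reshape(-1)                                 # source residue j
    d = dist[dst, src]                                     # scalar edge feature
    unit = (ca[src] - ca[dst]) / d[:, None]                # unit vector i -> j
    return np.stack([src, dst]), d, unit
```

The scalar distances and unit direction vectors returned here correspond to the "distances and directions" mentioned in the text; any further featurization (for example radial basis expansion of distances) is left out of this sketch.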

The GVP (·) layer mainly carries out scalar-vector propagations, i.e.,

\begin{equation}
r_{s}^{\prime}=\sigma_{1}\left(W_{1}\cdot \mathrm{Concat}\left(r_{s},\mathrm{Norm}\left(W_{2}\cdot r_{vec}\right)\right)+b_{1}\right). \tag{2}
\end{equation}

\begin{equation}
r_{vec}^{\prime}=W_{3}\cdot\left(W_{2}\cdot r_{vec}\right)\odot\sigma_{2}\left(W_{4}\cdot r_{s}^{\prime}+b_{2}\right). \tag{3}
\end{equation}

Here, $W_{1}$ , $W_{2}$ , $W_{3}$ , $W_{4}$ , $b_{1}$ , $b_{2}$ are learnable parameters for this layer, $\sigma_{1}(\cdot)$ and $\sigma_{2}(\cdot)$ denote activation functions. In each graph propagation step, messages from neighboring nodes and edges in the protein graph update the embedding of the current node,

\begin{equation}
r_{msg}^{j\rightarrow i}:=\mathrm{GVP}_{conv}\left(\mathrm{Concat}\left(r_{v}^{j},r_{e}^{j\rightarrow i}\right)\right). \tag{4}
\end{equation}

\begin{equation}
r_{v}^{i}\leftarrow \mathrm{LayerNorm}\left(r_{v}^{i}+\frac{1}{k}\mathrm{Drop}\left(\sum_{j:\,e_{ji}\in E}r_{msg}^{j\rightarrow i}\right)\right). \tag{5}
\end{equation}

Here, $r_{v}^{i}$ and $r_{e}^{j\rightarrow i}$ are the embeddings of node $i$ and edge $j\rightarrow i$, $r_{msg}^{j\rightarrow i}$ represents the message passed from node $j$ to node $i$, and $k$ is the number of incoming messages. $GVP_{conv}$ denotes a stack of three GVP layers. We also add a feed-forward layer to update the node embeddings at all nodes $i$:

\begin{equation}
r_{v}^{i}\leftarrow \mathrm{LayerNorm}\left(r_{v}^{i}+\mathrm{Drop}\left(\mathrm{GVP}_{conv}^{\prime}\left(r_{v}^{i}\right)\right)\right). \tag{6}
\end{equation}

The $GVP_{conv}^{\prime}$ is a stacked 2-layer $GVP$ . The GVP-GNN block is formed by stacking $GVP$ convolution and feed-forward transformations, as defined in Eq. (4)–(6). To enhance node representations, this block is iterated $T$ times, with $T=3$ specified for our experiments.
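To make Eqs. (2) and (3) concrete, the following NumPy sketch implements a single GVP layer. It is an illustration under stated assumptions rather than the authors' code: $\sigma_{1}$ is taken to be ReLU and $\sigma_{2}$ to be the sigmoid, $\mathrm{Norm}$ is read as the row-wise Euclidean norm of the projected vectors, and the function name and weight shapes are hypothetical.

```python
import numpy as np

def gvp_layer(r_s, r_vec, W1, b1, W2, W3, W4, b2):
    """One GVP layer following Eqs. (2)-(3).

    r_s:   (n, s_in)      scalar features per node
    r_vec: (n, v_in, 3)   vector features per node
    W1: (s_out, s_in + h), W2: (h, v_in), W3: (v_out, h), W4: (v_out, s_out)
    """
    h = np.einsum("hv,nvc->nhc", W2, r_vec)            # W2 . r_vec, (n, h, 3)
    norms = np.linalg.norm(h, axis=-1)                 # rotation-invariant scalars
    s_in = np.concatenate([r_s, norms], axis=-1)       # Concat(r_s, Norm(...))
    s_out = np.maximum(s_in @ W1.T + b1, 0.0)          # Eq. (2), sigma_1 = ReLU (assumed)
    gate = 1.0 / (1.0 + np.exp(-(s_out @ W4.T + b2)))  # sigma_2 = sigmoid (assumed)
    v_out = np.einsum("oh,nhc->noc", W3, h) * gate[:, :, None]  # Eq. (3)
    return s_out, v_out
```

Because the vector channels are only mixed linearly and gated by rotation-invariant scalars, rotating the input vectors leaves the scalar output unchanged and rotates the vector output by the same rotation, which is the property that motivates this layer design.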

The other part of CPE-Pro’s structural encoder is Structural Sequence Language Model, SSLM. First, the efficient protein structure data search tool Foldseek is used to convert protein structures into “structure-sequences”. The primary process involves mapping the amino acid backbone of the protein to the 3Di alphabet to achieve structural discretization. This reflects the tertiary interactions between amino acids and describes the geometric conformations of residues and their spatial neighbors.

Next, using the 3Di alphabet as the vocabulary of structural elements and based on the Transformer architecture, we pre-train a protein structural language model, SSLM, from scratch. This aims to effectively model the “structure-sequence” of proteins. The pre-training process employs the classic masked language modeling (MLM) objective[50], predicting masked elements based on the context of the “structure-sequence”. The model predicts the probability distribution $P(s_{i}\mid s_{1},\dots,s_{i-1},s_{i+1},\dots,s_{n})$ of a masked element, where $s_{i}$ is the masked structural element and $s_{1},\dots,s_{i-1},s_{i+1},\dots,s_{n}$ are its context. The loss function is defined as follows:

\begin{equation}
\mathcal{L}_{\mathcal{MLM}}=-\sum_{i=1}^{x}y_{i}\log(\hat{y}_{i}). \tag{7}
\end{equation}

where $x$ is the number of masked elements, $y_{i}$ denotes the true label of the $i$-th masked element, and $\hat{y}_{i}$ denotes the probability the model assigns over the structural element vocabulary for that position. The loss is computed only on the $x$ elements that are masked.
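The restriction of the loss to masked positions can be sketched as follows. This is a minimal NumPy illustration of Eq. (7), assuming one-hot true labels so that each term reduces to the negative log-probability of the true token; the function name `mlm_loss` and the array layout are hypothetical.

```python
import numpy as np

def mlm_loss(logits, targets, masked):
    """Sum of cross-entropy terms over masked positions only (cf. Eq. 7).

    logits:  (L, V) unnormalized scores over the structural vocabulary
    targets: (L,)   integer ids of the true tokens
    masked:  (L,)   boolean, True where the token was masked
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    masked = np.asarray(masked, dtype=bool)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_ll = log_probs[np.arange(len(targets)), targets]   # log y_hat of true token
    return float(-(token_ll * masked).sum())                 # unmasked terms contribute 0
```

With uniform predictions over a vocabulary of size $V$, each masked position contributes $\log V$, which gives a quick sanity check for an untrained model.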

Encoder of CPE-Pro. We integrated the “structure-sequence” representations output by the pre-trained protein structural sequence language model with the embeddings of protein graphs. The SSLM can learn the sequential relationships and proximity interactions between local structural elements from the “structure-sequence”. When combined with three-dimensional topological information, this approach aims to enrich and optimize the representation of protein structures. Specifically, we combined the representations obtained from the SSLM and the graph embeddings obtained from the GVP embedding layer as follows: $\mathcal{R}_{S_{seq}}\in R^{N_{S_{seq}}\times D_{S_{seq}}}$ represents the “structure-sequence” representation, where $N_{S_{seq}}$ is the length of the “structure-sequence”, and $D_{S_{seq}}$ is the feature dimension of each element in the “structure-sequence”. We align the representation dimensions of $\mathcal{R}_{S_{seq}}$ with $r_{s}\in R^{N\times D_{scalar}}$ using a linear transformation layer. Afterwards, $\mathcal{R}_{S_{seq}}$ and $r_{s}$ are fused with adaptive weights, i.e.,

\begin{equation}
\mathcal{R}_{Pro}=\mathcal{W}\cdot r_{s}+(1-\mathcal{W})\cdot\mathcal{R}_{S_{seq}}. \tag{8}
\end{equation}

Here, $\mathcal{W}$ is a learnable parameter. Finally, the topological structure and node features of graph data incorporating information from the “structure-sequence” are utilized to learn structural representations in GVP-GNN blocks. The aim is to enhance the accuracy of structural discrimination through enriched and deepened data representations.
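The fusion in Eq. (8) can be illustrated with a short NumPy sketch. It assumes that the “structure-sequence” has one token per residue (so $N_{S_{seq}}=N$), that the alignment layer is a bias-free linear map `P`, and that $\mathcal{W}$ is a scalar; none of these details is stated in the source, and the function name is hypothetical.

```python
import numpy as np

def fuse_representations(r_s, R_seq, P, w):
    """Adaptive-weight fusion of graph scalars and SSLM features (cf. Eq. 8).

    r_s:   (N, D_scalar)      scalar node embeddings from the GVP embedding layer
    R_seq: (N, D_seq)         SSLM representations, one row per residue (assumed)
    P:     (D_seq, D_scalar)  linear alignment of feature dimensions (assumed bias-free)
    w:     scalar fusion weight, learnable in the model (assumed scalar here)
    """
    aligned = R_seq @ P                       # align D_seq -> D_scalar
    return w * r_s + (1.0 - w) * aligned      # convex mix when 0 <= w <= 1
```

Setting `w = 1` recovers the pure graph embedding and `w = 0` the pure “structure-sequence” representation, so the learned weight indicates how much each modality contributes.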

Representation Classification. The protein structures are processed by the structural encoder of CPE-Pro to obtain feature vectors $\mathcal{R}_{Pro}^{\prime}$ . We designed a straightforward classification head to perform the final discrimination task. The classification head consists of three components: a pooling layer with attention[51], a multilayer perceptron (MLP), and an output (activation) layer. It aims to simplify feature dimensions and enable efficient classification. The specific process is represented by Eq. (9), where an MLP layer applies the weight matrix $W_{5}$ , the bias term $b_{3}$ , and the $ReLU$ function to transform its inputs. The output of the MLP is then passed through an activation function (ACT), which is either $Sigmoid$ or $Softmax$ depending on the task: $Sigmoid$ corresponds to the binary classification task (Crystal - AlphaFold2, C-A), while $Softmax$ is used for the multi-class classification problem (Crystal - Multiple prediction models, C-M). $Pred$ is the final output of CPE-Pro.

\begin{equation}
Pred=\mathrm{ACT}\left(\ldots \mathrm{ReLU}\left(W_{5}\cdot \mathrm{Attention}\left(\frac{1}{n}\sum_{i=1}^{n}\left(\mathcal{R}_{Pro}^{\prime}\right)^{i}\right)+b_{3}\right)\right). \tag{9}
\end{equation}
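One plausible reading of the classification head is sketched below. The source does not specify how attention and averaging interact in Eq. (9), so this sketch swaps in a standard masked attention pooling (a softmax-weighted average over real residues), followed by one ReLU layer and an output layer. The query vector `q`, the output projection `W_out`, `b_out`, and the function names are hypothetical.

```python
import numpy as np

def attention_pool(H, mask, q):
    """Softmax-weighted average of node features, ignoring padded rows.

    H: (n, d) node features, mask: (n,) boolean (True = real residue), q: (d,)
    """
    scores = np.where(mask, H @ q, -np.inf)       # padded rows get zero weight
    weights = np.exp(scores - scores[mask].max())
    weights = weights / weights.sum()
    return weights @ H                            # (d,)

def classify(H, mask, q, W5, b3, W_out, b_out):
    """Pool, apply one ReLU layer, then Sigmoid (1 output) or Softmax (>1)."""
    pooled = attention_pool(H, mask, q)
    hidden = np.maximum(W5 @ pooled + b3, 0.0)    # ReLU(W5 . pooled + b3)
    logits = W_out @ hidden + b_out
    if logits.shape[0] == 1:                      # binary task (C-A)
        return 1.0 / (1.0 + np.exp(-logits))
    e = np.exp(logits - logits.max())             # multi-class task (C-M)
    return e / e.sum()
```

Masking before the softmax means that padding added to batch proteins of different lengths cannot influence the pooled representation.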

4 Experimental Setups

4.1 Baseline Methods

We compared CPE-Pro with various embedding-based deep learning methods. The baselines comprise pre-trained PLMs (ESM1b, ESM1v[10], ESM2, ProtBert[39], and Ankh[52]), each combined with GVP-GNN to form a model that takes both amino acid sequence and structure as input, as well as SaProt.

4.2 Training Setups

We employed the AdamW[53] optimizer, setting the learning rate between 0.0001 and 0.00005, depending on the task and dataset size. Additionally, we applied CosineAnnealingLR for learning rate decay to improve convergence. We applied a dropout rate of 0.2 at the output layer and the number of epochs was set between 50 and 100. The loss functions employed were binary cross-entropy (Eq. (10)) and categorical cross-entropy (Eq. (11)). We implemented early stopping based on validation metrics such as accuracy to prevent overfitting. All protein folding, pre-training, and experiments were conducted on 8 NVIDIA RTX 3090 GPUs. Table 2 shows the sampling and partitioning of the discriminative task on CATH-PFD.
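For reference, the learning rate produced by cosine annealing can be written in closed form, $\eta_{t}=\eta_{min}+\frac{1}{2}(\eta_{max}-\eta_{min})(1+\cos(\pi t/T_{max}))$. The sketch below evaluates this schedule in plain Python; the starting rate of 0.0001 is taken from the range quoted above, while the minimum rate of 0 and the function name are assumptions, since the source does not report the scheduler's settings.

```python
import math

def cosine_annealing_lr(step, t_max, lr_max=1e-4, lr_min=0.0):
    """Closed-form cosine annealing from lr_max (step 0) to lr_min (step t_max)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / t_max))
```

The rate decays slowly at first, fastest around the midpoint, and slowly again near the end, which tends to give the optimizer small steps during the final epochs.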

Table 2: Dataset partitioning for discrimination tasks

Task	Train terms	Valid terms	Test terms	Total
C-A	10,000(5,000×2)	1,000(500×2)	1,000(500×2)	12,000
C-M	20,000(5,000×4)	2,000(500×4)	2,000(500×4)	24,000
\botrule

Note: The distribution of training, validation, and test terms for two classification tasks: C-A and C-M. Each task is divided into subsets for model development and evaluation, detailing the number of terms and the class composition within each subset.

\begin{equation}
\mathcal{L}_{C-A}=-\left[y\log(\hat{y})+\left(1-y\right)\log\left(1-\hat{y}\right)\right]. \tag{10}
\end{equation}

\begin{equation}
\mathcal{L}_{C-M}=-\sum_{i=1}^{n}y_{i}\log\left(\hat{y}_{i}\right). \tag{11}
\end{equation}

4.3 Evaluation Metrics

Performance on the SSLM pre-training tasks is measured using perplexity[54], an indicator of a language model’s predictive capability for a given text sequence. It quantifies the uncertainty of the model’s probability predictions for each token. Let SSLM assign a probability $P(S)$ to a “structure-sequence” $S=\langle s_{1},\dots,s_{n}\rangle$ of length $n$. The perplexity of the model for the “structure-sequence” is defined as:

\begin{equation}
PP(S)=P(S)^{-\frac{1}{n}}=\left(\prod_{i=1}^{n}P\left(s_{i}\mid s_{1},\dots,s_{i-1},s_{i+1},\dots,s_{n}\right)\right)^{-\frac{1}{n}}. \tag{12}
\end{equation}
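Eq. (12) is usually evaluated in log space to avoid underflow when multiplying many small probabilities. The following minimal sketch does this for a list of per-token probabilities; the function name is hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each true token (cf. Eq. 12)."""
    n = len(token_probs)
    mean_log_prob = sum(math.log(p) for p in token_probs) / n
    return math.exp(-mean_log_prob)
```

A model that spreads probability uniformly over a vocabulary of size $V$ has perplexity $V$, and a model that always assigns probability 1 to the true token has perplexity 1, so lower values indicate better prediction.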

The experiment reported several metrics to evaluate the performance of different models, including accuracy (ACC), Precision, Recall, F1-Score (F1), and Matthews correlation coefficient (MCC). Their calculation equations are as follows:

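\begin{equation}
ACC=\frac{TP+TN}{TP+FP+FN+TN}. \tag{13}
\end{equation}

\begin{equation}
Precision=\frac{TP}{TP+FP}. \tag{14}
\end{equation}

\begin{equation}
Recall=\frac{TP}{TP+FN}. \tag{15}
\end{equation}

\begin{equation}
F1=\frac{2\times TP}{2\times TP+FP+FN}. \tag{16}
\end{equation}

\begin{equation}
MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FN)\times(TP+FP)\times(TN+FN)\times(TN+FP)}}. \tag{17}
\end{equation}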

The terms TP, TN, FP, and FN denote the counts of correctly predicted positives, correctly predicted negatives, incorrectly predicted positives, and incorrectly predicted negatives, respectively.
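Eqs. (13)–(17) can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name and the example counts in the usage note are illustrative only and are not results from this study; the sketch also assumes every denominator is non-zero.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """ACC, Precision, Recall, F1 and MCC from binary confusion counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)                   # Eq. (13)
    precision = tp / (tp + fp)                              # Eq. (14)
    recall = tp / (tp + fn)                                 # Eq. (15)
    f1 = 2 * tp / (2 * tp + fp + fn)                        # Eq. (16)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom                       # Eq. (17)
    return {"acc": acc, "precision": precision, "recall": recall,
            "f1": f1, "mcc": mcc}
```

For example, `classification_metrics(tp=40, tn=45, fp=5, fn=10)` gives an accuracy of 0.85 and a recall of 0.8. Unlike accuracy, MCC uses all four counts symmetrically, so it stays informative when the classes are imbalanced.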

Table 3: Performance Comparison of Baseline Models on CATH-PFD for Structure Discrimination

Method Information			Input			Acc(%) $\uparrow$		Precision(%) $\uparrow$		Recall(%) $\uparrow$		F1-score $\uparrow$		MCC $\uparrow$
Model	LM Version	Param	AA	Structure	Struct-seq	C-A	C-M	C-A	C-M	C-A	C-M	C-A	C-M	C-A	C-M
GVP-GNN w/
ESM1b	t33	655M	\usym2713	\usym2713	\usym2717	93.9	92.0	90.7	92.0	97.8	92.0	0.941	0.920	0.881	0.893
ESM1v	t33	655M	\usym2713	\usym2713	\usym2717	92.5	90.0	88.3	90.2	98.0	90.0	0.929	0.899	0.855	0.867
ESM2	t30	153M	\usym2713	\usym2713	\usym2717	91.8	87.4	86.9	88.9	98.4	87.4	0.923	0.874	0.843	0.837
ESM2	t33	655M	\usym2713	\usym2713	\usym2717	94.4	91.9	92.6	93.0	96.4	91.9	0.945	0.919	0.889	0.896
ProtBert	t30 Uniref	423M	\usym2713	\usym2713	\usym2717	92.8	90.3	98.9	89.6	86.6	88.2	0.923	0.902	0.863	0.872
ProtBert	t30 BFD	423M	\usym2713	\usym2713	\usym2717	93.2	91.5	88.6	91.7	99.2	91.6	0.936	0.916	0.870	0.888
Ankh	base	453M	\usym2713	\usym2713	\usym2717	92.3	91.4	86.8	91.4	99.8	91.4	0.928	0.913	0.856	0.887
SaProt	t12 AF2	35M	\usym2713	\usym2717	\usym2713	71.0	44.4	68.7	47.1	77.2	44.4	0.795	0.450	0.423	0.262
SaProt	t33 PDB	650M	\usym2713	\usym2717	\usym2713	75.8	46.7	73.0	46.5	82.0	46.6	0.727	0.459	0.520	0.292
CPE-Pro	t3 Swiss-Prot	29M	\usym2717	\usym2713	\usym2713	98.5	97.2	97.6	97.3	99.4	97.2	0.985	0.972	0.970	0.963
\botrule

Note: AA: Amino Acids, struct-seq: “structure-sequence”. All results are rounded to three decimal places and the best results are in bold. The top three are highlighted by First, Second, Third.

5 Results and Analysis

CPE-Pro achieves high accuracy in structure discrimination tasks. Baseline models and CPE-Pro were first trained on the C-A task. After several iterations, CPE-Pro achieved an accuracy of 0.985 on the test set, with the other four metrics all exceeding 0.97. In contrast, although the hybrid approaches combining seven PLMs with GVP-GNN also achieved accuracies above 0.9 on the C-A task, they still fell short of CPE-Pro. Among these seven hybrid baseline methods, the middle-performing model was ESM-1v, while the other six PLMs all had top-three performances across various metrics. The best baseline accuracy was achieved by the method using a 33-layer ESM2 as the sequence encoder, reaching over 0.94, which is still 4.1 percentage points lower than CPE-Pro. The worst-performing model was the 153M-parameter ESM2, which, likely due to its smaller model size, had lower performance than the other six PLM methods. This suggests that the size and complexity of PLMs influence their ability to capture the structure and function of proteins to a certain extent, with smaller models possibly struggling to fully learn the deep connections between sequence and structure in complex tasks.

Building on this success, we extended training to the more complex C-M task. After training on multi-class structured data, the model’s performance on the C-M task gradually improved, with all five metrics converging around 0.97, demonstrating strong competitiveness. However, both versions of SaProt performed poorly in the experiments, achieving less than 80% accuracy in distinguishing between crystal and AlphaFold2-predicted structures, and failing to reach 50% accuracy in the C-M task. A detailed analysis of the underlying reasons for this will be discussed in the subsequent section.

The “structure-sequence” can be a better predictor. The pre-training of the protein structural sequence language model, SSLM, utilized 109,334 protein structures with high pLDDT scores (>0.955) from the Swiss-Prot database. The masking strategy and rate used in the pre-training task were informed by the approach proposed in [55], which comprehensively considers the model size and dataset scale. The pre-training setup is detailed in Table 4, and Figure 2 illustrates the perplexity of SSLM on the pre-training task.

The sequential encoder SSLM in CPE-Pro and the baseline models ESM1b, ESM1v, ESM2, ProtBert, and Ankh were evaluated on CATH-PFD. The pre-trained SSLM has 3 hidden layers and significantly fewer parameters. The results in Tables 3 and 5 show that including the “structure-sequence” encoder outperforms these amino acid sequence encoders on the downstream C-M task. On this structure discrimination task, we preliminarily confirm our hypothesis that language models learn sequence information obtained directly from structure discretization more efficiently. The “structure-sequence” shows greater effectiveness in protein classification tasks, which provides new directions for further optimization and design of more efficient predictive models. Independent use of SSLM and the structure-aware language model SaProt performed poorly because both rely solely on sequence input. Figure 3 shows the average pLDDT scores and the similarity between “structure-sequences” for the protein structures used in training, validation, and testing. It is evident that the high pLDDT levels across the three categories resulted in higher similarity of “structure-sequences” (training set: 73.28%; validation set: 71.58%; test set: 69.60%), leading to homogenized feature representations and limited model generalization. In contrast, SaProt’s input comes from a vocabulary of size 441 that includes both amino acid and “structure-sequence” elements. By combining these two vocabularies, some effects of high similarity are mitigated, and its performance metrics improve compared to SSLM. To demonstrate the effectiveness of SSLM in capturing structural differences, we validated it through visualization methods in subsequent sections. We also speculate that the scaling effect applies to language models trained with “structure-sequences”. In other words, increasing the model’s depth and the scale of training data could significantly enhance the model’s performance. Even without relying on a structural encoder, the model may still achieve satisfactory results.

Table 4: Pre-training Settings of SSLM

Model	Data Size	Mask Ratio	Mask Method
SSLM_t3_25M_Swiss-Prot	109,334	15%	8:1:1&random
		15%	8:1:1&distribution
		25%	9:0:1
		25%	9:1:0&distribution

The pre-training process employed four masking strategies. For the SSLM_t3_25M_Swiss-Prot model, with a 15% masking rate, two masking methods were used: an 8:1:1 ratio with random replacement and an 8:1:1 ratio with distribution. The distribution refers to the token counts observed in the pre-training dataset. When the masking rate was increased to 25%, the masking strategies were adjusted to 9:0:1, and 9:1:0 with distribution. These configurations were designed to explore the performance of the model pre-trained with the MLM task.
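The masking variants in Table 4 can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that a ratio such as 8:1:1 follows the BERT convention of mask : replace : keep, and that “distribution” means replacement tokens are drawn from the token counts of the pre-training data rather than uniformly. The names `mask_tokens`, `MASK`, and the toy sequence are hypothetical.

```python
import random
from collections import Counter

MASK = "<mask>"  # hypothetical mask token

def mask_tokens(tokens, mask_rate=0.15, ratios=(8, 1, 1),
                replacement_weights=None, vocab=None, rng=None):
    """Select about mask_rate of the positions and corrupt each one.

    ratios = (mask, replace, keep): assumed reading of "8:1:1".
    replacement_weights: optional {token: count}. If given, replacements are
    drawn from this empirical distribution (the "distribution" variant);
    otherwise they are drawn uniformly from vocab (the "random" variant).
    Assumes a non-empty token list.
    """
    rng = rng or random.Random()
    vocab = list(vocab or sorted(set(tokens)))
    n_select = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted = list(tokens)
    labels = {}  # position -> original token the model must predict
    for pos in positions:
        labels[pos] = tokens[pos]
        action = rng.choices(("mask", "replace", "keep"), weights=ratios)[0]
        if action == "mask":
            corrupted[pos] = MASK
        elif action == "replace":
            if replacement_weights:
                keys = list(replacement_weights)
                weights = [replacement_weights[k] for k in keys]
                corrupted[pos] = rng.choices(keys, weights=weights)[0]
            else:
                corrupted[pos] = rng.choice(vocab)
        # "keep": the original token stays in place but is still predicted
    return corrupted, labels

# Toy 3Di-like sequence of 100 tokens; 25% masking with a 9:1:0 ratio and
# replacements sampled from the token counts of the sequence itself.
seq = list("dvvlvvcqvpdpddpvsvvv" * 5)
corrupted, labels = mask_tokens(seq, 0.25, (9, 1, 0), Counter(seq),
                                rng=random.Random(0))
```

Under this reading, the 9:0:1 setting never replaces a token, which would explain why Table 4 lists no replacement method for it.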

Table 5: Ablation Study Results on CPE-Pro

GVP-GNN	SSLM (P-t)	SSLM (W/o P-t)	Attention Pooling	Acc(%) $\uparrow$ C-A	Acc(%) $\uparrow$ C-M	F1-score $\uparrow$ C-A	F1-score $\uparrow$ C-M	MCC $\uparrow$ C-A	MCC $\uparrow$ C-M
✗	✓	–	✓	68.5	39.8	0.694	0.372	0.371	0.204
✓	✗	–	✓	90.4	87.2	0.912	0.867	0.823	0.839
✓	✓	–	✗	95.9	94.9	0.961	0.949	0.921	0.933
✓	–	✓	✓	98.1	92.2	0.981	0.922	0.962	0.897
✓	✓	–	✓	98.5	97.2	0.985	0.972	0.970	0.963

P-t: pre-trained; W/o P-t: without pre-training; Attention Pooling: pooling layer with attention mask. The experiments remove components such as SSLM, GVP-GNN, and attention pooling to test their impact. Note: The best results are obtained with all components enabled (last row).
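As a worked reference for the MCC column, the sketch below computes the Matthews correlation coefficient in its standard multi-class generalization, which reduces to the usual binary formula when there are two classes. Whether the reported values were computed with exactly this formula is an assumption, and the function name and toy labels are illustrative.

```python
import math
from collections import Counter

def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient.

    With s samples, c correct predictions, t_k true and p_k predicted counts
    per class k:
        MCC = (c*s - sum_k t_k*p_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = Counter(y_true)
    p_k = Counter(y_pred)
    classes = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in classes)
    spread_true = s * s - sum(t_k[k] ** 2 for k in classes)
    spread_pred = s * s - sum(p_k[k] ** 2 for k in classes)
    denominator = math.sqrt(spread_true * spread_pred)
    # Convention: return 0 when all labels or all predictions fall in one class.
    return numerator / denominator if denominator else 0.0

# Toy binary example: one of four structures is assigned the wrong origin.
score = multiclass_mcc([0, 0, 1, 1], [0, 1, 1, 1])
```

Unlike accuracy, MCC stays near zero for a classifier that always predicts a single class, which makes it a useful complement when class sizes differ.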

Feature Visualization Method Powerfully Demonstrates Pretrained SSLM’s Excellence in Capturing Structural Differences. Protein language models have been shown to embed secondary and tertiary structure characteristics in their output representations of proteins [9]. We selected a subset of genetic domain sequences from the non-redundant Astral SCOPe 2.08 database in SCOPe [56], in which the identity between sequences is less than 40%. From this subset, we took the all-$\alpha$ helical proteins (2,644) and all-$\beta$ sheet proteins (3,059) and retrieved the corresponding structures from the database. Figures 4 and 5 show the t-SNE visualization of the protein representations from the last hidden layers of SSLM and various PLMs on this dataset. Aside from SaProt, which incorporates “structure-sequence” in its input, the representations of the other PLMs capture some differences between structural types but exhibit relatively weak discriminative power: after dimensionality reduction, their data points are scattered and the boundaries between the protein classes are blurred. In contrast, both SaProt and SSLM not only differentiate the two protein classes more effectively but also provide clearer class boundaries, with SSLM showing a more concentrated distribution within each class. This suggests that SSLM has stronger discriminative capability in capturing and representing protein structural features, and reflects the intrinsic characteristics of different structural types more accurately.

Ablation Study on Components of CPE-Pro. To validate the contribution of each component designed in CPE-Pro, we conducted five sets of ablation experiments on both the C-A and C-M tasks. These variations included removing the GVP-GNN, omitting the pre-training process of SSLM, removing the SSLM, eliminating attention in the pooling layer, and using all three components together. As shown in Table 5, each component made a positive contribution to the C-M task. Performance significantly declined when using only a single type of encoder, particularly when using the SSLM alone as discussed earlier. The CPE-Pro model, which employs a pre-trained SSLM, outperformed the non-pre-trained version, indicating that the pre-training process effectively helped the model learn the structural features embedded in the “structure-sequences”. Additionally, the application of attention-masked pooling layers also positively influenced model performance, further enhancing overall effectiveness.
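To make the pooling component concrete, the following is a generic sketch of attention pooling with an attention mask: padded positions are excluded before the softmax, so they receive zero weight in the pooled representation. It is not the CPE-Pro layer itself. In a real model the scores would come from a learned projection of the token embeddings; here they are passed in directly, and all names are hypothetical.

```python
import math

def masked_attention_pool(embeddings, scores, mask):
    """Weighted average of token embeddings; padded positions get zero weight.

    embeddings: list of L vectors (lists of floats), one per token
    scores:     list of L unnormalized attention scores
    mask:       list of L ints, 1 for real tokens and 0 for padding
    """
    valid = [s for s, m in zip(scores, mask) if m]
    if not valid:
        raise ValueError("mask excludes every position")
    top = max(valid)  # subtract the maximum for numerical stability
    exp_scores = [math.exp(s - top) if m else 0.0
                  for s, m in zip(scores, mask)]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    dim = len(embeddings[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, embeddings))
              for d in range(dim)]
    return pooled, weights

# The third position is padding: its large score and embedding are ignored.
pooled, weights = masked_attention_pool(
    embeddings=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    scores=[0.0, 0.0, 5.0],
    mask=[1, 1, 0],
)
```

Without the mask, the padded position in this toy example would dominate the pooled vector. This illustrates how masking can matter when sequences of different lengths are batched together.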

Case Study: Discrimination of the Structural Origins of BLAT_ECOLX and CP2C9_HUMAN. BLAT_ECOLX is a $\beta$-lactamase found in Escherichia coli that hydrolyzes the $\beta$-lactam ring of $\beta$-lactam antibiotics, rendering them inactive. It plays a significant role in the study of antibiotic resistance. CP2C9_HUMAN is human cytochrome P450 2C9, an enzyme responsible for the metabolism of various drugs, including non-steroidal anti-inflammatory drugs and anticoagulants. It plays a significant role in drug metabolism and the regulation of endogenous substances. Both proteins achieved pLDDT scores above 0.9 with all three structure prediction models, indicating high prediction accuracy with minimal deviation from the crystal structure. We input the crystal and predicted structures of both proteins into CPE-Pro for origin discrimination. Figure 6 shows that the model predicted the origins of these structures correctly and with high confidence. This result highlights the robustness of the model in assessing structural origins, even when the structural differences are very minor.

6 Discussion

In this study, we developed a protein folding dataset, CATH-PFD, derived from the non-redundant dataset in the CATH database and incorporating structures from various prediction models. By training and validating our model, CPE-Pro, on CATH-PFD, we obtained an effective solution for identifying the structural origins of proteins. In the task of structural origin recognition, CPE-Pro outperforms methods that combine amino acid sequences with structural data as well as models that use structure-aware sequences. In the case studies, CPE-Pro correctly identified the origins of both the crystal and the predicted structures.

These findings provide preliminary evidence that incorporating “structure-sequence” information significantly enhances the language model’s ability to learn protein features, enabling it to capture richer and more precise structural details, thereby improving the representation of protein structures. In subsequent visualization experiments, we further validated the sensitivity of SSLM, utilized within CPE-Pro, to structural variations, as well as its effectiveness in capturing and representing complex protein structural features. This exploration not only opens up new avenues for the application of protein structural language models but also paves the way for future developments in the paradigm and interpretability of protein structure prediction methods, offering fresh insights and possibilities for advancing practical applications in bioinformatics and structural biology.

Declarations

• Competing interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Gold and Jackson [2006] Gold, N.D., Jackson, R.M.: Fold independent structural comparisons of protein–ligand binding sites for exploring functional relationships. Journal of molecular biology 355(5), 1112–1124 (2006)
Bernstein et al. [1977] Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer Jr, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., Tasumi, M.: The protein data bank: a computer-based archival file for macromolecular structures. Journal of molecular biology 112(3), 535–542 (1977)
Case et al. [2005] Case, D.A., Cheatham III, T.E., Darden, T., Gohlke, H., Luo, R., Merz Jr, K.M., Onufriev, A., Simmerling, C., Wang, B., Woods, R.J.: The amber biomolecular simulation programs. Journal of computational chemistry 26(16), 1668–1688 (2005)
Abraham et al. [2015] Abraham, M.J., Murtola, T., Schulz, R., Páll, S., Smith, J.C., Hess, B., Lindahl, E.: Gromacs: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1, 19–25 (2015)
Altschul et al. [1990] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. Journal of molecular biology 215(3), 403–410 (1990)
Waterhouse et al. [2018] Waterhouse, A., Bertoni, M., Bienert, S., Studer, G., Tauriello, G., Gumienny, R., Heer, F.T., Beer, T.A.P., Rempfer, C., Bordoli, L., et al.: Swiss-model: homology modelling of protein structures and complexes. Nucleic acids research 46(W1), 296–303 (2018)
Luo et al. [2021] Luo, Y., Jiang, G., Yu, T., Liu, Y., Vo, L., Ding, H., Su, Y., Qian, W.W., Zhao, H., Peng, J.: Ecnet is an evolutionary context-integrated deep learning framework for protein engineering. Nature communications 12(1), 5743 (2021)
Li et al. [2023] Li, M., Kang, L., Xiong, Y., Wang, Y.G., Fan, G., Tan, P., Hong, L.: Sesnet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering. Journal of Cheminformatics 15(1), 12 (2023)
Rives et al. [2021] Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., et al.: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences 118(15), 2016239118 (2021)
Meier et al. [2021] Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., Rives, A.: Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in neural information processing systems 34, 29287–29303 (2021)
Lin et al. [2023] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., et al.: Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637), 1123–1130 (2023)
Yang et al. [2020] Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., Baker, D.: Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences 117(3), 1496–1503 (2020)
Jumper et al. [2021] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al.: Highly accurate protein structure prediction with alphafold. Nature 596(7873), 583–589 (2021)
Lin et al. [2022] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., et al.: Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv 2022, 500902 (2022)
Orengo et al. [1997] Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thornton, J.M.: Cath–a hierarchic classification of protein domain structures. Structure 5(8), 1093–1109 (1997)
LeCun et al. [1998] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
Kingma and Welling [2013] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
Graves [2012] Graves, A.: Long short-term memory. Supervised sequence labelling with recurrent neural networks, 37–45 (2012)
Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems (2017)
Radford et al. [2018] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
Nijkamp et al. [2023] Nijkamp, E., Ruffolo, J.A., Weinstein, E.N., Naik, N., Madani, A.: Progen2: exploring the boundaries of protein language models. Cell systems 14(11), 968–978 (2023)
Ferruz et al. [2022] Ferruz, N., Schmidt, S., Höcker, B.: Protgpt2 is a deep unsupervised language model for protein design. Nature communications 13(1), 4348 (2022)
Katoh et al. [2002] Katoh, K., Misawa, K., Kuma, K.-i., Miyata, T.: Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic acids research 30(14), 3059–3066 (2002)
Rao et al. [2021] Rao, R.M., Liu, J., Verkuil, R., Meier, J., Canny, J., Abbeel, P., Sercu, T., Rives, A.: Msa transformer. In: International Conference on Machine Learning, pp. 8844–8856 (2021). PMLR
Brandes et al. [2022] Brandes, N., Ofer, D., Peleg, Y., Rappoport, N., Linial, M.: Proteinbert: a universal deep-learning model of protein sequence and function. Bioinformatics 38(8), 2102–2110 (2022)
Sanyal et al. [2020] Sanyal, S., Anishchenko, I., Dagar, A., Baker, D., Talukdar, P.: Proteingcn: Protein model quality assessment using graph convolutional networks. BioRxiv, 2020–04 (2020)
Zhang et al. [2022] Zhang, Z., Xu, M., Jamasb, A., Chenthamarakshan, V., Lozano, A., Das, P., Tang, J.: Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125 (2022)
Song et al. [2024] Song, F.V., Su, J., Huang, S., Zhang, N., Li, K., Ni, M., Liao, M.: Deepss2go: protein function prediction from secondary structure. Briefings in Bioinformatics 25(3), 196 (2024)
Tan et al. [2024] Tan, Y., Li, M., Zhou, B., Zhong, B., Zheng, L., Tan, P., Zhou, Z., Yu, H., Fan, G., Hong, L.: Simple, efficient, and scalable structure-aware adapter boosts protein language models. Journal of Chemical Information and Modeling (2024)
Gligorijević et al. [2021] Gligorijević, V., Renfrew, P.D., Kosciolek, T., Leman, J.K., Berenberg, D., Vatanen, T., Chandler, C., Taylor, B.C., Fisk, I.M., Vlamakis, H., et al.: Structure-based protein function prediction using graph convolutional networks. Nature communications 12(1), 3168 (2021)
Kipf and Welling [2016] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
Wang et al. [2021] Wang, Z., Combs, S.A., Brand, R., Calvo, M.R., Xu, P., Price, G., Golovach, N., Salawu, E.O., Wise, C.J., Ponnapalli, S.P., et al.: Lm-gvp: A generalizable deep learning framework for protein property prediction from sequence and structure. bioRxiv, 2021–09 (2021)
Jing et al. [2020] Jing, B., Eismann, S., Suriana, P., Townshend, R.J.L., Dror, R.: Learning from protein structure with geometric vector perceptrons. In: International Conference on Learning Representations (2020)
Zhang et al. [2023] Zhang, Z., Wang, C., Xu, M., Chenthamarakshan, V., Lozano, A., Das, P., Tang, J.: A systematic study of joint representation learning on protein sequences and structures. arXiv preprint arXiv:2303.06275 (2023)
Tan et al. [2023] Tan, Y., Zhou, B., Zheng, L., Fan, G., Hong, L.: Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability. bioRxiv, 2023–12 (2023)
Satorras et al. [2021] Satorras, V.G., Hoogeboom, E., Welling, M.: E (n) equivariant graph neural networks. In: International Conference on Machine Learning, pp. 9323–9332 (2021). PMLR
Heinzinger et al. [2023] Heinzinger, M., Weissenow, K., Sanchez, J.G., Henkel, A., Steinegger, M., Rost, B.: Prostt5: Bilingual language model for protein sequence and structure. biorxiv (2023)
Van Kempen et al. [2024] Van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L., Söding, J., Steinegger, M.: Fast and accurate protein structure search with foldseek. Nature biotechnology 42(2), 243–246 (2024)
Elnaggar et al. [2021] Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al.: Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence 44(10), 7112–7127 (2021)
Su et al. [2023] Su, J., Han, C., Zhou, Y., Shan, J., Zhou, X., Yuan, F.: Saprot: Protein language modeling with structure-aware vocabulary. bioRxiv, 2023–10 (2023)
Sievers and Higgins [2014] Sievers, F., Higgins, D.G.: Clustal omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105–116 (2014)
Riesselman et al. [2018] Riesselman, A.J., Ingraham, J.B., Marks, D.S.: Deep generative models of genetic variation capture the effects of mutations. Nature methods 15(10), 816–822 (2018)
Wang et al. [2016] Wang, S., Peng, J., Ma, J., Xu, J.: Protein secondary structure prediction using deep convolutional neural fields. Scientific reports 6(1), 1–11 (2016)
Lafferty et al. [2001] Lafferty, J., McCallum, A., Pereira, F., et al.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Icml, vol. 1, p. 3 (2001). Williamstown, MA
Wang et al. [2017] Wang, Y., Mao, H., Yi, Z.: Protein secondary structure prediction by using deep learning method. Knowledge-Based Systems 118, 115–123 (2017)
Wang et al. [2022] Wang, W., Peng, Z., Yang, J.: Single-sequence protein structure prediction using supervised transformer protein language models. Nature Computational Science 2(12), 804–814 (2022)
Wu et al. [2022] Wu, R., Ding, F., Wang, R., Shen, R., Zhang, X., Luo, S., Su, C., Wu, Z., Xie, Q., Berger, B., et al.: High-resolution de novo structure prediction from primary sequence. BioRxiv, 2022–07 (2022)
Mirdita et al. [2022] Mirdita, M., Schütze, K., Moriwaki, Y., Heo, L., Ovchinnikov, S., Steinegger, M.: Colabfold: making protein folding accessible to all. Nature methods 19(6), 679–682 (2022)
Hsu et al. [2022] Hsu, C., Verkuil, R., Liu, J., Lin, Z., Hie, B., Sercu, T., Lerer, A., Rives, A.: Learning inverse folding from millions of predicted structures. In: International Conference on Machine Learning, pp. 8946–8970 (2022). PMLR
Devlin et al. [2018] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Tan et al. [2024] Tan, Y., Zheng, J., Hong, L., Zhou, B.: Protsolm: Protein solubility prediction with multi-modal features. arXiv preprint arXiv:2406.19744 (2024)
Elnaggar et al. [2023] Elnaggar, A., Essam, H., Salah-Eldin, W., Moustafa, W., Elkerdawy, M., Rochereau, C., Rost, B.: Ankh: Optimized protein language model unlocks general-purpose modelling. arXiv preprint arXiv:2301.06568 (2023)
Loshchilov and Hutter [2017] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Brown et al. [1990] Brown, P.F., Cocke, J., Della Pietra, S.A., Della Pietra, V.J., Jelinek, F., Lafferty, J., Mercer, R.L., Roossin, P.S.: A statistical approach to machine translation. Computational linguistics 16(2), 79–85 (1990)
Wettig et al. [2022] Wettig, A., Gao, T., Zhong, Z., Chen, D.: Should you mask 15% in masked language modeling? arXiv preprint arXiv:2202.08005 (2022)
Chandonia et al. [2019] Chandonia, J.-M., Fox, N.K., Brenner, S.E.: Scope: classification of large macromolecular structures in the structural classification of proteins—extended database. Nucleic acids research 47(D1), 475–481 (2019)