CPE-Pro: A Structure-Sensitive Deep Learning Model for Protein Representation and Origin Evaluation

Wenrui Gou (gwr@mail.ecust.edu.cn), Wenhui Ge (gwh@mail.ecust.edu.cn), Yang Tan (tyang@mail.ecust.edu.cn), Guisheng Fan (gsfan@ecust.edu.cn), Mingchen Li (lmc@mail.ecust.edu.cn), Huiqun Yu (yhq@ecust.edu.cn)

School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

Two of the authors contributed equally to this work.

Abstract

Protein structures are important for understanding their functions and interactions. Currently, many protein structure prediction methods are enriching the structure databases. Discriminating the origin of structures is crucial for distinguishing between experimentally resolved and computationally predicted structures, evaluating the reliability of prediction methods, and guiding downstream biological studies. Building on work in structure prediction, we developed a structure-sensitive supervised deep learning model, Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro), to represent and discriminate the origin of protein structures. CPE-Pro learns the structural information of proteins and captures inter-structural differences to achieve accurate traceability on four data classes, and is expected to be extended to more. In addition, we utilized Foldseek to encode protein structures into “structure-sequences” and trained a protein Structural Sequence Language Model, SSLM. Preliminary experiments demonstrated that, compared to large-scale protein language models pre-trained on vast amounts of amino acid sequences, the “structure-sequences” enable the language model to learn more informative protein features, enhancing and optimizing structural representations. We have provided the code, model weights, and all related materials at https://github.com/GouWenrui/CPE-Pro-main.git.

keywords:
Deep learning, Structure Representation, Structure-sequence, Origin prediction

1 Introduction

The function of a protein is determined by its folded structure; therefore, characterizing protein structure is crucial for understanding and studying its biological functions. Protein folding is a complex and highly coordinated process that determines the active sites, ligand-binding capabilities, and functional roles of the protein both inside and outside the cell.[1] Even slight alterations in structure can lead to significant changes in protein function, potentially resulting in disease. Consequently, a comprehensive understanding of protein folding is not only essential for elucidating its biological functions but also serves as a guide for research and applications in drug design, disease diagnosis, and treatment.

However, experimentally determining protein structures poses numerous challenges. Despite significant efforts by scientists over the past few decades, the number of experimentally determined structures reported in the Protein Data Bank (PDB)[2] remains far lower than the number of known protein sequences, creating a substantial data gap. The limited availability of protein structure data significantly hinders a comprehensive understanding of protein functions and interaction mechanisms. To address this limitation, researchers have increasingly turned to computational methods, such as protein folding simulations (AMBER[3], GROMACS[4]), to guide the determination of protein structures. One of the most representative methods is homology modeling[5, 6]. However, homology modeling relies on the similarity to known structures and cannot be accurately applied to proteins without known similar structures. While these computational methods somewhat alleviate the issue of insufficient experimental data, they still do not fully resolve the broader applicability of structure prediction, particularly for novel or complex proteins, where significant limitations remain.

As artificial intelligence technology advances, an increasing number of supervised[7, 8] and unsupervised[9, 10, 11] deep learning models are showing promise in the field of protein research, significantly enhancing the efficiency and accuracy of structure prediction. However, despite the proliferation of structure prediction models[12, 13, 14], a comprehensive evaluation of the differences and distinguishability between predicted and experimentally determined structures has yet to be undertaken. Significant discrepancies may still exist between prediction results and experimental findings, arising from the limitations of prediction models in handling the complex dynamics and interactions involved in protein folding. For example, models may fail to capture subtle changes in the protein folding process or perform poorly when dealing with proteins that have not been extensively studied. Ignoring these differences could lead to misunderstandings of protein function and behavior, potentially negatively impacting subsequent biological applications. Additionally, malicious injection in structural databases may further exacerbate inaccuracies in model downstream performance, leading to a decrease in the reliability of prediction results. Therefore, it is crucial to systematically evaluate the differences between predicted and experimental structures and to develop effective methods to distinguish the origin of protein structures. Our research aims to address this gap and propose a general and reliable solution.

Our contributions are as follows:

  • We introduce CPE-Pro, a model that excels in distinguishing between crystal and predicted protein structures by learning structural features.

  • Using the CATH 4.3.0[15] non-redundant dataset, we create the protein folding dataset CATH-PFD with multiple prediction models.

  • Preliminary experiments indicate that, compared to amino acid sequences, “structure-sequences” enable language models to learn more effective protein feature information, enriching and optimizing structural representations when integrated with graph embeddings.

  • We have open-sourced the code, model weights, and CATH-PFD dataset for CPE-Pro, providing valuable resources for protein structure research.

2 Related Works

2.1 Protein Representation Learning

In various biological tasks, such as predicting protein functions or the effects of mutations, it is essential to learn effective protein representations. Protein representation methods can be classified into three approaches based on different modalities: sequence-based, structure-based, and a combination of sequence and structure. Figure 1.a shows various protein representation methods.

Sequence-based. With the rise of deep learning and advances in high-throughput sequencing technologies, data-driven methods are gradually replacing traditional analyses based on biological or physical priors. Protein sequences can be viewed as a form of “biological text”, and convolutional neural networks[16] can directly capture local dependencies between amino acids. Techniques from natural language processing are also widely applied in protein representation learning. Models for individual protein sequences include Variational Auto-Encoders (VAE)[17], long short-term memory networks (LSTM)[18], and large pre-trained protein language models (PLMs) such as ESM-1b[9], which is based on the Transformer architecture[19]. Some studies have employed GPT[20]-based architectures for sequence modeling, using generative pre-training to represent protein sequences and make predictions, such as ProGen2[21] and ProtGPT2[22]. Compared to single sequences, multiple sequence alignment (MSA)[23] input aims to capture co-evolutionary information from a set of evolutionarily related sequences. In this context, the MSA Transformer[24] utilizes row and column attention mechanisms to model a set of protein sequences, allowing it to simultaneously consider inter-sequence relationships and conservation. Additionally, some research integrates protein sequences with other types of information. ProteinBERT[25], for example, converts gene ontology annotations into fixed-size binary vectors that are input jointly with sequences, enabling the model to leverage the connections between sequence and functional annotations effectively.

Structure-based. While sequence-based research methods have been shown to be effective in several studies, the structural information of proteins is a critical determinant of their function. Models based on graph neural networks (GNNs) demonstrate significant advantages and broad applicability in representing protein three-dimensional structures. For example, ProteinGCN[26] constructs spatially adjacent protein graphs and is trained on tasks related to both local and global accuracy in protein modeling. Pre-trained protein representation methods[27] based on three-dimensional structures represent proteins as residue graphs, where nodes correspond to the three-dimensional coordinates of $\alpha$-carbons and edges represent relationships between residues. In modeling protein secondary structures, these are often transformed into secondary structure sequences, with $\alpha$-helices, $\beta$-sheets, and random coils represented as token sequences. DeepSS2GO[28] utilizes one-hot matrices to model secondary structures for predicting key protein functions. SES-Adapter[29] enhances the performance of PLMs in downstream tasks by integrating secondary structure sequence embeddings with other types of embeddings through a cross-attention mechanism.

Integrating Sequence and Structure. Combining protein sequence and structural information not only considers the sequential characteristics of amino acids but also reveals their spatial interactions and arrangements. In this regard, DeepFRI[30] combines LSTM and graph convolutional networks (GCNs)[31] to jointly learn complex structure-function relationships. LM-GVP[32] modifies the input variables of GVP[33] by using sequence embeddings generated by ProteinBERT as input. ESM-GearNet[34] designs various fusion methods between PLMs and GearNet to investigate the effectiveness of different fusion strategies. ProtSSN[35] combines ESM2[11] with equivariant graph neural networks (E-GNNs)[36] to extract geometric features of proteins, aiming to accurately predict biological activity and thermal stability. ProstT5[37] builds on the ProtT5-XL-U50 model[39] by introducing the 3Di (3D interaction) alphabet used by Foldseek[38], and is trained on token translation tasks in two modes. Additionally, SaProt[40] creates a larger structure-aware vocabulary using two types of tokens and is pre-trained on a large-scale protein dataset with a masked language modeling task. Compared to previous work, the SSLM in CPE-Pro uses only the “structure-sequences” built from the 3Di alphabet for pre-training on Swiss-Prot, without relying on amino acid tokens, and effectively learns protein representations by integrating existing structural information with GVP.

Figure 1: a. Protein representation methods. Proteins can be input into the model in various forms, including amino acid sequences, feature maps, three-dimensional coordinates, functional descriptions, and sequences composed of structural tokens, capturing the multi-level features of proteins. b. Pre-training of SSLM. SSLM is pre-trained on over 100,000 protein structures from the Swiss-Prot database and trained on various masked language modeling tasks, learning the relationships between “structure-sequences” and their corresponding three-dimensional structural features, thereby effectively representing protein structural information. c. CPE-Pro model architecture. The CPE-Pro model integrates a pre-trained protein structure language model with a graph embedding module, inputting the combined representation into the GVP-GNN module for computation. The pooling module aggregates structural information using attention masking, enhancing the quality of the representation. Ultimately, a multilayer perceptron serves as the source discriminator, outputting predicted probabilities.

2.2 Protein Structure Prediction

Amino acids form the linear sequence of proteins, but they acquire activity and biological function only when folded into specific spatial conformations. Early structure prediction methods[41, 42] primarily relied on sequence similarity, utilizing the homology of known protein sequences for predictions. These methods infer the structure of the target protein by aligning the sequence of interest with known homologous sequences and using the structural information from these homologs. Compared to sequence-based methods, structure-based learning approaches theoretically offer better solutions for acquiring protein information. In recent years, advancements in deep learning technologies have led to breakthrough progress in protein structure prediction.

In the study of secondary structure prediction, DeepCNF[43] integrates conditional random fields (CRFs)[44] and shallow neural networks to successfully model the interdependencies between adjacent secondary structure labels. SSREDNs[45] leverage deep and recurrent structures to simulate the complex nonlinear mapping relationships between input protein features and secondary structures, while also capturing interactions among consecutive residues in the protein chain.

More research has focused on three-dimensional structure prediction. In this context, trRosetta[12] utilizes co-evolution data, combined with deep residual networks, to predict the orientation and distances between residues through Rosetta-constrained energy minimization protocols. trRosettaX-single[46] focuses on the structure prediction of single-chain proteins. Regarded as a “milestone”, AlphaFold2[13] significantly enhances the accuracy and breadth of protein structure prediction by embedding multiple sequence alignments and paired features through Evoformer. OmegaFold[47] employs GCNs and self-attention[19] to effectively capture both global and local features of protein sequences, excelling in handling proteins of varying lengths and complexities. ESMFold[14] adopts a large-scale Transformer architecture, trained on extensive protein sequence datasets, to extract deep evolutionary features from sequence information without relying on multiple sequence alignments, demonstrating exceptional performance in predicting new structural domains and distant homologs. Leveraging the predictions from these models will provide robust support for biomedical research, drug development, and further advancements across various scientific domains.

3 Materials & Methods

3.1 Dataset

In this study, we utilized the non-redundant dataset from version 4.3.0 of the CATH (Class, Architecture, Topology, Homologous superfamily) database as our benchmark dataset. CATH is a widely used protein structure classification database that categorizes protein domains by their structural and functional features in a systematic hierarchy. The high-resolution three-dimensional structures it classifies were determined experimentally and are stored in the Protein Data Bank (PDB). The non-redundant dataset applies a sequence identity filtering threshold of 40%, ensuring that no two structural domains share high sequence similarity. With redundant protein structures removed, each entry in the dataset represents a unique structure.

We extracted the amino acid sequences of proteins from the benchmark dataset. Using multiple state-of-the-art protein structure prediction models, we predicted the structures corresponding to these amino acid sequences. These structures were organized and categorized based on individual proteins and prediction models to construct a Protein Folding Dataset, CATH-PFD, which will be used for training and validating CPE-Pro. Table 1 presents detailed information about the dataset CATH-PFD.

Table 1: The information on CATH-PFD
\begin{tabular}{llll}
\toprule
Model/Origin & Number & pLDDT (\%) & Method to Obtain \\
\midrule
CATH & 31,885 & - & Downloaded from https://www.cathdb.info/ \\
AlphaFold2 & 31,885 & 92.4 & Predicted using ColabFold[48] \\
OmegaFold & 31,881 & 82.7 & Predicted using OmegaFold \\
ESMFold & 23,912 & 76.2 & Predicted using ESMFold \\
\botrule
\end{tabular}
Note: The number of structures from different origins, their pLDDT scores, and acquisition methods. The CATH database contains 31,885 protein structures. For structure prediction, we deployed OmegaFold, ESMFold, and the efficient AlphaFold2 variant, ColabFold, using publicly available resources from previous studies. Structures that could not be predicted or were erroneous were excluded during the prediction process, leading to differences in the number of entries across categories.
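
The table does not state how the per-origin pLDDT values were aggregated. As a minimal sketch, assuming each predicted structure is stored as a PDB-format file whose B-factor column holds the per-residue pLDDT on a 0 to 100 scale (a common convention for AlphaFold2-style outputs, but an assumption here), a per-structure mean could be computed as follows. The function name `mean_plddt` is hypothetical.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column over CA atoms of a PDB-format string.

    Assumption: the B-factor field stores per-residue pLDDT (0-100).
    """
    values = []
    for line in pdb_text.splitlines():
        # PDB fixed-width columns: atom name in 13-16, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no CA atoms found")
    return sum(values) / len(values)
```

Averaging these per-structure means over all structures of one origin would give a single number per row; whether the authors averaged per structure or per residue is not specified in the text.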

3.2 Model Architecture

Crystal vs Predicted Evaluator for Protein Structure (CPE-Pro), designed for discriminating structural origins, integrates two distinct structure encoders corresponding to graphical and sequential representations of the structures. Figure 1.c illustrates the detailed architecture of CPE-Pro.

CPE-Pro implements a Geometric Vector Perceptron graph neural network (GVP-GNN)[49] as part of the protein structural encoder; GVP-GNN learns both relational and geometric representations of three-dimensional macromolecular structures.

To obtain the three-dimensional structure of a protein, we focus on a single chain and ignore the parts we are not concerned with. We use the coordinates of the key atoms that constitute the protein backbone (N, CA, and C), since these atoms are necessary for understanding the protein structure. We then extract the geometric information of nodes and edges from the raw data and compute features such as distances and directions between them. For a protein graph $G=(V,E)$, where $V$ is the set of nodes and $E$ is the set of edges, each node $v\in V$ represents a residue of the protein, and each edge $e\in E$ represents an interaction or spatial proximity relationship between residues. For the $i$-th residue, the feature consists of scalars and vectors, i.e., $\mathcal{R}_{v}^{i}=(r_{s}^{i},r_{vec}^{i})$. An embedding layer computes embeddings for the protein feature $\mathcal{R}_{v}$, i.e.,

\begin{equation}
\mathcal{R}_{v}^{\prime}=GVP(r_{s},r_{vec}). \tag{1}
\end{equation}
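
To make the edge geometry that feeds this embedding concrete, the following sketch builds a residue graph from CA coordinates and computes the distance and unit direction for every edge. The text only says that edges encode interaction or spatial proximity; the k-nearest-neighbour rule, the value of `k`, and the function name `knn_edges` are assumptions made for illustration.

```python
import numpy as np

def knn_edges(ca_xyz: np.ndarray, k: int = 3):
    """Return (src, dst, dist, unit_dir) for a k-nearest-neighbour residue graph.

    ca_xyz has shape (n_residues, 3). Each edge runs from a sending
    residue `src` (j) to a receiving residue `dst` (i).
    """
    n = ca_xyz.shape[0]
    diff = ca_xyz[None, :, :] - ca_xyz[:, None, :]    # diff[i, j] = x_j - x_i
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                    # exclude self-loops
    k = min(k, n - 1)
    nbrs = np.argsort(dist, axis=1)[:, :k]            # k closest residues for each i
    dst = np.repeat(np.arange(n), k)                  # receiving node i
    src = nbrs.reshape(-1)                            # sending node j
    d = dist[dst, src]                                # scalar edge feature
    unit = diff[dst, src] / d[:, None]                # vector edge feature, from i towards j
    return src, dst, d, unit
```

Distances are natural scalar features and unit directions are natural vector features, which matches the scalar/vector split $(r_{s}, r_{vec})$ used in the text.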

The $GVP(\cdot)$ layer mainly carries out scalar-vector propagations, i.e.,

\begin{equation}
r_{s}^{\prime}=\sigma_{1}\left(W_{1}\cdot Concat\left(r_{s},Norm\left(W_{2}\cdot r_{vec}\right)\right)+b_{1}\right). \tag{2}
\end{equation}
\begin{equation}
r_{vec}^{\prime}=W_{3}\cdot\left(W_{2}\cdot r_{vec}\right)\odot\sigma_{2}\left(W_{4}\cdot r_{s}^{\prime}+b_{2}\right). \tag{3}
\end{equation}
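
Eqs. (2) and (3) can be sketched for a single node in NumPy. This is an illustration under stated assumptions rather than the authors' implementation: $Norm$ is taken to be the row-wise L2 norm of the projected vectors, $\sigma_{1}$ is ReLU, $\sigma_{2}$ is a sigmoid gate, and the function name `gvp_layer` is hypothetical.

```python
import numpy as np

def gvp_layer(r_s, r_vec, W1, W2, W3, W4, b1, b2):
    """One GVP forward pass for a single node.

    r_s: (n_s,) scalar features; r_vec: (n_v, 3) vector features.
    W2: (h, n_v), W1: (m_s, n_s + h), W4: (m_v, m_s), W3: (m_v, h).
    """
    v_h = W2 @ r_vec                                    # projected vectors, (h, 3)
    norms = np.linalg.norm(v_h, axis=-1)                # rotation-invariant scalars, (h,)
    # Eq. (2): scalar update from old scalars and vector norms (sigma_1 = ReLU here).
    r_s_new = np.maximum(0.0, W1 @ np.concatenate([r_s, norms]) + b1)
    # Eq. (3): vectors are re-projected, then gated row-wise (sigma_2 = sigmoid here).
    gate = 1.0 / (1.0 + np.exp(-(W4 @ r_s_new + b2)))   # (m_v,)
    r_vec_new = (W3 @ v_h) * gate[:, None]              # (m_v, 3)
    return r_s_new, r_vec_new
```

Because the vectors only enter the scalar path through their norms and are otherwise combined linearly, rotating the input vectors leaves the scalar output unchanged and rotates the vector output by the same rotation.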

Here, $W_{1}$, $W_{2}$, $W_{3}$, $W_{4}$, $b_{1}$, and $b_{2}$ are learnable parameters of this layer, and $\sigma_{1}(\cdot)$ and $\sigma_{2}(\cdot)$ denote activation functions. In each graph propagation step, messages from neighboring nodes and edges in the protein graph update the embedding of the current node,

\begin{equation}
r_{msg}^{j\rightarrow i}:=GVP_{conv}\left(Concat\left(r_{v}^{j},r_{e}^{j\rightarrow i}\right)\right). \tag{4}
\end{equation}
\begin{equation}
r_{v}^{i}\leftarrow LayerNorm\left(r_{v}^{i}+\frac{1}{k}Drop\left(\sum_{j:e_{ji}\in E}r_{msg}^{j\rightarrow i}\right)\right). \tag{5}
\end{equation}

Here, $r_{v}^{i}$ and $r_{e}^{j\rightarrow i}$ are the embeddings of node $i$ and $edge_{j\rightarrow i}$ as above, and $r_{msg}^{j\rightarrow i}$ represents the message passed from node $j$ to node $i$. $GVP_{conv}$ denotes the stacked 3-layer GVP. We also add a feed-forward layer to update the node embeddings at all nodes $i$:

\begin{equation}
r_{v}^{i}\leftarrow LayerNorm\left(r_{v}^{i}+Drop\left(GVP_{conv}^{\prime}\left(r_{v}^{i}\right)\right)\right). \tag{6}
\end{equation}

$GVP_{conv}^{\prime}$ is a stacked 2-layer $GVP$. The GVP-GNN block is formed by stacking the $GVP$ convolution and feed-forward transformations, as defined in Eq. (4)–(6). To enhance node representations, this block is iterated $T$ times, with $T=3$ specified for our experiments.
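
The control flow of Eqs. (4) to (6) can be sketched on scalar node features only. Several simplifications are assumed and are not taken from the paper: a single linear layer with ReLU stands in for both $GVP_{conv}$ and $GVP_{conv}^{\prime}$, vector channels are left out, dropout is omitted (as at inference time), layer normalization has no learnable gain or bias, and $k$ is taken to be the number of incoming messages at each node. All function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable affine)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gnn_block(h, e, src, dst, W_msg, W_ff):
    """One propagation block on scalar features h (n, d) with edge features e (m, d_e)."""
    n = h.shape[0]
    # Eq. (4): message j -> i from the sender embedding and the edge embedding.
    msg = np.maximum(0.0, np.concatenate([h[src], e], axis=1) @ W_msg)
    # Eq. (5): average the incoming messages and add them residually.
    agg = np.zeros_like(h)
    np.add.at(agg, dst, msg)
    k = np.maximum(np.bincount(dst, minlength=n), 1)[:, None]
    h = layer_norm(h + agg / k)
    # Eq. (6): node-wise feed-forward update with a residual connection.
    return layer_norm(h + np.maximum(0.0, h @ W_ff))

def gvp_gnn_sketch(h, e, src, dst, params):
    """Iterate the block once per (W_msg, W_ff) pair, e.g. T = 3 pairs."""
    for W_msg, W_ff in params:
        h = gnn_block(h, e, src, dst, W_msg, W_ff)
    return h
```

Nodes with no incoming edge simply keep their own embedding through the residual path, since their aggregated message is zero.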

The other part of CPE-Pro’s structural encoder is the Structural Sequence Language Model (SSLM). First, Foldseek, an efficient protein structure search tool, is used to convert protein structures into “structure-sequences”. The primary process involves mapping the amino acid backbone of the protein to the 3Di alphabet to achieve structural discretization. This alphabet reflects the tertiary interactions between amino acids and describes the geometric conformations of residues and their spatial neighbors.

Next, using the 3Di alphabet as the vocabulary of structural elements and based on the Transformer architecture, we pre-train a protein structural language model, SSLM, from scratch. This aims to effectively model the “structure-sequence” of proteins. The pre-training process employs the classic masked language modeling (MLM) objective[50], predicting masked elements based on the context of the “structure-sequence”. The model predicts the probability distribution $P(s_{i}\mid s_{1},\dots,s_{i-1},s_{i+1},\dots,s_{n})$ of a masked element, where $s_{i}$ is the masked structural element and $s_{1},\dots,s_{i-1},s_{i+1},\dots,s_{n}$ are its context. The loss function is defined as follows:

\begin{equation}
\mathcal{L}_{\mathcal{MLM}}=-\sum_{i=1}^{x}y_{i}\log(\hat{y}_{i}). \tag{7}
\end{equation}

where $\hat{y}_{i}$ denotes the model’s predicted label, indicating the probability that the $i$-th token in the structural element vocabulary is the masked element, and $y_{i}$ denotes the true label. The loss is computed only on the $x$ elements that are masked.
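
The masking step and the loss in Eq. (7) can be illustrated with a short NumPy sketch. Assumptions that are not taken from the paper: the 20 3Di states are written with the uppercase letters below, a 15% masking rate is used, every selected position is replaced by a dedicated mask token (no random or unchanged replacements), and the true label is one-hot, so the loss reduces to the summed negative log-probability of the true token at the masked positions. The helper names are hypothetical.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # assumed lettering of the 20 3Di states
MASK_ID = len(ALPHABET)             # extra input-only token id for [MASK]

def encode(seq: str) -> np.ndarray:
    """Map a structure-sequence string to integer token ids."""
    return np.array([ALPHABET.index(c) for c in seq])

def mask_tokens(ids: np.ndarray, rate: float, rng: np.random.Generator):
    """Replace a random subset of positions by MASK_ID; return (corrupted, positions)."""
    n_mask = max(1, int(round(rate * len(ids))))
    pos = rng.choice(len(ids), size=n_mask, replace=False)
    corrupted = ids.copy()
    corrupted[pos] = MASK_ID
    return corrupted, pos

def mlm_loss(logits: np.ndarray, target_ids: np.ndarray, masked_pos: np.ndarray) -> float:
    """Eq. (7) with one-hot labels: summed cross-entropy over the masked positions only."""
    z = logits[masked_pos]                                       # (x, vocab)
    z = z - z.max(axis=1, keepdims=True)                         # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    true_tokens = target_ids[masked_pos]
    return float(-log_probs[np.arange(len(masked_pos)), true_tokens].sum())
```

Here `logits` stands for the per-position output of the language model over the 20 structural tokens; unmasked positions contribute nothing to the loss.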

Encoder of CPE-Pro. We integrated the “structure-sequence” representations output by the pre-trained protein structural sequence language model with the embeddings of protein graphs. The SSLM can learn the sequential relationships and proximity interactions between local structural elements from the “structure-sequence”. When combined with three-dimensional topological information, this approach aims to enrich and optimize the representation of protein structures. Specifically, we combined the representations obtained from the SSLM and the graph embeddings obtained from the GVP embedding layer as follows. $\mathcal{R}_{S_{seq}}\in R^{N_{S_{seq}}\times D_{S_{seq}}}$ represents the “structure-sequence” representation, where $N_{S_{seq}}$ is the length of the “structure-sequence”, and $D_{S_{seq}}$ is the feature dimension of each element in the “structure-sequence”.
We align the representation dimensions of $\mathcal{R}_{S_{seq}}$ with $r_{s}\in R^{N\times D_{scalar}}$ using a linear transformation layer. Afterwards, $\mathcal{R}_{S_{seq}}$ and $r_{s}$ are fused with adaptive weights, i.e.,

\begin{equation}
\mathcal{R}_{Pro}=\mathcal{W}\cdot r_{s}+(1-\mathcal{W})\cdot\mathcal{R}_{S_{seq}}. \tag{8}
\end{equation}

Here, $\mathcal{W}$ is a learnable parameter. Finally, the topological structure and node features of graph data incorporating information from the “structure-sequence” are utilized to learn structural representations in GVP-GNN blocks. The aim is to enhance the accuracy of structural discrimination through enriched and deepened data representations.
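The fusion in Eq. (8) can be illustrated with a minimal NumPy sketch. It assumes one “structure-sequence” token per graph node (so the two inputs share their first dimension) and treats $\mathcal{W}$ as a single scalar that is fixed here rather than learned. The function name `fuse`, the projection parameters `W_proj` and `b_proj`, and all sizes are hypothetical and are not taken from the released code.

```python
import numpy as np

def fuse(r_s, R_sseq, W_proj, b_proj, w):
    """Eq. (8): R_pro = w * r_s + (1 - w) * aligned(R_sseq)."""
    # Linear layer that aligns the SSLM feature dimension with the scalar node features.
    aligned = R_sseq @ W_proj + b_proj      # (N, D_sseq) -> (N, D_scalar)
    return w * r_s + (1.0 - w) * aligned

rng = np.random.default_rng(0)
N, D_sseq, D_scalar = 5, 8, 4               # toy sizes, assumed for illustration
r_s = rng.normal(size=(N, D_scalar))        # scalar node features from the GVP embedding layer
R_sseq = rng.normal(size=(N, D_sseq))       # per-token SSLM representations
W_proj = rng.normal(size=(D_sseq, D_scalar))
b_proj = np.zeros(D_scalar)
w = 0.5                                     # learnable during training; fixed in this sketch

R_pro = fuse(r_s, R_sseq, W_proj, b_proj, w)
print(R_pro.shape)
```

With `w = 1.0` the fused features reduce to the graph features alone, and with `w = 0.0` they reduce to the projected SSLM features, which is the sense in which the weight balances the two sources.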

Representation Classification. The protein structures are processed by the structural encoder of CPE-Pro to obtain feature vectors $\mathcal{R}_{Pro}^{\prime}$. We designed a straightforward classification head to perform the final discrimination task. The classification head consists of three components: a pooling layer with attention[51], a multilayer perceptron (MLP), and an output (activation) layer. It reduces the feature dimensions and enables efficient classification. The process is given by Eq. (9), where a layer of the MLP applies the weight matrix $W_{5}$, the bias term $b_{3}$, and the $ReLU$ function to transform its inputs. The output of the MLP is then passed through an activation function (ACT), which is either $Sigmoid$ or $Softmax$: $Sigmoid$ is used for the binary classification task (Crystal - AlphaFold2, C-A), while $Softmax$ is used for the multi-class classification task (Crystal - Multiple prediction models, C-M). $Pred$ is the final output of CPE-Pro.

\begin{equation}
Pred=ACT\left(\ldots ReLU\left(W_{5}\cdot Attention\left(\frac{1}{n}\sum_{i=1}^{n}\left(\mathcal{R}_{Pro}^{\prime}\right)^{i}\right)+b_{3}\right)\right). \tag{9}
\end{equation}
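To make the structure of Eq. (9) concrete, the following NumPy sketch chains attention pooling, one ReLU layer, and the task-dependent output activation. The attention pooling shown here (softmax scores from a query vector `q`) is one common form and is an assumption, since the exact pooling layer is not spelled out in the text. The output projection `W_out`, `b_out`, the helper `init_params`, and all sizes are likewise hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_pool(R, q):
    """Collapse (n, D) residue features to (D,) with softmax attention weights (assumed form)."""
    alpha = softmax(R @ q)      # one weight per residue
    return alpha @ R

def classify(R, params, binary):
    pooled = attention_pool(R, params["q"])
    hidden = np.maximum(0.0, params["W5"] @ pooled + params["b3"])  # one MLP layer with ReLU
    logits = params["W_out"] @ hidden + params["b_out"]
    # Sigmoid for the binary C-A task, Softmax for the multi-class C-M task.
    return sigmoid(logits) if binary else softmax(logits)

def init_params(rng, d_in, d_hidden, n_out):
    return {
        "q": rng.normal(size=d_in),
        "W5": rng.normal(size=(d_hidden, d_in)),
        "b3": np.zeros(d_hidden),
        "W_out": rng.normal(size=(n_out, d_hidden)),
        "b_out": np.zeros(n_out),
    }

rng = np.random.default_rng(0)
R = rng.normal(size=(6, 4))     # n = 6 residues, D = 4 features (toy sizes)
pred_ca = classify(R, init_params(rng, 4, 8, 1), binary=True)
pred_cm = classify(R, init_params(rng, 4, 8, 4), binary=False)
print(pred_ca.shape, pred_cm.shape)
```

The C-A head emits a single probability, while the C-M head emits four class probabilities that sum to one.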

4 Experimental Setups

4.1 Baseline Methods

We compared CPE-Pro with various embedding-based deep learning methods. The baselines include the pre-trained PLMs ESM1b, ESM1v[10], ESM2, ProtBert[39], and Ankh[52], each combined with GVP-GNN to form a model that takes both amino acid sequence and structure as input, as well as SaProt.

4.2 Training Setups

We employed the AdamW[53] optimizer, setting the learning rate between 0.0001 and 0.00005, depending on the task and dataset size. Additionally, we applied CosineAnnealingLR for learning rate decay to improve convergence. We applied a dropout rate of 0.2 at the output layer and the number of epochs was set between 50 and 100. The loss functions employed were binary cross-entropy (Eq. (10)) and categorical cross-entropy (Eq. (11)). We implemented early stopping based on validation metrics such as accuracy to prevent overfitting. All protein folding, pre-training, and experiments were conducted on 8 NVIDIA RTX 3090 GPUs. Table 2 shows the sampling and partitioning of the discriminative task on CATH-PFD.
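As an illustration of the learning rate decay, the sketch below evaluates the closed-form cosine annealing schedule, $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})(1+\cos(\pi t/T_{max}))$. The starting rate of 0.0001 comes from the range stated above; the minimum rate of 0, the 50-epoch horizon, and per-epoch stepping are assumptions made for the example, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=1e-4, lr_min=0.0):
    """Closed-form cosine annealing from lr_max (step 0) down to lr_min (final step)."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

# One value per epoch for a 50-epoch run (assumed horizon).
schedule = [cosine_annealing_lr(epoch, 50) for epoch in range(51)]
print(schedule[0], schedule[25], schedule[50])
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end, which is the behaviour that motivates cosine decay for stable convergence.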

Table 2: Dataset partitioning for discrimination tasks
\begin{tabular}{lllll}
\toprule
Task & Train terms & Valid terms & Test terms & Total \\
\midrule
C-A & 10,000 (5,000$\times$2) & 1,000 (500$\times$2) & 1,000 (500$\times$2) & 12,000 \\
C-M & 20,000 (5,000$\times$4) & 2,000 (500$\times$4) & 2,000 (500$\times$4) & 24,000 \\
\botrule
\end{tabular}
\footnotetext{The distribution of training, validation, and test terms for two classification tasks: C-A and C-M. Each task is divided into subsets for model development and evaluation, detailing the number of terms and the class composition within each subset.}
\begin{equation}
\mathcal{L}_{C-A}=-\left[y\log(\hat{y})+\left(1-y\right)\log\left(1-\hat{y}\right)\right]. \tag{10}
\end{equation}
\begin{equation}
\mathcal{L}_{C-M}=-\sum_{i=1}^{n}y_{i}\log\left(\hat{y}_{i}\right). \tag{11}
\end{equation}
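Eqs. (10) and (11) can be checked numerically with a short per-sample sketch. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero; it is not part of the equations, and the function names are hypothetical.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Eq. (10) for one sample: y is 0 or 1, y_hat is the predicted probability."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # keep log() finite
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """Eq. (11) for one sample: y is a one-hot list, y_hat a list of class probabilities."""
    return -sum(y_i * math.log(max(p_i, eps)) for y_i, p_i in zip(y, y_hat))

# An uninformative prediction costs log(2) in the binary case and log(4) with four classes.
loss_ca = binary_cross_entropy(1, 0.5)
loss_cm = categorical_cross_entropy([0, 1, 0, 0], [0.25, 0.25, 0.25, 0.25])
print(loss_ca, loss_cm)
```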

4.3 Evaluation Metrics

Performance of pre-training tasks on SSLM is measured using perplexity[54], an indicator of a language model’s predictive capability for a given text sequence. It quantifies the uncertainty of the model’s probability predictions for each token. Let SSLM assign a probability $P(S)$ to a “structure-sequence” $S=\langle s_{1},\dots,s_{n}\rangle$ of length $n$. The perplexity of the model for the “structure-sequence” is defined as:

\begin{equation}
PP(S)=P(S)^{-\frac{1}{n}}=\left(\prod_{i=1}^{n}P\left(s_{i}\mid s_{1},\dots,s_{i-1},s_{i+1},\dots,s_{n}\right)\right)^{-\frac{1}{n}}. \tag{12}
\end{equation}
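A small sketch may help in reading Eq. (12). It takes the probability the model assigns to each true token given the rest of the sequence and computes the perplexity in log space, which avoids underflow for long sequences. The function name and the example probabilities are hypothetical.

```python
import math

def perplexity(token_probs):
    """Eq. (12): inverse geometric mean of the per-token conditional probabilities."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)   # log of the product over tokens
    return math.exp(-log_prob / n)

# A model that spreads its probability evenly over 20 tokens has perplexity 20,
# regardless of sequence length.
uniform_ppl = perplexity([0.05] * 30)
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform_ppl, confident_ppl)
```

Lower values mean the model is less uncertain about each token, and a perplexity of 1 corresponds to predicting every token with probability 1.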

The experiment reported several metrics to evaluate the performance of different models, including accuracy (ACC), Precision, Recall, F1-Score (F1), and Matthews correlation coefficient (MCC). Their calculation equations are as follows:

\begin{equation}
ACC=\frac{TP+TN}{TP+FP+FN+TN}. \tag{13}
\end{equation}
\begin{equation}
Precision=\frac{TP}{TP+FP}. \tag{14}
\end{equation}
\begin{equation}
Recall=\frac{TP}{TP+FN}. \tag{15}
\end{equation}
\begin{equation}
F1=\frac{2\times TP}{2\times TP+FP+FN}. \tag{16}
\end{equation}
\begin{equation}
MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FN)\times(TP+FP)\times(TN+FN)\times(TN+FP)}}. \tag{17}
\end{equation}

The terms TP, TN, FP, and FN denote the counts of correctly predicted positives, correctly predicted negatives, incorrectly predicted positives, and incorrectly predicted negatives, respectively.
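Eqs. (13) to (17) translate directly into code. The sketch below evaluates them for a binary confusion matrix; the counts in the example are invented for illustration and are not results from this study. Degenerate cases with a zero denominator are not handled.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, Precision, Recall, F1, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts for a 200-sample test set.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four counts symmetrically, so it stays informative when one class dominates the predictions.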

Table 3: Performance Comparison of Baseline Models on CATH-PFD for Structure Discrimination
\begin{tabular}{llllccccccccccccc}
\toprule
\multicolumn{4}{c}{Method} & \multicolumn{3}{c}{Information Input} & \multicolumn{2}{c}{Acc(\%) $\uparrow$} & \multicolumn{2}{c}{Precision(\%) $\uparrow$} & \multicolumn{2}{c}{Recall(\%) $\uparrow$} & \multicolumn{2}{c}{F1-score $\uparrow$} & \multicolumn{2}{c}{MCC $\uparrow$} \\
Model & LM & Version & Param & AA & Structure & Struct-seq & C-A & C-M & C-A & C-M & C-A & C-M & C-A & C-M & C-A & C-M \\
\midrule
GVP-GNN w/ & ESM1b & t33 & 655M & \usym{2713} & \usym{2713} & \usym{2717} & 93.9 & 92.0 & 90.7 & 92.0 & 97.8 & 92.0 & 0.941 & 0.920 & 0.881 & 0.893 \\
 & ESM1v & t33 & 655M & \usym{2713} & \usym{2713} & \usym{2717} & 92.5 & 90.0 & 88.3 & 90.2 & 98.0 & 90.0 & 0.929 & 0.899 & 0.855 & 0.867 \\
 & ESM2 & t30 & 153M & \usym{2713} & \usym{2713} & \usym{2717} & 91.8 & 87.4 & 86.9 & 88.9 & 98.4 & 87.4 & 0.923 & 0.874 & 0.843 & 0.837 \\
 & & t33 & 655M & \usym{2713} & \usym{2713} & \usym{2717} & 94.4 & 91.9 & 92.6 & 93.0 & 96.4 & 91.9 & 0.945 & 0.919 & 0.889 & 0.896 \\
 & ProtBert & t30 Uniref & 423M & \usym{2713} & \usym{2713} & \usym{2717} & 92.8 & 90.3 & 98.9 & 89.6 & 86.6 & 88.2 & 0.923 & 0.902 & 0.863 & 0.872 \\
 & & t30 BFD & 423M & \usym{2713} & \usym{2713} & \usym{2717} & 93.2 & 91.5 & 88.6 & 91.7 & 99.2 & 91.6 & 0.936 & 0.916 & 0.870 & 0.888 \\
 & Ankh & base & 453M & \usym{2713} & \usym{2713} & \usym{2717} & 92.3 & 91.4 & 86.8 & 91.4 & 99.8 & 91.4 & 0.928 & 0.913 & 0.856 & 0.887 \\
SaProt & & t12 AF2 & 35M & \usym{2713} & \usym{2717} & \usym{2713} & 71.0 & 44.4 & 68.7 & 47.1 & 77.2 & 44.4 & 0.795 & 0.450 & 0.423 & 0.262 \\
 & & t33 PDB & 650M & \usym{2713} & \usym{2717} & \usym{2713} & 75.8 & 46.7 & 73.0 & 46.5 & 82.0 & 46.6 & 0.727 & 0.459 & 0.520 & 0.292 \\
CPE-Pro & & t3 Swiss-Prot & 29M & \usym{2717} & \usym{2713} & \usym{2713} & 98.5 & 97.2 & 97.6 & 97.3 & 99.4 & 97.2 & 0.985 & 0.972 & 0.970 & 0.963 \\
\botrule
\end{tabular}
\footnotetext{Note: AA: Amino Acids, struct-seq: “structure-sequence”. All results are rounded to three decimal places and the best results are in bold. The top three are highlighted by First, Second, Third.}

5 Results and Analysis

CPE-Pro demonstrates exceptionally high accuracy in structure discrimination tasks. Baseline models and CPE-Pro were first trained on the C-A task. After several iterations, CPE-Pro achieved an accuracy of 0.98 on the test set, with the other four metrics all exceeding 0.97. In contrast, although the hybrid approaches combining seven PLMs with GVP-GNN also achieved accuracies above 0.9 on the C-A task, they still fell short of CPE-Pro. Among these seven hybrid baseline methods, the middle-performing model was ESM1v, while the other six PLMs all had top-three performances across various metrics. The best accuracy was achieved by the method using a 33-layer ESM2 as the sequence encoder, reaching over 0.94, which is still 4.1 percentage points lower than CPE-Pro. The worst-performing model was the 153M-parameter ESM2, which, due to its smaller model size, had lower performance compared to the other six PLM methods. This suggests that the size and complexity of PLMs influence their ability to capture the structure and function of proteins to a certain extent, with smaller models possibly struggling to fully learn the deep connections between sequence and structure in complex tasks.

Building on this success, we extended training to the more complex C-M task. After training on multi-class structure data, the model’s performance on the C-M task gradually improved, with all five metrics converging around 0.97, demonstrating strong competitiveness. However, both versions of SaProt performed poorly in the experiments, achieving less than 80% accuracy in distinguishing between crystal and AlphaFold2-predicted structures, and failing to reach 50% accuracy in the C-M task. The underlying reasons are analyzed later in this section.

The “structure-sequence” can be a better predictor. The pre-training of the protein structural sequence language model, SSLM, utilized 109,334 protein structures with high pLDDT scores (>0.955) from the Swiss-Prot database. The masking strategy and rate used in the pre-training task were informed by the approach proposed in [55], which comprehensively considers the model size and dataset scale. The setup of the pre-training task is detailed in Table 4, and Figure 2 illustrates the perplexity of the SSLM on the pre-training task.

The sequence encoder SSLM in CPE-Pro and the baseline models ESM1b, ESM1v, ESM2, ProtBert, and Ankh were evaluated on CATH-PFD. The pre-trained SSLM has 3 hidden layers and significantly fewer parameters. The results in Tables 3 and 5 show that the model with the “structure-sequence” encoder outperforms those with amino acid sequence encoders on the downstream C-M task. On a structure discrimination task, this preliminarily confirms our hypothesis that language models learn sequence information obtained directly from structure discretization more efficiently. The “structure-sequence” shows greater effectiveness in protein classification tasks, which provides new directions for further optimization and design of more efficient predictive models. Used on their own, SSLM and the structure-aware language model SaProt performed poorly because both rely solely on sequence input. Figure 3 shows the average pLDDT scores and the similarity between “structure-sequences” for the protein structures used in training, validation, and testing. It is evident that the high pLDDT levels across the three categories resulted in higher similarity of “structure-sequences” (training set: 73.28%; validation set: 71.58%; test set: 69.60%), leading to homogenized feature representations and limited model generalization. In contrast, SaProt’s input comes from a vocabulary of size 441 that includes both amino acid and “structure-sequence” elements. Combining these two vocabularies mitigates some effects of the high similarity, and its performance metrics improve compared to SSLM. To demonstrate the effectiveness of SSLM in capturing structural differences, we validate it through visualization later in this section. We also speculate that the scaling effect applies to language models trained with “structure-sequences”. In other words, increasing the model’s depth and the scale of training data could significantly enhance the model’s performance, so that even without relying on a structural encoder, the model may still achieve satisfactory results.

Table 4: Pre-training Settings of SSLM
\begin{tabular}{llll}
\toprule
Model & Data Size & Mask Ratio & Mask Method \\
\midrule
SSLM\_t3\_25M\_Swiss-Prot & 109,334 & 15\% & 8:1:1 \& random \\
 & & & 8:1:1 \& distribution \\
 & & 25\% & 9:0:1 \\
 & & & 9:1:0 \& distribution \\
\botrule
\end{tabular}
\footnotetext{The pre-training process employed four masking strategies. For the SSLM\_t3\_25M\_Swiss-Prot model, with a 15\% masking rate, two masking methods were used: an 8:1:1 ratio with random replacement and an 8:1:1 ratio with distribution. The distribution reflects the statistical number of tokens in the pre-training dataset. When the masking rate increased to 25\%, the masking strategies were adjusted to 9:0:1 and 9:1:0 with distribution. These configurations were designed to explore the performance of the model pre-trained using the MLM task.}
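The masking settings in Table 4 can be sketched as follows. The sketch assumes that each ratio is ordered as mask : replace : keep, following the usual MLM convention, and that “distribution” means drawing replacement tokens in proportion to their frequency in the data, while “random” means drawing them uniformly. The function name, the toy token string, and the mask symbol `#` are hypothetical.

```python
import random
from collections import Counter

def mask_tokens(tokens, vocab, mask_rate=0.15, ratios=(8, 1, 1),
                weights=None, seed=0, mask_token="#"):
    """Corrupt a token list for MLM; ratios are mask : replace : keep (assumed order)."""
    rng = random.Random(seed)
    n_select = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_select)
    p_mask, p_replace, _ = (r / sum(ratios) for r in ratios)
    corrupted = list(tokens)
    for pos in positions:
        u = rng.random()
        if u < p_mask:
            corrupted[pos] = mask_token
        elif u < p_mask + p_replace:
            # weights=None draws uniformly ("random"); frequency weights give "distribution".
            corrupted[pos] = rng.choices(vocab, weights=weights, k=1)[0]
        # otherwise the original token is kept but still counts as a prediction target
    return corrupted, sorted(positions)

tokens = list("dvqpvlsdcc" * 4)                 # toy "structure-sequence" of length 40
vocab = sorted(set(tokens))
counts = Counter(tokens)
freq_weights = [counts[t] for t in vocab]       # empirical token frequencies

corrupted, positions = mask_tokens(tokens, vocab)                      # 15%, 8:1:1, random
corrupted_25, positions_25 = mask_tokens(tokens, vocab, mask_rate=0.25,
                                         ratios=(9, 1, 0), weights=freq_weights)
print("".join(corrupted), positions)
```

Under this reading, the 9:0:1 setting never substitutes a different token, which would explain why it carries no “random” or “distribution” suffix in Table 4.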
Figure 2: The perplexity $\downarrow$ of SSLM on the validation set. Among the 4 training strategies, the combination of a 25% masking rate with the 9:0:1 masking method shows superior performance. The original curve depicts how perplexity changes with training steps, while the smoothed curve illustrates its trend, reducing noise and providing a clearer view of the decreasing perplexity trend.
Figure 3: The pLDDT scores of the predicted protein structures in the dataset used for training CPE-Pro and the similarity between the “structure-sequences”.
Figure 4: Using the t-SNE method, the feature embeddings of four pre-trained versions of SSLM in the SCOPe database were dimensionally reduced and visualized on a two-dimensional plane.
Figure 5: Using the t-SNE method, the feature embeddings of various PLMs in the SCOPe database were dimensionally reduced and visualized in a two-dimensional plane.
Table 5: Ablation Study Results on CPE-Pro
\begin{tabular}{cccccccccc}
\toprule
GVP-GNN & \multicolumn{2}{c}{SSLM} & Attention Pooling & \multicolumn{2}{c}{Acc(\%) $\uparrow$} & \multicolumn{2}{c}{F1-score $\uparrow$} & \multicolumn{2}{c}{MCC $\uparrow$} \\
 & P-t & W/o P-t & & C-A & C-M & C-A & C-M & C-A & C-M \\
\midrule
\usym{2717} & \usym{2713} & & \usym{2713} & 68.5 & 39.8 & 0.694 & 0.372 & 0.371 & 0.204 \\
\usym{2713} & \multicolumn{2}{c}{\usym{2717}} & \usym{2713} & 90.4 & 87.2 & 0.912 & 0.867 & 0.823 & 0.839 \\
\usym{2713} & \usym{2713} & & \usym{2717} & 95.9 & 94.9 & 0.961 & 0.949 & 0.921 & 0.933 \\
\usym{2713} & & \usym{2713} & \usym{2713} & 98.1 & 92.2 & 0.981 & 0.922 & 0.962 & 0.897 \\
\usym{2713} & \usym{2713} & & \usym{2713} & 98.5 & 97.2 & 0.985 & 0.972 & 0.970 & 0.963 \\
\botrule
\end{tabular}
\footnotetext{P-t: Pre-trained; Attention Pooling: pooling layer with attention mask. The experiments vary by removing components like SSLM, GVP-GNN, and AM-Pooling to test their impact. Note: The best results are in bold.}
Figure 6: The X-ray crystal structures 1BT5 and 5W0C, along with the outputs from three predictive models for each structure, were selected for the case study. All predicted structures achieved pLDDT scores above 90. The confidence levels (CL) demonstrate the high accuracy and robustness of CPE-Pro in structural origin discrimination tasks.

Feature Visualization Method Powerfully Demonstrates Pretrained SSLM’s Excellence in Capturing Structural Differences. Protein language models have been shown to embed secondary and tertiary structure[9] characteristics within their output representations of proteins. We selected a subset of gene domain sequences from the non-redundant Astral SCOPe 2.08 database in SCOPe[56], where the identity between sequences is less than 40%. From this subset, we focused on all-$\alpha$ helical proteins (2,644) and all-$\beta$ sheet proteins (3,059) and filtered the corresponding structural sets in the database. Figures 4 and 5 show the t-SNE visualization of the protein representations from the last hidden layers of SSLM and various PLMs on the aforementioned dataset. It is evident that, aside from SaProt, which incorporates “structure-sequence” in its input, the representations of other PLMs, while capturing some differences in structural types, exhibit relatively weak discriminative power. After dimensionality reduction, the distribution of data points becomes chaotic, and the boundaries between the protein classes are blurred. In contrast, both SaProt and SSLM not only differentiate the two protein classes more effectively but also provide clearer class boundaries, with SSLM showing a more concentrated distribution within each class. This suggests that SSLM possesses stronger discriminative capability in capturing and representing protein structural features, providing a more accurate reflection of the intrinsic characteristics of different structural types.

Ablation Study on Components of CPE-Pro. To validate the contribution of each component designed in CPE-Pro, we conducted five sets of ablation experiments on both the C-A and C-M tasks. These variations included removing the GVP-GNN, omitting the pre-training process of SSLM, removing the SSLM, eliminating attention in the pooling layer, and using all three components together. As shown in Table 5, each component made a positive contribution to the C-M task. Performance significantly declined when using only a single type of encoder, particularly when using the SSLM alone as discussed earlier. The CPE-Pro model, which employs a pre-trained SSLM, outperformed the non-pre-trained version, indicating that the pre-training process effectively helped the model learn the structural features embedded in the “structure-sequences”. Additionally, the application of attention-masked pooling layers also positively influenced model performance, further enhancing overall effectiveness.

Case Study: Discrimination of the Structural Origins of BLAT_ECOLX and CP2C9_HUMAN. BLAT_ECOLX is a $\beta$-lactamase protein found in Escherichia coli that can hydrolyze the $\beta$-lactam ring of $\beta$-lactam antibiotics, rendering these antibiotics inactive. It plays a significant role in the study of antibiotic resistance. CP2C9_HUMAN is a human Cytochrome P450 2C9 enzyme responsible for the metabolism of various drugs, including non-steroidal anti-inflammatory drugs and anticoagulants. It plays a significant role in drug metabolism and the regulation of endogenous substances. In three structural prediction models, both proteins achieved pLDDT scores above 0.9, indicating high accuracy in structure prediction with minimal deviation from the crystal structure. We input the crystal structures and predicted structures of both proteins into CPE-Pro for origin discrimination. Figure 6 demonstrates that the model successfully and confidently predicted the origins of these structures. This result highlights the robustness of the model in assessing structural origins, even in cases of very minor structural differences.

6 Discussion

In this study, we developed a protein folding dataset, CATH-PFD, derived from the non-redundant dataset in the CATH database, which incorporates structures from various prediction models. By training and validating our model, CPE-Pro, on the CATH-PFD dataset, we created an innovative and effective solution for identifying the structural origins of proteins. The CPE-Pro model excels in learning and analyzing protein structural features, outperforming methods that combine amino acid sequences with structural data, as well as models that use structure-aware sequences, in the task of structural origin recognition. In case studies, CPE-Pro demonstrated superior performance.

These findings provide preliminary evidence that incorporating “structure-sequence” information significantly enhances the language model’s ability to learn protein features, enabling it to capture richer and more precise structural details, thereby improving the representation of protein structures. In subsequent visualization experiments, we further validated the sensitivity of SSLM, utilized within CPE-Pro, to structural variations, as well as its effectiveness in capturing and representing complex protein structural features. This exploration not only opens up new avenues for the application of protein structural language models but also paves the way for future developments in the paradigm and interpretability of protein structure prediction methods, offering fresh insights and possibilities for advancing practical applications in bioinformatics and structural biology.

Declarations

  • Competing interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • Gold and Jackson [2006] Gold, N.D., Jackson, R.M.: Fold independent structural comparisons of protein–ligand binding sites for exploring functional relationships. Journal of molecular biology 355(5), 1112–1124 (2006)
  • Bernstein et al. [1977] Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer Jr, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., Tasumi, M.: The protein data bank: a computer-based archival file for macromolecular structures. Journal of molecular biology 112(3), 535–542 (1977)
  • Case et al. [2005] Case, D.A., Cheatham III, T.E., Darden, T., Gohlke, H., Luo, R., Merz Jr, K.M., Onufriev, A., Simmerling, C., Wang, B., Woods, R.J.: The amber biomolecular simulation programs. Journal of computational chemistry 26(16), 1668–1688 (2005)
  • Abraham et al. [2015] Abraham, M.J., Murtola, T., Schulz, R., Páll, S., Smith, J.C., Hess, B., Lindahl, E.: Gromacs: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1, 19–25 (2015)
  • Altschul et al. [1990] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. Journal of molecular biology 215(3), 403–410 (1990)
  • Waterhouse et al. [2018] Waterhouse, A., Bertoni, M., Bienert, S., Studer, G., Tauriello, G., Gumienny, R., Heer, F.T., Beer, T.A.P., Rempfer, C., Bordoli, L., et al.: Swiss-model: homology modelling of protein structures and complexes. Nucleic acids research 46(W1), 296–303 (2018)
  • Luo et al. [2021] Luo, Y., Jiang, G., Yu, T., Liu, Y., Vo, L., Ding, H., Su, Y., Qian, W.W., Zhao, H., Peng, J.: Ecnet is an evolutionary context-integrated deep learning framework for protein engineering. Nature communications 12(1), 5743 (2021)
  • Li et al. [2023] Li, M., Kang, L., Xiong, Y., Wang, Y.G., Fan, G., Tan, P., Hong, L.: Sesnet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering. Journal of Cheminformatics 15(1), 12 (2023)
  • Rives et al. [2021] Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., et al.: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences 118(15), 2016239118 (2021)
  • Meier et al. [2021] Meier, J., Rao, R., Verkuil, R., Liu, J., Sercu, T., Rives, A.: Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in neural information processing systems 34, 29287–29303 (2021)
  • Lin et al. [2023] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., et al.: Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637), 1123–1130 (2023)
  • Yang et al. [2020] Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., Baker, D.: Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences 117(3), 1496–1503 (2020)
  • Jumper et al. [2021] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al.: Highly accurate protein structure prediction with alphafold. nature 596(7873), 583–589 (2021)
  • Lin et al. [2022] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., et al.: Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv 2022, 500902 (2022)
  • Orengo et al. [1997] Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thornton, J.M.: Cath–a hierarchic classification of protein domain structures. Structure 5(8), 1093–1109 (1997)
  • LeCun et al. [1998] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • Kingma [2013] Kingma, D.P.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
  • Graves and Graves [2012] Graves, A., Graves, A.: Long short-term memory. Supervised sequence labelling with recurrent neural networks, 37–45 (2012)
  • Vaswani [2017] Vaswani, A.: Attention is all you need. Advances in Neural Information Processing Systems (2017)
  • Radford [2018] Radford, A.: Improving language understanding by generative pre-training (2018)
  • Nijkamp et al. [2023] Nijkamp, E., Ruffolo, J.A., Weinstein, E.N., Naik, N., Madani, A.: Progen2: exploring the boundaries of protein language models. Cell systems 14(11), 968–978 (2023)
  • Ferruz et al. [2022] Ferruz, N., Schmidt, S., Höcker, B.: Protgpt2 is a deep unsupervised language model for protein design. Nature communications 13(1), 4348 (2022)
  • Katoh et al. [2002] Katoh, K., Misawa, K., Kuma, K.-i., Miyata, T.: Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic acids research 30(14), 3059–3066 (2002)
  • Rao et al. [2021] Rao, R.M., Liu, J., Verkuil, R., Meier, J., Canny, J., Abbeel, P., Sercu, T., Rives, A.: Msa transformer. In: International Conference on Machine Learning, pp. 8844–8856 (2021). PMLR
  • Brandes et al. [2022] Brandes, N., Ofer, D., Peleg, Y., Rappoport, N., Linial, M.: Proteinbert: a universal deep-learning model of protein sequence and function. Bioinformatics 38(8), 2102–2110 (2022)
  • Sanyal et al. [2020] Sanyal, S., Anishchenko, I., Dagar, A., Baker, D., Talukdar, P.: Proteingcn: Protein model quality assessment using graph convolutional networks. BioRxiv, 2020–04 (2020)
  • Zhang et al. [2022] Zhang, Z., Xu, M., Jamasb, A., Chenthamarakshan, V., Lozano, A., Das, P., Tang, J.: Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125 (2022)
  • Song et al. [2024] Song, F.V., Su, J., Huang, S., Zhang, N., Li, K., Ni, M., Liao, M.: Deepss2go: protein function prediction from secondary structure. Briefings in Bioinformatics 25(3), 196 (2024)
  • Tan et al. [2024] Tan, Y., Li, M., Zhou, B., Zhong, B., Zheng, L., Tan, P., Zhou, Z., Yu, H., Fan, G., Hong, L.: Simple, efficient, and scalable structure-aware adapter boosts protein language models. Journal of Chemical Information and Modeling (2024)
  • Gligorijević et al. [2021] Gligorijević, V., Renfrew, P.D., Kosciolek, T., Leman, J.K., Berenberg, D., Vatanen, T., Chandler, C., Taylor, B.C., Fisk, I.M., Vlamakis, H., et al.: Structure-based protein function prediction using graph convolutional networks. Nature communications 12(1), 3168 (2021)
  • Kipf and Welling [2016] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
  • Wang et al. [2021] Wang, Z., Combs, S.A., Brand, R., Calvo, M.R., Xu, P., Price, G., Golovach, N., Salawu, E.O., Wise, C.J., Ponnapalli, S.P., et al.: Lm-gvp: A generalizable deep learning framework for protein property prediction from sequence and structure. bioRxiv, 2021–09 (2021)
  • Jing et al. [2020] Jing, B., Eismann, S., Suriana, P., Townshend, R.J.L., Dror, R.: Learning from protein structure with geometric vector perceptrons. In: International Conference on Learning Representations (2020)
  • Zhang et al. [2023] Zhang, Z., Wang, C., Xu, M., Chenthamarakshan, V., Lozano, A., Das, P., Tang, J.: A systematic study of joint representation learning on protein sequences and structures. arXiv preprint arXiv:2303.06275 (2023)
  • Tan et al. [2023] Tan, Y., Zhou, B., Zheng, L., Fan, G., Hong, L.: Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability. bioRxiv, 2023–12 (2023)
  • Satorras et al. [2021] Satorras, V.G., Hoogeboom, E., Welling, M.: E (n) equivariant graph neural networks. In: International Conference on Machine Learning, pp. 9323–9332 (2021). PMLR
  • Heinzinger et al. [2023] Heinzinger, M., Weissenow, K., Sanchez, J.G., Henkel, A., Steinegger, M., Rost, B.: Prostt5: Bilingual language model for protein sequence and structure. biorxiv (2023)
  • Van Kempen et al. [2024] Van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L., Söding, J., Steinegger, M.: Fast and accurate protein structure search with Foldseek. Nature biotechnology 42(2), 243–246 (2024)
  • Elnaggar et al. [2021] Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al.: ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence 44(10), 7112–7127 (2021)
  • Su et al. [2023] Su, J., Han, C., Zhou, Y., Shan, J., Zhou, X., Yuan, F.: SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv, 2023–10 (2023)
  • Sievers and Higgins [2014] Sievers, F., Higgins, D.G.: Clustal Omega, accurate alignment of very large numbers of sequences. Multiple sequence alignment methods, 105–116 (2014)
  • Riesselman et al. [2018] Riesselman, A.J., Ingraham, J.B., Marks, D.S.: Deep generative models of genetic variation capture the effects of mutations. Nature methods 15(10), 816–822 (2018)
  • Wang et al. [2016] Wang, S., Peng, J., Ma, J., Xu, J.: Protein secondary structure prediction using deep convolutional neural fields. Scientific reports 6(1), 1–11 (2016)
  • Lafferty et al. [2001] Lafferty, J., McCallum, A., Pereira, F., et al.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: ICML, vol. 1, p. 3 (2001). Williamstown, MA
  • Wang et al. [2017] Wang, Y., Mao, H., Yi, Z.: Protein secondary structure prediction by using deep learning method. Knowledge-Based Systems 118, 115–123 (2017)
  • Wang et al. [2022] Wang, W., Peng, Z., Yang, J.: Single-sequence protein structure prediction using supervised transformer protein language models. Nature Computational Science 2(12), 804–814 (2022)
  • Wu et al. [2022] Wu, R., Ding, F., Wang, R., Shen, R., Zhang, X., Luo, S., Su, C., Wu, Z., Xie, Q., Berger, B., et al.: High-resolution de novo structure prediction from primary sequence. bioRxiv, 2022–07 (2022)
  • Mirdita et al. [2022] Mirdita, M., Schütze, K., Moriwaki, Y., Heo, L., Ovchinnikov, S., Steinegger, M.: ColabFold: making protein folding accessible to all. Nature methods 19(6), 679–682 (2022)
  • Hsu et al. [2022] Hsu, C., Verkuil, R., Liu, J., Lin, Z., Hie, B., Sercu, T., Lerer, A., Rives, A.: Learning inverse folding from millions of predicted structures. In: International Conference on Machine Learning, pp. 8946–8970 (2022). PMLR
  • Devlin [2018] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • Tan et al. [2024] Tan, Y., Zheng, J., Hong, L., Zhou, B.: ProtSolM: Protein solubility prediction with multi-modal features. arXiv preprint arXiv:2406.19744 (2024)
  • Elnaggar et al. [2023] Elnaggar, A., Essam, H., Salah-Eldin, W., Moustafa, W., Elkerdawy, M., Rochereau, C., Rost, B.: Ankh: Optimized protein language model unlocks general-purpose modelling. arXiv preprint arXiv:2301.06568 (2023)
  • Loshchilov [2017] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  • Brown et al. [1990] Brown, P.F., Cocke, J., Della Pietra, S.A., Della Pietra, V.J., Jelinek, F., Lafferty, J., Mercer, R.L., Roossin, P.S.: A statistical approach to machine translation. Computational linguistics 16(2), 79–85 (1990)
  • Wettig et al. [2022] Wettig, A., Gao, T., Zhong, Z., Chen, D.: Should you mask 15% in masked language modeling? arXiv preprint arXiv:2202.08005 (2022)
  • Chandonia et al. [2019] Chandonia, J.-M., Fox, N.K., Brenner, S.E.: SCOPe: classification of large macromolecular structures in the structural classification of proteins—extended database. Nucleic acids research 47(D1), 475–481 (2019)