Brief Bioinform. 2022 Nov 5;23(6):bbac467. doi: 10.1093/bib/bbac467

Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations

Yue Bi ¹,², Fuyi Li ³,⁴,⁵,✉, Xudong Guo ⁶, Zhikang Wang ⁷, Tong Pan ⁸, Yuming Guo ⁹, Geoffrey I Webb ¹⁰, Jianhua Yao ¹¹,✉, Cangzhi Jia ¹²,✉, Jiangning Song ¹³,¹⁴,✉

¹ Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia

² Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia

³ Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia

⁴ College of Information Engineering, Northwest A&F University, Yangling, 712100, China

⁵ Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia

⁶ College of Information Engineering, Northwest A&F University, Yangling, 712100, China

⁷ Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia

⁸ Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia

⁹ Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria 3004, Australia

¹⁰ Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia

¹¹ Tencent AI Lab, Shenzhen, China

¹² School of Science, Dalian Maritime University, Dalian 116026, China

¹³ Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia

¹⁴ Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia

✉ Corresponding authors: Fuyi Li, College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China, E-mail: fuyi.li@nwafu.edu.cn; Jianhua Yao, Tencent AI Lab, Shenzhen, China, E-mail: jianhua.yao@gmail.com; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China, E-mail: cangzhijia@dlmu.edu.cn; Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia, E-mail: jiangning.song@monash.edu

PMCID: PMC10148739 PMID: 36341591

Abstract

Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing the superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracies of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at http://monash.bioweb.cloud.edu.au/Clarion/.

Keywords: mRNA, subcellular localization, sequence analysis, machine learning, multi-class classification, multi-label prediction

Introduction

Asymmetric mRNA distribution was first reported in the early 1980s, when actin mRNA was found to be localized in the cytoplasm of ascidian embryos [1]. Later, Lawrence and Singer also observed such a phenomenon in chicken fibroblasts via in situ hybridization [2]. In the past decade, an increasing number of studies have suggested that transcript localization is a common and efficient way to target gene products to specific intracellular regions [3–6]. A number of different regulatory processes, including cell polarity, cell motility, embryo development and asymmetric cell division, have been shown to be related to mRNA subcellular localization [7–9]. Abnormal mRNA subcellular localization has been found to be associated with numerous diseases, such as fragile X syndrome, embryonal disorders, Alzheimer’s disease and cancer [10–14]. However, there is a significant gap in the understanding of the underlying mechanisms by which mRNAs are transported within cells. Characterization of mRNA subcellular localizations can help elucidate RNA localization patterns and human disease mechanisms. With the development of high-throughput RNA sequencing technologies, an increasing number of mRNA transcripts have recently been identified. Several popular online databases have been developed to provide annotations of RNA cellular localization data for public use, such as RNALocate [15], lncATLAS [16] and lncSLdb [17]. LncATLAS and lncSLdb mainly store localization information of long non-coding RNAs, while RNALocate collects subcellular localization data for almost all kinds of RNAs. With the advances in data curation, recent years have witnessed a proliferation of computational methods for predicting RNA localizations at lower cost and higher efficiency than wet-lab experimental methods. Such computational methods provide an important complement to wet-lab methods for the identification of RNA localizations.

In the present study, we comprehensively surveyed the state-of-the-art computational approaches for mRNA subcellular localization prediction across a wide range of aspects, including their data sources, benchmark datasets, sequence encoding schemes, feature selection methods, machine learning algorithms and performance evaluation strategies, which are listed in Table 1. The mRNA subcellular localization prediction problem can generally be regarded as a multi-class classification task. We categorized the existing computational predictors into two major types according to the machine learning scheme: (1) single-label multi-class predictors and (2) multi-label multi-class predictors. For the first type, there are five predictors that address mRNA subcellular localization prediction as a single-label classification task: RNATracker [18], iLoc-mRNA [19], mRNALoc [20], mRNALocater [21] and SubLocEP [22]. In contrast to single-label multi-class classification, multi-label learning (MLL) has been attracting increasing attention in recent years. This is relevant because real-world objects can often have multiple semantic meanings simultaneously, so that an instance may be associated with multiple labels. Taking the transcriptome as an example, the mRNA Vg1 formed in the nucleus can be re-modelled in the cytoplasm during vegetal localization [23]; the mRNA bglG is localized in the membrane to form a pre-complex when co-transcribed with its sensor, whereas it is localized in the cell poles when expressed on its own [24]. This widespread multiple localization phenomenon motivates treating mRNA subcellular localization identification as a multi-label task. Recently, a predictor called DM3Loc has been developed based on the multi-head self-attention mechanism, representing the first multi-label mRNA subcellular localization prediction model [25]. In another recent work, Wang et al. constructed multi-label prediction models with multiple kernel support vector machines for mRNA, lncRNA, miRNA and snoRNA, respectively [26].

Table 1.

Summary of the reviewed predictors for mRNA subcellular localization

Type	Year	Tool	The sources of mRNA subcellular localization/sequence	Subcellular localization	Benchmark dataset size	Encoding scheme	Feature selection	Algorithm	Evaluation strategy	Web server/Github availability	Reference
Single-label	2019	RNATracker	CeFra-Seq & APEX-RIP /Ensembl	Cytosol Endoplasmic reticulum Insoluble Membranes Mitochondrial Nuclear	11 373 (Dataset 1) 13 860 (Dataset 2)	One-hot RNA secondary structure	None	CNN LSTM Attention	Tenfold cross-validation	https://www.github.com/HarveyYan/RNATracker	[18]
	2020	iLoc-mRNA	RNALocate /GenBank	Cytoplasm Cytosol Dendrite Endoplasmic reticulum Exosome Mitochondrion Nucleus Ribosome	4901	K-mer	Binomial distribution ANOVA IFS	SVM	Fivefold cross-validation	http://lin-group.cn/server/iLoc-mRNA/	[19]
	2020	mRNALoc	RNALocate	Cytoplasm Endoplasmic reticulum Extracellular region Mitochondria Nucleus	14 909	PseKNC	None	SVM	Fivefold cross-validation Independent test	http://proteininformatics.org/mkumar/mrnaloc	[20]
	2021	mRNALocater	RNALocate	Cytoplasm Endoplasmic reticulum Extracellular Mitochondria Nucleus	14 909	PseEIIP PseKNC	Remove collinear features SFS	CatBoost XGBoost LightGBM	Fivefold cross-validation Independent test	http://bio-bigdata.cn/mRNALocater	[21]
	2021	SubLocEP	RNALocate	Cytoplasm Endoplasmic reticulum Extracellular Mitochondria Nucleus	14 909	PseEIIP TNC DNC CKSNAP PCPseDNC PCPseTNC SCPseDNC SCPseTNC DACC	None	LightGBM	Fivefold cross-validation Independent test	http://lab.malab.cn/~lijing/SubLocEP.html	[22]
Multi-label	2021	DM3Loc	RNALocate /GenBank & NCBI	Cytosol Endoplasmic reticulum Exosome Membrane Nucleus Ribosome	17 870	One-hot	None	CNN Attention	Fivefold cross-validation Independent test	http://dm3loc.lin-group.cn/	[25]
Multi-label	2021	Wang’s	RNALocate	Cytoplasm Cytosol Endoplasmic reticulum Exosome Mitochondrion Nucleus Posterior Pseudopodium Ribosome	13 475	K-mer RCKmer NAC DNC TNC CKSNAP	None	SVM	Tenfold cross-validation	http://lbci.tju.edu.cn/Online_services.htm	[26]


Abbreviations: PseKNC, pseudo k-tuple nucleotide composition; PseEIIP, electron-ion interaction pseudopotential values of trinucleotide; RCKmer, reverse complement k-mer; NAC, nucleic acid composition; DNC, dinucleotide composition; TNC, trinucleotide composition; CKSNAP, composition of k-spaced nucleic acid pairs; PCPseDNC, parallel correlation pseudo dinucleotide composition; PCPseTNC, parallel correlation pseudo trinucleotide composition; SCPseDNC, series correlation pseudo dinucleotide composition; SCPseTNC, series correlation pseudo trinucleotide composition; DACC, dinucleotide-based auto-cross covariance; ANOVA, analysis of variance; IFS, incremental feature selection; SFS, sequential forward search.

To deal with MLL tasks, problem transformation algorithms such as binary relevance (BR), label powerset (LP) and classifier chains (CC) [27] are popular strategies, which essentially transform a multi-label problem into one or more single-label tasks. Amongst these, BR is the most widely used problem transformation algorithm and transforms a multi-label problem into one binary problem per label [28]. However, a weakness of BR is that it ignores the correlations between labels, thereby limiting its utility. In contrast, LP and CC are designed with awareness of the correlations between different labels. The basic idea of LP is that each unique combination of labels present in a multi-label training set is regarded as a new class of a single-label multi-class task [29]. CC generates a chain of binary classifiers, one for each label, in which the input of each subsequent classifier is augmented by the predictions of all preceding binary classifiers in the chain [30]. However, these two methods also have certain limitations. For LP, infrequent label combinations lead to sample imbalance, and LP cannot predict label combinations that do not appear in the training set. For CC, the order of the labels can affect the model quality and prediction performance [27]: if the first model of the chain predicts inaccurately, the error may propagate along the chain. To overcome this problem, Read et al. [30] proposed the ensembles of classifier chains (ECC) method. The core idea of ECC is to average the predictions of CC models over a group of random chain orderings; however, it may still be demanding in terms of computational capacity and time cost.
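The three transformations can be illustrated on a toy label matrix. The sketch below is illustrative only: the function names are hypothetical, no classifier is trained, and the chain inputs use the true values of the preceding labels at training time (a common choice; at prediction time these would be replaced by the outputs of the preceding classifiers).

```python
def binary_relevance_targets(Y):
    """BR: one binary target vector per label (label correlations are ignored)."""
    q = len(Y[0])
    return [[row[j] for row in Y] for j in range(q)]


def label_powerset_targets(Y):
    """LP: every distinct label combination becomes one class of a multi-class task."""
    classes = {}   # label combination -> class index
    targets = []
    for row in Y:
        targets.append(classes.setdefault(tuple(row), len(classes)))
    return targets, classes


def classifier_chain_inputs(X, Y, order):
    """CC: the input of each link is the features plus the labels preceding it in `order`."""
    inputs = []
    for position, _label in enumerate(order):
        previous = order[:position]
        inputs.append([list(x) + [y[k] for k in previous] for x, y in zip(X, Y)])
    return inputs


# Toy data: three samples, one feature, three labels.
X = [[0.1], [0.2], [0.3]]
Y = [[1, 0, 1], [0, 1, 1], [1, 0, 1]]

br = binary_relevance_targets(Y)
lp_targets, lp_classes = label_powerset_targets(Y)
cc = classifier_chain_inputs(X, Y, order=[2, 0, 1])
```

For three labels, BR yields three binary target vectors, LP maps each distinct label combination to a single class (so an unseen combination can never be predicted), and CC widens the input by one column per preceding label in the chosen order.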

Herein, we introduced a novel computational method termed Clarion (subcellular localization predictor), which is capable of identifying multiple subcellular localizations of mRNAs simultaneously. Firstly, we established a multi-label benchmark dataset extracted from RNALocate, consisting of nine different compartments of mRNA subcellular localizations. The $k$-mer nucleotide composition scheme was used to encode mRNA sequences. Next, we applied the weighted series approach, a problem transformation method proposed to tackle multi-label tasks, as the ensemble framework of Clarion. The weighted series algorithm incorporates the prior information of the labels during model training to improve the prediction performance. Then, after comparing the performance of several machine learning algorithms, we selected XGBoost as the base classifier of Clarion. We optimized the weight of the weighted series and the key parameters of XGBoost through 10-fold cross-validation. Additional independent tests illustrate that Clarion outperformed the existing state-of-the-art tools for identifying mRNA subcellular localizations. In addition, we employed the SHAP (SHapley Additive exPlanations) algorithm [31] to identify and interpret the $k$-mer features that made the most important contributions to the model predictions for each type of mRNA subcellular localization.

Material and methods

Benchmark dataset

In this study, all subcellular localization annotations and mRNA sequences were collected from the RNALocate database (version 2.0) [32]. The latest version of RNALocate (updated in June 2021) contained more than 210 000 RNA subcellular localization entries, encompassing more than 110 000 RNAs with 171 subcellular localizations across 104 different species. Version 2.0 provides more accurate localization annotations than the first version, facilitating the construction of a reliable benchmark dataset. More specifically, the benchmark dataset was constructed according to the following five major steps:

1) We downloaded all RNA subcellular localization annotation entries from RNALocate (version 2.0) and accordingly collected 84 792 mRNA subcellular localization entries as the initial dataset.
2) According to the statistics of the initial dataset, there were 150 different types of annotated subcellular localizations. However, some had minimal and incomplete entries and as such, we removed those subcellular localization types whose corresponding entry numbers were less than 3000. As a result, we obtained nine types of subcellular localizations with 152 887 unique transcripts including exosome, nucleus, cytosol, chromatin, nucleoplasm, ribosome, nucleolus, cytoplasm and membrane.
3) Next, we redefined the mapping relationships between mRNAs and subcellular localizations based on multiple localizations of mRNAs in the transcriptome. In particular, an mRNA can be labelled with multiple subcellular localizations instead of being only labelled with one subcellular localization.
4) To reduce the effect of sequence redundancy on the performance of the classifier, we applied CD-HIT-EST [33] to remove the redundant sequences with the 80% sequence identity threshold to ensure the similarity between any two nucleotide sequences was less than 80%. Finally, 36 971 mRNAs were obtained and used as the benchmark dataset.
5) We analyzed the distribution of sequence length of these 36 971 mRNAs, which varied from 119 nt to 12 000 nt. In view of the computing complexity and limitation of the feature engineering algorithms, the mRNA lengths were adjusted to no longer than 6000 nt. Specifically, for those mRNAs with more than 6000 nt, the first 3000 nt and the last 3000 nt were extracted and merged.
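The length adjustment in step 5 can be sketched as follows. This is a minimal illustration rather than the Clarion source code; the function name `trim_sequence` and the parameter `max_len` are hypothetical.

```python
def trim_sequence(seq, max_len=6000):
    """Keep sequences of at most max_len nt unchanged; otherwise join both ends.

    For longer sequences, the first max_len/2 nt and the last max_len/2 nt
    are extracted and merged, as described in step 5.
    """
    if len(seq) <= max_len:
        return seq
    half = max_len // 2
    return seq[:half] + seq[-half:]


# Toy sequences (not real transcripts): one short, one longer than 6000 nt.
short_seq = "ACGU" * 100                          # 400 nt, returned unchanged
long_seq = "A" * 3000 + "C" * 1000 + "G" * 3000   # 7000 nt, middle 1000 nt dropped
trimmed = trim_sequence(long_seq)
```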

Sequence vectorization

mRNA sequences need to be encoded as numeric vectors prior to the training of machine learning models. The $k$-mer nucleotide composition is one of the most widely used sequence encoding methods and has been successfully applied in a variety of bioinformatics studies [34–38]. Consider an mRNA sequence $S$ of length $L$ nt, $S = s_1 s_2 \ldots s_L$, where $s_i$ ($1 \le i \le L$) represents the nucleotide at position $i$ and is one of adenine (A), cytosine (C), guanine (G) and uracil/thymine (U/T). Accordingly, when using the $k$-mer ($k = 1, 2, \ldots$) method to encode features, the feature vector can be calculated as:

$$f_k = \left( \frac{N_1}{L-k+1}, \frac{N_2}{L-k+1}, \ldots, \frac{N_{4^k}}{L-k+1} \right)$$

where $N_t$ represents the number of occurrences of $k$-mer type $t$ along the sequence. From the equation, we can observe that the dimension of the $k$-mer vector, $4^k$, increases exponentially with $k$. These numerous features may contain a great deal of redundancy and noise, which may increase the training time and even have a negative influence on the model quality. In view of the dimension restriction and training effectiveness, we used 1-mer, 2-mer, 3-mer, 4-mer, 5-mer and 6-mer in this study.
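A minimal sketch of the $k$-mer encoding is given below. It assumes that counts are normalized by the number of windows $L - k + 1$, that U is mapped to T, and that windows containing other characters are skipped; these choices and the function names are illustrative, not taken from the Clarion implementation.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_composition(seq, k):
    """Frequencies of all 4**k k-mers in lexicographic order (A < C < G < T)."""
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:        # skip windows with ambiguous bases such as N
            counts[kmer] += 1
    if windows <= 0:              # sequence shorter than k
        return [0.0] * len(kmers)
    return [counts[kmer] / windows for kmer in kmers]


def encode(seq, ks=(1, 2, 3, 4, 5, 6)):
    """Concatenate the k-mer compositions for every k in `ks`."""
    features = []
    for k in ks:
        features.extend(kmer_composition(seq, k))
    return features


features = encode("ACGUACGUAACC")
```

With $k$ from 1 to 6, the concatenated vector has $4 + 16 + 64 + 256 + 1024 + 4096 = 5460$ dimensions, which illustrates how quickly the dimension grows with $k$.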

Weighted series

In this study, we proposed a novel problem transformation method named weighted series (WS) to tackle multi-label learning problems, designed specifically for the subcellular localization identification of mRNAs. The weighted series method involves two modules of binary classifiers: a non-label module and a fusion-label module. The non-label module trains its models only from the pre-extracted features, whereas the fusion-label module additionally incorporates prior information about the labels during model training, so that the learned prior label distributions can contribute to the model predictions. The final predictions of the two modules are combined by a user-defined weight $w$ (ranging from 0 to 1), which reflects the relevance between the labels. When there is a strong correlation between the labels, a smaller value of $w$ will promote the prediction performance.

Suppose $\mathcal{X} = \mathbb{R}^d$ represents the $d$-dimensional instance space and $\mathcal{Y} = \{y_1, y_2, \ldots, y_q\}$ represents the $q$-dimensional label space. Consider a multi-label data set $D = \{(x_i, Y_i) \mid 1 \le i \le m\}$ containing $m$ samples, where $x_i \in \mathcal{X}$ and $Y_i \subseteq \mathcal{Y}$. Both the non-label module and the fusion-label module require training $q$ binary classifiers. When training the $j$-th model ($1 \le j \le q$) of the non-label module, the training label is $y_j$, and the training input is the feature vector of the training samples, i.e., $x_i$. With regard to training the $j$-th model ($1 \le j \le q$) of the fusion-label module, the prior information on the other labels is added to the training process by combining it with the features as input. That is to say, the label is still $y_j$, whereas the training input is the fusion of $x_i$ and the other labels $\{y_k \mid k \ne j\}$. After model training, the $2q$ binary classifiers of weighted series, $\{h_j^{N}\}_{j=1}^{q}$ and $\{h_j^{F}\}_{j=1}^{q}$, can be used to conduct predictions. The prediction process of weighted series includes three steps: non-label module prediction, fusion-label module prediction and integration. Given a query mRNA sequence with feature vector $x$, firstly, the non-label models are used to conduct the prediction and generate the prediction probabilities $P^{N}$; $P^{N}$ is then fused with $x$ as the input to the fusion-label models, which output the probabilities $P^{F}$; finally, $P^{N}$ and $P^{F}$ are integrated with a user-defined weight $w$ that ranges from 0 to 1. The detailed training and prediction procedures of the weighted series are illustrated in Figure 1 and outlined in Algorithm 1.
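The training and prediction procedure described above can be sketched in a few lines. This is a hedged illustration, not the authors' Algorithm 1: the class names are hypothetical, a tiny logistic regression stands in for XGBoost, each fused input omits the label being predicted, and the integration step is assumed to be the convex combination $w P^{N} + (1 - w) P^{F}$, which is consistent with the statement that a smaller $w$ favours strongly correlated labels.

```python
import math


class TinyLogReg:
    """Stand-in base learner (hypothetical); Clarion itself uses XGBoost."""

    def __init__(self, lr=0.5, epochs=300):
        self.lr, self.epochs = lr, epochs

    def _prob(self, x):
        z = self.b + sum(wk * xk for wk, xk in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y):
        n, d = len(X), len(X[0])
        self.w, self.b = [0.0] * d, 0.0
        for _ in range(self.epochs):  # plain batch gradient descent
            errs = [self._prob(x) - t for x, t in zip(X, y)]
            grad_w = [sum(e * x[k] for e, x in zip(errs, X)) / n for k in range(d)]
            grad_b = sum(errs) / n
            self.w = [wk - self.lr * g for wk, g in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b
        return self

    def predict_proba(self, x):
        return self._prob(x)


class WeightedSeries:
    def __init__(self, w, make_clf=TinyLogReg):
        self.w, self.make_clf = w, make_clf

    @staticmethod
    def _fuse(x, label_info, j):
        # Append the information on every label except label j to the features.
        return list(x) + [v for k, v in enumerate(label_info) if k != j]

    def fit(self, X, Y):
        q = len(Y[0])
        targets = [[y[j] for y in Y] for j in range(q)]
        # Non-label module: q classifiers trained on the features only.
        self.non_label = [self.make_clf().fit(X, targets[j]) for j in range(q)]
        # Fusion-label module: features fused with the true values of the other labels.
        self.fusion = [
            self.make_clf().fit([self._fuse(x, y, j) for x, y in zip(X, Y)], targets[j])
            for j in range(q)
        ]
        return self

    def predict_proba(self, x):
        q = len(self.non_label)
        p_non = [clf.predict_proba(x) for clf in self.non_label]
        # At prediction time the true labels are unknown, so P^N takes their place.
        p_fus = [self.fusion[j].predict_proba(self._fuse(x, p_non, j)) for j in range(q)]
        # ASSUMPTION: convex combination with weight w on the non-label module.
        return [self.w * a + (1.0 - self.w) * b for a, b in zip(p_non, p_fus)]


# Toy data (made up): label j simply copies feature j.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
model = WeightedSeries(w=0.5).fit(X, Y)
probs = model.predict_proba([1, 0])
```

Under the assumed integration rule, setting $w = 1$ reduces the model to the non-label module alone, while $w = 0$ relies entirely on the fusion-label module.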

Figure 1. The workflow of the methodology of Clarion.


Performance evaluation metrics

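The model evaluation for MLL tasks is more complicated than for binary classification problems because the predictive performance for all labels should be taken into account. In this study, we employed six widely used MLL evaluation metrics [29, 39, 40] to evaluate the performance of Clarion, including example-based accuracy ($Acc_{exam}$), average precision, coverage, one-error, ranking loss and Hamming loss. Let $h$ be the learned multi-label classifier and $x_i$ be a multi-label instance; $Y_i$ and $h(x_i)$ represent the true and predicted label sets for the instance, and $\bar{Y}_i$ represents the complementary set of $Y_i$ in $\mathcal{Y}$. Accordingly, for $n$ evaluated instances, the above metrics can be formulated as follows:

$$Acc_{exam} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap h(x_i)|}{|Y_i \cup h(x_i)|}$$

$$\mathrm{Average\ precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{|\{y' \in Y_i : rank_f(x_i, y') \le rank_f(x_i, y)\}|}{rank_f(x_i, y)}$$

$$\mathrm{Coverage} = \frac{1}{n} \sum_{i=1}^{n} \max_{y \in Y_i} rank_f(x_i, y) - 1$$

$$\mathrm{One\text{-}error} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ \arg\max_{y \in \mathcal{Y}} f(x_i, y) \notin Y_i \right]$$

$$\mathrm{Ranking\ loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|Y_i| |\bar{Y}_i|} \left| \{ (y', y'') : f(x_i, y') \le f(x_i, y''),\ (y', y'') \in Y_i \times \bar{Y}_i \} \right|$$

$$\mathrm{Hamming\ loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{q} |h(x_i) \,\Delta\, Y_i|$$

where $f(x_i, y)$ is the predicted score of label $y$ for $x_i$, $rank_f(x_i, y)$ represents the rank of $y$ in $\mathcal{Y}$ based on the descending order of the scores, $|\cdot|$ represents the cardinality of a set while $q$ is the cardinality of $\mathcal{Y}$, $\Delta$ stands for the symmetric difference of two sets, and $\mathbb{1}[\cdot]$ equals 1 when the condition is met and 0 otherwise, so that the sum counts the times the condition is met.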
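The six metrics can be computed directly from the true label sets and the predicted scores. The sketch below follows the standard definitions of these metrics; the function names, the 0.5 decision threshold used to form the predicted label set and the toy inputs are assumptions, and every instance is assumed to have at least one true label.

```python
def ranks_descending(scores):
    """Return the 1-based rank of each label under descending score order."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def multilabel_metrics(true_sets, score_rows, threshold=0.5):
    """Average the six example-wise metrics over all instances."""
    n, q = len(true_sets), len(score_rows[0])
    names = ["acc_exam", "average_precision", "coverage",
             "one_error", "ranking_loss", "hamming_loss"]
    totals = dict.fromkeys(names, 0.0)
    for Y, s in zip(true_sets, score_rows):
        predicted = {j for j in range(q) if s[j] >= threshold}
        negatives = set(range(q)) - Y
        rank = ranks_descending(s)
        totals["acc_exam"] += len(Y & predicted) / len(Y | predicted)
        totals["average_precision"] += sum(
            sum(1 for k in Y if rank[k] <= rank[j]) / rank[j] for j in Y
        ) / len(Y)
        totals["coverage"] += max(rank[j] for j in Y) - 1
        top_label = max(range(q), key=lambda j: s[j])
        totals["one_error"] += 0.0 if top_label in Y else 1.0
        if negatives:  # undefined when every label is relevant; counted as 0 here
            wrong_pairs = sum(1 for a in Y for b in negatives if s[a] <= s[b])
            totals["ranking_loss"] += wrong_pairs / (len(Y) * len(negatives))
        totals["hamming_loss"] += len(Y ^ predicted) / q
    return {name: value / n for name, value in totals.items()}


# Toy example (made-up scores): two instances, three labels.
true_sets = [{0, 1}, {2}]
score_rows = [[0.9, 0.2, 0.6], [0.1, 0.3, 0.8]]
metrics = multilabel_metrics(true_sets, score_rows)
```

Larger values are better for example-based accuracy and average precision, whereas smaller values are better for coverage, one-error, ranking loss and Hamming loss.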

Results and discussion

Statistical analysis of the dataset

The benchmark dataset used in this study contained a total of 36 971 mRNA sequences, each of which might be localized to a single or multiple subcellular compartments. 12 884 mRNAs had only one localization/compartment, 4060 mRNAs had two localizations, 3442 had three localizations, 3165 mRNAs had four localizations, 3518 mRNAs had five localizations, 4258 mRNAs had six localizations, 4079 mRNAs had seven localizations, 1443 mRNAs had eight localizations and 122 mRNAs had nine localizations, as illustrated in Figure 2A. In addition, we also plotted the distribution of the positive and negative samples in the nine compartments, as shown in Figure 2B. Compared with the nucleus, nucleoplasm, chromatin, nucleolus, cytosol, membrane and ribosome, the data of exosome and cytoplasm were unevenly distributed. For the exosome, the number of the positive samples was much larger than that of the negatives, while for cytoplasm, the number of the negative samples was much larger than that of the positives. In addition, there were 31 448 mRNAs in exosome, 21 439 in nucleus, 14 237 in nucleoplasm, 14 328 in chromatin, 4016 in cytoplasm, 11 124 in nucleolus, 16 312 in cytosol, 6739 in membrane and 8680 in ribosome, respectively. The distribution of the label (i.e., subcellular localization) number in these nine compartments is shown in Figure 2C.
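The counts reported above can be cross-checked with a few lines of arithmetic. In the sketch below the numbers are copied from the text, while the variable names and the derived quantities (the total number of label assignments and the average number of labels per mRNA) are illustrative and not reported in the original analysis.

```python
# Number of mRNAs with 1, 2, ..., 9 localizations (from the text).
cardinality_counts = {1: 12884, 2: 4060, 3: 3442, 4: 3165, 5: 3518,
                      6: 4258, 7: 4079, 8: 1443, 9: 122}

# Number of mRNAs annotated to each compartment (from the text).
compartment_counts = {"exosome": 31448, "nucleus": 21439, "nucleoplasm": 14237,
                      "chromatin": 14328, "cytoplasm": 4016, "nucleolus": 11124,
                      "cytosol": 16312, "membrane": 6739, "ribosome": 8680}

n_mrnas = sum(cardinality_counts.values())
# Each mRNA with k localizations contributes k label assignments.
n_assignments = sum(k * count for k, count in cardinality_counts.items())
label_cardinality = n_assignments / n_mrnas  # average labels per mRNA

# Fraction of positive samples per compartment, which makes the imbalance explicit
# (about 85% positives for exosome and about 11% for cytoplasm).
positive_fraction = {name: count / n_mrnas for name, count in compartment_counts.items()}
```

The two tallies agree: the label assignments implied by the per-mRNA counts equal the sum of the per-compartment counts, and the total number of mRNAs matches the 36 971 sequences of the benchmark dataset.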

Figure 2. Statistical distributions of mRNA entries in the benchmark dataset curated in this study. (A) The relative percentages of mRNAs with different labels in the benchmark dataset. (B) The distribution of the positive and negative samples in the nine subcellular compartments. (C) The distribution of mRNAs with different labels in the nine compartments.

We noticed that mRNAs in the cytoplasm were mostly single-localized, mRNAs in the nucleoplasm, chromatin, nucleolus, cytosol, membrane and ribosome were mostly multi-localized, while mRNAs in exosome and nucleus appeared to be uniformly localized. In this study, we separated the benchmark dataset into the training_validation dataset and the independent test dataset by random sampling. Accordingly, 33 274 mRNAs were included in the training_validation dataset (90% of the total) and 3697 mRNAs in the independent test dataset (10% of the total). The former was used to compare the algorithms and optimize the parameters by 10-fold cross-validation, while the latter was used to evaluate and validate the model performance. Detailed localization distributions of the training_validation and independent test datasets can be found in Supplementary Table S1.

Selection of the base classifier

In this section, we performed a preliminary analysis to select a suitable base classifier. To do this, we evaluated seven popular machine learning algorithms, including k-nearest neighbour (KNN), logistic regression (LR), random forest (RF), LightGBM, XGBoost, CatBoost and multilayer perceptron (MLP) [41, 42], within the WS framework to determine the optimal base classifier. For the sake of comparison, the weight $w$ of WS was set to one-quarter, two-quarters and three-quarters of its range (0.25, 0.5 and 0.75), respectively. This saves running time and accelerates the determination of the base classifier. We employed the default parameters of each algorithm in the Scikit-learn Package and conducted a 10-fold cross-validation test on the training_validation dataset for performance comparison. As can be seen from Figure 3A, for w = 0.25, 0.5 and 0.75, XGBoost secured the best predictive performance among the seven algorithms in terms of accuracy. For w = 0.25, XGBoost achieved a slightly inferior performance to RF in terms of coverage and ranking loss but attained the best performance in terms of the other metrics (refer to Supplementary Table S2 for detailed results). In addition, XGBoost was the best-performing algorithm for w = 0.5 and 0.75 in terms of all the evaluation metrics. Consequently, the XGBoost algorithm was adopted as the base classifier of the weighted series for developing Clarion. The relationship between the model performance and the weight values is discussed in more detail in the following section.

Figure 3. (A) The example-based accuracies of seven different machine learning algorithms. (B) The performance comparison of the models trained with different weights in terms of Acc_exam, average precision, coverage, one-error, ranking loss and Hamming loss. (C) The accuracies of the binary models in weighted series for predicting each mRNA subcellular localization.

Effect of weight w

In this section, we evaluated the effect of the weight $w$ in the weighted series structure on the model performance, where $w$ ranges from 0 to 1. In particular, we evaluated the model performance on the training data through a 10-fold cross-validation test with 19 candidate weights ranging from 0.05 to 0.95 with a step size of 0.05. Figure 3B illustrates the model performance with the different candidate weights in terms of all six performance metrics. There is a clear peak or valley in each metric sub-figure, but the corresponding candidate weights are not consistent across metrics. As it is therefore difficult to directly determine the best value of $w$ for Clarion, we adopted a scoring scheme. According to the performance ranking on each evaluation metric, we assigned 5, 4, 3, 2 and 1 points to the top five candidate weights, respectively, and 0 points to those ranked sixth or lower. For instance, one candidate weight was assigned 2 points on Acc_exam as its accuracy was the fourth highest out of the 19 candidates. Similarly, this candidate was assigned 5 points for the average precision, 0 points for the coverage, 5 points for the one-error, 0 points for the ranking loss and 4 points for the Hamming loss. We then obtained an overall score of 16 points for this candidate by summing up all the above scores. Using this procedure, we calculated the overall scores of the other 18 candidate weights; the detailed results and score statistics are provided in Supplementary Table S3 and Supplementary Table S4. The best candidate reached a score of 22 points out of the 19 candidates and was accordingly adopted as the fixed weight of weighted series for Clarion.
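The rank-based scoring scheme can be sketched as follows. The function name and the metric values in the toy example are hypothetical (the actual values are in Supplementary Tables S3 and S4), and ties between candidates are not treated specially.

```python
def score_weights(metric_tables, higher_is_better, points=(5, 4, 3, 2, 1)):
    """Sum rank-based points over metrics.

    metric_tables:    {metric name: {candidate weight: metric value}}
    higher_is_better: {metric name: True if larger values are better}
    The top len(points) candidates of each metric receive points; the rest get 0.
    """
    totals = {}
    for metric, table in metric_tables.items():
        for weight in table:
            totals.setdefault(weight, 0)
        ranked = sorted(table, key=table.get, reverse=higher_is_better[metric])
        for weight, pts in zip(ranked, points):
            totals[weight] += pts
    return totals


# Toy example with three candidate weights and two metrics (made-up values).
metric_tables = {
    "acc_exam": {0.25: 0.60, 0.50: 0.62, 0.75: 0.61},
    "hamming_loss": {0.25: 0.19, 0.50: 0.18, 0.75: 0.20},
}
higher_is_better = {"acc_exam": True, "hamming_loss": False}
totals = score_weights(metric_tables, higher_is_better)
best_weight = max(totals, key=totals.get)

# The worked example in the text: 2 + 5 + 0 + 5 + 0 + 4 points over the six metrics.
example_total = sum([2, 5, 0, 5, 0, 4])
```

Example-based accuracy and average precision are ranked in descending order, whereas coverage, one-error, ranking loss and Hamming loss are ranked in ascending order because smaller values are better.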

Performance comparison with other problem transformation strategies

To demonstrate the capacity of the weighted series strategy in dealing with the multi-label multi-class mRNA subcellular localization prediction task, we used XGBoost with the default parameters as the base classifier to benchmark our proposed method against three well-known problem transformation methods, namely BR, CC and LP, using 10-fold cross-validation on the training_validation set. Because the input label order can directly affect the quality of the CC model, we ran 10 experiments for CC with 10 randomly generated label orders and used the best result for the comparison with BR, LP and WS. From the performance comparison results provided in Table 2, we found that the WS strategy displayed the best performance in terms of Acc_exam, average precision, coverage, ranking loss and Hamming loss. Although WS achieved a slightly lower performance than LP in terms of one-error, it showed a clear overall superiority in predicting mRNA subcellular localizations. To further improve the model performance, we optimized the key parameters of XGBoost, including learning_rate, n_estimators and max_depth, based on the 10-fold cross-validation test. The specific hyperparameters are provided in Supplementary Table S5. Subsequently, we retrained the final model, named Clarion, on the whole training_validation set using the weighted series strategy with the optimized hyperparameters of XGBoost.

Table 2.

Performance comparison of weighted series with binary relevance, classifier chains and label powerset

Strategy	Acc_exam	Average precision	Coverage	One-error	Ranking loss	Hamming loss
BR	0.600	0.651	6.121	0.706	0.345	0.194
CC	0.456	0.580	7.384	0.838	0.598	0.299
LP	0.558	0.626	6.271	0.601	0.443	0.229
WS	0.627	0.670	6.029	0.629	0.344	0.182


Performance comparison with existing state-of-the-art tools

In this section, Clarion’s performance was benchmarked and compared with several state-of-the-art approaches for predicting mRNA subcellular localizations. Firstly, we compared Clarion with DM3Loc and Wang’s method, the only two multi-label predictors. Clarion has only five overlapping predictable compartments with these two predictors, including cytosol, exosome, membrane (cytoplasm for Wang’s method), nucleus and ribosome. Therefore, we compared their five-label prediction performance via an independent test. Clarion and DM3Loc achieved the Acc_exam of 0.722/0.441, average precision of 0.769/0.618, coverage of 3.019/3.957, one-error of 0.463/0.869, ranking loss of 0.204/0.533 and hamming loss of 0.150/0.330 on the independent dataset. Similarly, Clarion and Wang’s method achieved Acc_exam of 0.745/0.281, average precision of 0.767/0.679, coverage of 3.127/4.516, one-error of 0.445/0.882, ranking loss of 0.241/0.716 and hamming loss of 0.146/0.375. The above results indicated that Clarion outperformed DM3Loc and Wang’s method in multi-label prediction tasks.

Afterwards, we also compared Clarion’s single-label prediction performance with that of the other state-of-the-art methods. To facilitate the performance comparison, only the methods with accessible webservers were used for prediction and comparison, including iLoc-mRNA [19], mRNALoc [20], mRNALocater [21], DM3Loc [25] and Wang’s method [26]. In particular, the RNA sequences of the independent set were uploaded to their webservers, which then output the predicted labels. As shown in Table 3, Clarion clearly outperformed the other methods in predicting cytoplasm, cytosol, exosome, membrane, nucleus and ribosome. We also found that Clarion secured accuracies of over 80% for almost all compartments, the only exceptions being cytosol and nucleus. Notably, the mRNAs of the independent set may appear in the training sets of the other methods, which may account for Clarion’s slightly lower F1 scores than Wang’s method on the prediction of cytoplasm and ribosome (more details can be found in Supplementary Table S6). These comparison results further demonstrated Clarion’s superior prediction power on single-label tasks.

Table 3.

Performance comparison between Clarion and other state-of-the-art tools in terms of the prediction accuracy on the independent test dataset

Localization	iLoc-mRNA	mRNALoc	mRNALocater	Wang’s	DM3Loc	Clarion
Chromatin	N.A.	N.A.	N.A.	N.A.	N.A.	81.47%
Cytoplasm	N.A.	54.88%	38.90%	87.10%	N.A.	91.29%
Cytosol	N.A.	N.A.	N.A.	67.81%	57.37%	79.77%
Exosome	N.A.	N.A.	N.A.	16.18%	70.00%	92.10%
Membrane	N.A.	N.A.	N.A.	N.A.	70.92%	89.15%
Nucleolus	N.A.	N.A.	N.A.	N.A.	N.A.	83.74%
Nucleoplasm	N.A.	N.A.	N.A.	N.A.	N.A.	80.74%
Nucleus	N.A.	55.18%	57.42%	60.13%	69.52%	79.23%
Ribosome	73.41%	N.A.	N.A.	81.42%	69.03%	84.74%

N.A.: not applicable

Effect of non-label and fusion-label modules in the weighted series structure

In this section, we further examined the effect of the non-label and fusion-label modules used in the WS framework. A total of 18 binary classifiers were trained in Clarion to predict the nine mRNA subcellular localizations: nine non-label models and nine fusion-label models, one of each per localization. We applied these binary models to the RNA sequences in the independent test dataset and evaluated the performance for each localization. Figure 3C shows the accuracy of the non-label and fusion-label models. The fusion-label models performed better than their non-label counterparts for all nine localizations, highlighting the necessity and effectiveness of the fusion-label models in the WS framework. It is noteworthy, however, that the better performance of the fusion-label models depends on the outputs of the non-label models, because the fusion-label models use the prediction results of the non-label models as input features for training. We therefore conclude that the non-label and fusion-label modules are complementary and that both are essential to the WS framework of Clarion.
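The dependence of the fusion-label models on the non-label models can be sketched as a two-stage pipeline. The code below is only an illustration under several assumptions: a toy nearest-centroid learner stands in for XGBoost, every fusion-label model receives the non-label predictions for all localizations, the stage-1 predictions are computed on the training data itself, and the weight parameter of the weighted series is omitted. All function names are hypothetical and this is not Clarion's actual implementation.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Model = Callable[[Vector], float]                      # maps features to a probability
Trainer = Callable[[List[Vector], List[int]], Model]   # fits one binary model


def centroid_trainer(X: List[Vector], y: List[int]) -> Model:
    """Toy base learner (stand-in for XGBoost); needs both classes in y."""
    def centroid(rows: List[Vector]) -> Vector:
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = centroid([x for x, t in zip(X, y) if t == 1])
    neg = centroid([x for x, t in zip(X, y) if t == 0])

    def dist(a: Vector, b: Vector) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    def predict(x: Vector) -> float:
        d_pos, d_neg = dist(x, pos), dist(x, neg)
        # Closer to the positive centroid gives a probability nearer to 1.
        return 0.5 if d_pos + d_neg == 0 else d_neg / (d_pos + d_neg)

    return predict


def train_two_stage(X: List[Vector], Y: List[List[int]],
                    trainer: Trainer) -> Tuple[List[Model], List[Model]]:
    n_labels = len(Y[0])
    # Stage 1: one "non-label" model per localization, sequence features only.
    stage1 = [trainer(X, [row[j] for row in Y]) for j in range(n_labels)]
    # Stage-1 outputs are appended to the sequence features (simplification:
    # in-sample predictions are used here rather than held-out ones).
    X_fused = [x + [m(x) for m in stage1] for x in X]
    # Stage 2: one "fusion-label" model per localization on the fused features.
    stage2 = [trainer(X_fused, [row[j] for row in Y]) for j in range(n_labels)]
    return stage1, stage2


def predict_two_stage(x: Vector, stage1: List[Model],
                      stage2: List[Model]) -> List[float]:
    fused = x + [m(x) for m in stage1]
    return [m(fused) for m in stage2]


# Toy data (invented): four sequences, two features, two localizations.
toy_X = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
toy_Y = [[1, 0], [1, 1], [0, 1], [0, 1]]
nonlabel_models, fusion_models = train_two_stage(toy_X, toy_Y, centroid_trainer)
toy_probs = predict_two_stage([1.0, 0.0], nonlabel_models, fusion_models)
print(len(nonlabel_models) + len(fusion_models), toy_probs)
```

With nine localizations instead of two, the same structure yields the 18 binary classifiers described above.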

Model interpretation

Shapley additive explanations (SHAP) is a method based on cooperative game theory for interpreting machine learning models [31]. It uses Shapley values to rank and evaluate the importance of each feature and to explain individual predictions. SHAP has been successfully applied to a variety of bioinformatics tasks [43–46]. In this study, we used Shapley values to identify the k-mer fragments of mRNAs that are important for subcellular localization prediction. For the nine binary models of the non-label module, the Shapley value of each k-mer feature was calculated and ranked using the SHAP Python package (https://shap.readthedocs.io/en/latest/index.html). Figure 4 and Supplementary Figures S1-S3 show the top 15 k-mer features of Clarion for predicting mRNA subcellular localizations. We found that some features are important for only one localization, such as ‘TTG’ and ‘GGGCGC’ in the nucleus, ‘GACGC’ and ‘GCGGCA’ in chromatin, and ‘GGATCT’ and ‘GGCCG’ in the ribosome. For example, ‘TTG’ was identified as the most important feature for nucleus prediction in terms of the SHAP value but did not appear among the top 15 features for any of the other eight localizations. For ‘TTG’ in Figure 4A, each point represents an instance, and the colour runs from blue to red as the feature value goes from small to large. When ‘TTG’ took high Shapley values, it pushed the model towards a positive prediction of nucleus, and vice versa. The effect of ‘TTG’ can therefore be summarized as follows: larger values promote the nucleus localization prediction. In contrast, larger values of ‘GGGCGC’ promote the non-nucleus localization prediction. These k-mer segments may be part of, or related to, protein recognition motifs for mRNA-specific localization. Interestingly, the repeat motif ‘UUCAC’ has been found to be crucial for localization via binding with Vg1PBP [47–49]; in the DNA alphabet this motif is ‘TTCAC’, which is contained in the feature ‘TTCACC’ that was ranked thirteenth in exosome prediction.
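Features such as ‘TTG’, ‘GACGC’ and ‘GGGCGC’ are k-mer composition features with k = 3, 5 and 6. The sketch below shows one common way to compute such features and is given only as an illustration: the use of overlapping windows, the normalisation by the number of valid windows and the function name kmer_composition are assumptions rather than details taken from this section. Converting U to T lets an RNA motif such as ‘UUCAC’ be compared with features written in the DNA alphabet.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_composition(seq: str, k: int) -> Dict[str, float]:
    """Frequency of every possible k-mer over overlapping windows of seq."""
    seq = seq.upper().replace("U", "T")  # write RNA in the DNA alphabet
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Windows containing ambiguous bases (e.g. N) are ignored.
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    features = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        features[kmer] = counts[kmer] / total if total else 0.0
    return features


# The RNA string 'UUCACC' becomes 'TTCACC' and yields four 3-mers.
toy_3mers = kmer_composition("UUCACC", 3)
print(len(toy_3mers), toy_3mers["TTC"], toy_3mers["CAC"])
```

A sequence shorter than k yields an all-zero vector in this sketch, so every sequence is mapped to a vector of the same length (4^k entries for a given k).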

Figure 4.

Feature importance ranking based on the Shapley values. (A) Top 15 features for the nucleus, (B) top 15 features for the chromatin, (C) top 15 features for the ribosome and (D) top 15 features among all nine subcellular localizations.

In addition, certain k-mer features were found to be important for the prediction of several subcellular localizations. For instance, ‘GCGGC’ was ranked second in nucleus prediction, tenth in chromatin prediction and tenth in ribosome prediction, as shown in Figure 4A–C. This can also be observed in Figure 4D, which shows the top 15 features ranked according to the average of the absolute SHAP values, together with the proportion of that importance attributable to each mRNA subcellular localization. The feature ‘GCGGC’ in Figure 4D is more important for the prediction of the nucleus, ribosome, chromatin and nucleoplasm. Moreover, the feature ‘AAAAAA’ was important for almost all subcellular localizations: larger values drove the positive prediction of cytoplasm but the negative prediction of exosome, chromatin, nucleoplasm, nucleolus, cytosol and ribosome.
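A ranking by the average of the absolute SHAP values can be sketched as follows. The code assumes that a matrix of SHAP values (rows are instances, columns are features) is already available for each localization, for example from the SHAP package; it does not call that package. Summing the per-localization averages to obtain an overall ranking is an assumption about how a panel such as Figure 4D is built, the function names are hypothetical and the numbers are invented.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # rows: instances, columns: features


def mean_abs_shap(shap_values: Matrix) -> List[float]:
    """Average absolute SHAP value of each feature over all instances."""
    n = len(shap_values)
    return [sum(abs(v) for v in column) / n for column in zip(*shap_values)]


def rank_features(per_localization: Dict[str, Matrix], feature_names: List[str],
                  top: int = 15) -> List[Tuple[str, float, Dict[str, float]]]:
    """Rank features by importance summed over localizations.

    Returns (feature, total importance, share of that total per localization).
    """
    importance = {loc: mean_abs_shap(m) for loc, m in per_localization.items()}
    totals = [sum(importance[loc][j] for loc in importance)
              for j in range(len(feature_names))]
    order = sorted(range(len(feature_names)), key=lambda j: -totals[j])[:top]
    ranked = []
    for j in order:
        shares = {loc: (importance[loc][j] / totals[j] if totals[j] else 0.0)
                  for loc in importance}
        ranked.append((feature_names[j], totals[j], shares))
    return ranked


# Invented SHAP values: two instances, three features, two localizations.
toy_names = ["TTG", "GCGGC", "AAAAAA"]
toy_shap = {
    "nucleus": [[0.4, -0.2, 0.0], [-0.6, 0.2, 0.1]],
    "ribosome": [[0.0, 0.1, -0.3], [0.0, -0.1, 0.1]],
}
toy_ranking = rank_features(toy_shap, toy_names)
for name, total, shares in toy_ranking:
    print(name, round(total, 3), {loc: round(s, 2) for loc, s in shares.items()})
```

In this toy input, ‘TTG’ matters for a single localization while ‘GCGGC’ spreads its importance over both, mirroring the two kinds of features discussed above.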

Webserver implementation

To help the wider research community make subcellular localization predictions, we developed a web server for Clarion that is freely available at http://monash.bioweb.cloud.edu.au/Clarion/. The web server is maintained by the Nectar Research Cloud and runs on a Linux server equipped with a 4-core CPU, 8 GB of memory and a 30 GB hard disk. The web pages were implemented in PHP and have been tested on several popular web browsers, including Google Chrome, Microsoft Edge, Internet Explorer, Mozilla Firefox and Safari. Users can either paste query mRNA sequences into the text box or upload a sequence file in FASTA format via the file-selection dialogue box. Note that query sequences longer than 6000 nt are truncated to 6000 nt. The prediction results are presented in a table containing detailed information on the sequences and their predicted subcellular localization types. The web server also provides a probability score in the range of 0–1 for each predicted subcellular localization type. The prediction results can be exported to widely used file formats, including CSV, Excel, PDF and plain text. More detailed instructions for using the Clarion web server can be found on its help page.
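For users who prepare input files locally, the following sketch reads FASTA-formatted text and applies the 6000 nt length cap before submission. It is a hedged illustration rather than the server's own code: the function names are hypothetical, and keeping the first 6000 nt (rather than some other part of the sequence) is an assumption, since the text only states that sequences are truncated.

```python
from typing import Dict, Iterable

MAX_LEN = 6000  # length cap stated for the web server


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; multi-line sequences are joined."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Fall back to a generated name when the header is empty.
            name = line[1:].strip() or f"seq{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def truncate(records: Dict[str, str], max_len: int = MAX_LEN) -> Dict[str, str]:
    """Keep at most max_len nucleotides from the start of each sequence (assumption)."""
    return {name: seq[:max_len] for name, seq in records.items()}


toy_fasta = ">m1 demo\nACGU\nacgu\n>m2\n" + "A" * 7000
toy_records = read_fasta(toy_fasta.splitlines())
toy_truncated = truncate(toy_records)
print({name: len(seq) for name, seq in toy_truncated.items()})
```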

Conclusion

In this study, we have introduced a novel ensemble model termed Clarion for the simultaneous prediction of nine mRNA subcellular localization compartments: exosome, nucleus, nucleoplasm, chromatin, cytoplasm, nucleolus, cytosol, membrane and ribosome. Specifically, Clarion uses the k-mer nucleotide composition scheme to encode mRNA sequences and employs our proposed strategy, termed weighted series, as the ensemble framework. We selected XGBoost as the base classifier for the weighted series and optimized its weight parameter. The cross-validation and independent tests illustrate the superiority of Clarion for predicting mRNA subcellular localizations: it outperformed several existing methods, including the single-label predictors iLoc-mRNA, mRNALoc and mRNALocator, and the multi-label predictors DM3Loc and Wang’s method. The performance improvement of Clarion can be attributed to two key factors. The first is the collection of more data than the existing methods to construct the benchmark dataset, which facilitates high-quality model fitting. The second is the use of the XGBoost-based weighted series method, which incorporates prior multi-label information into model training and leverages this information to improve the predictive power. Moreover, we analyzed the most important k-mer features for each localization prediction using the model interpretation algorithm SHAP and developed a user-friendly web server for Clarion. Clarion is expected to be a promising tool for multi-label mRNA subcellular localization prediction in the field of bioinformatics.

Nevertheless, the model performance can be further improved, especially for the cytosol and nucleus compartments. Clarion currently considers only nine cell compartments, and it is desirable to expand the number of predictable compartments. In future work, we plan to develop more advanced techniques for the prediction of mRNA subcellular localization. Apart from mRNAs, we will also make efforts to develop methods for predicting the localizations of other types of RNAs, such as microRNAs and long non-coding RNAs. In addition, the co-localization of different types of RNAs will be another direction of our future research.

Key Points

Characterization of mRNA subcellular localization can help elucidate gene regulatory networks and human disease mechanisms.
We proposed a novel ensemble method, termed Clarion, to predict nine subcellular localizations of mRNAs simultaneously.
Clarion achieved significantly better predictive performance in both single-label and multi-label predictions compared to state-of-the-art predictors.
The online webserver and local stand-alone tool of Clarion are publicly available at: http://monash.bioweb.cloud.edu.au/Clarion/.

Supplementary Material

Clarion_Supplementary_bbac467

Author Biographies

Yue Bi is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her interests include bioinformatics, computational biology, sequence analysis and machine learning.

Fuyi Li received his PhD in Bioinformatics from Monash University, Australia. He is currently a professor in the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics, computational biology, machine learning, and data mining.

Xudong Guo received his MEng degree from Ningxia University, China. He is currently a research assistant at the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics and data mining.

Zhikang Wang is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. His interests are bioinformatics, computational pathology, pattern recognition and deep learning.

Tong Pan is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her interests are bioinformatics, protein function analysis, deep learning and pattern recognition.

Yuming Guo is a professor of Global Environmental Health and Biostatistics & Head of the Monash Climate, Air Quality Research (CARE) Unit. His research focuses on environmental epidemiology, biostatistics, global environmental change, air pollution, climate change, urban design, residential environment, remote sensing modelling and infectious disease modelling.

Geoffrey I. Webb is a professor in the Faculty of Information Technology and a research director of the Monash Data Futures Institute at Monash University. His research interests include machine learning, data mining, computational biology and user modelling.

Jianhua Yao is a group leader in Tencent AI Lab, China. He received his PhD degree from Johns Hopkins University. His research interests include bioinformatics, computational biology, medical imaging and pattern recognition.

Cangzhi Jia is a professor in the School of Science, Dalian Maritime University, China. She obtained her PhD degree from the School of Mathematical Sciences, Dalian University of Technology, in 2007. Her major research interests include mathematical modelling in bioinformatics and machine learning.

Jiangning Song is an associate professor and a group leader in the Monash Biomedicine Discovery Institute, Monash University. He is also affiliated with the Monash Data Futures Institute, Monash University. His research interests include bioinformatics, computational biomedicine, machine learning, data mining and pattern recognition.

Contributor Information

Yue Bi, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia.

Fuyi Li, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia; College of Information Engineering, Northwest A&F University, Yangling, 712100, China; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia.

Xudong Guo, College of Information Engineering, Northwest A&F University, Yangling, 712100, China.

Zhikang Wang, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.

Tong Pan, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.

Yuming Guo, Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria 3004, Australia.

Geoffrey I Webb, Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia.

Jianhua Yao, Tencent AI Lab, Shenzhen, China.

Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China.

Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia.

Code and data availability

The source code and datasets of Clarion are publicly available at http://monash.bioweb.cloud.edu.au/Clarion/.

Funding

National Health and Medical Research Council of Australia (NHMRC) (APP1127948, APP1144652), Australian Research Council (ARC) (LP110200333, DP120104460), National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01 AI111965), Major and Seed Inter-Disciplinary Research (IDR) projects awarded by Monash University.

References

1. Jeffery WR, Tomlinson CR, Brodeur RD. Localization of actin messenger RNA during early ascidian development. Dev Biol 1983;99:408–17.
2. Lawrence JB, Singer RH. Intracellular localization of messenger RNAs for cytoskeletal proteins. Cell 1986;45:407–15.
3. Meyer C, Garzia A, Tuschl T. Simultaneous detection of the subcellular localization of RNAs and proteins in cultured cells by combined multicolor RNA-FISH and IF. Methods 2017;118-119:101–10.
4. Chin A, Lécuyer E. RNA localization: Making its way to the center stage. Biochimica et Biophysica Acta (BBA)-General Subjects 2017;1861:2956–70.
5. Kloc M, Zearfoss NR, Etkin LD. Mechanisms of subcellular mRNA localization. Cell 2002;108:533–44.
6. Li X, Franceschi VR, Okita TW. Segregation of storage protein mRNAs on the rough endoplasmic reticulum membranes of rice endosperm cells. Cell 1993;72:869–79.
7. Katz ZB, Wells AL, Park HY, et al. beta-Actin mRNA compartmentalization enhances focal adhesion stability and directs cell migration. Genes Dev 2012;26:1885–90.
8. Kejiou NS, Palazzo AF. mRNA localization as a rheostat to regulate subcellular gene expression. Wiley Interdiscip Rev RNA 2017;8:e1416.
9. Liu D, Li G, Zuo Y. Function determinants of TET proteins: the arrangements of sequence motifs with specific codes. Brief Bioinform 2019;20:1826–35.
10. Cooper TA, Wan L, Dreyfuss G. RNA and disease. Cell 2009;136:777–93.
11. Liu-Yesucevitz L, Bassell GJ, Gitler AD, et al. Local RNA translation at the synapse and in disease. J Neurosci 2011;31:16086–93.
12. Sprenkle NT, Sims SG, Sanchez CL, et al. Endoplasmic reticulum stress and inflammation in the central nervous system. Mol Neurodegener 2017;12:42.
13. Dolezal JM, Dash AP, Prochownik EV. Diagnostic and prognostic implications of ribosomal protein transcript expression patterns in human cancers. BMC Cancer 2018;18:275.
14. Engel KL, Arora A, Goering R, et al. Mechanisms and consequences of subcellular RNA localization across diverse cell types. Traffic 2020;21:404–18.
15. Zhang T, Tan P, Wang L, et al. RNALocate: a resource for RNA subcellular localizations. Nucleic Acids Res 2017;45:D135–8.
16. Mas-Ponte D, Carlevaro-Fita J, Palumbo E, et al. LncATLAS database for subcellular localization of long noncoding RNAs. RNA 2017;23:1080–7.
17. Wen X, Gao L, Guo X, et al. lncSLdb: a resource for long non-coding RNA subcellular localization. Database (Oxford) 2018;2018:1–6.
18. Yan Z, Lecuyer E, Blanchette M. Prediction of mRNA subcellular localization using deep recurrent neural networks. Bioinformatics 2019;35:i333–42.
19. Zhang ZY, Yang YH, Ding H, et al. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Brief Bioinform 2021;22:526–35.
20. Garg A, Singhal N, Kumar R, et al. mRNALoc: a novel machine-learning based in-silico tool to predict mRNA subcellular localization. Nucleic Acids Res 2020;48:W239–43.
21. Tang Q, Nie F, Kang J, et al. mRNALocater: Enhance the prediction accuracy of eukaryotic mRNA subcellular localization by using model fusion strategy. Mol Ther 2021;29:2617–23.
22. Li J, Zhang L, He S, et al. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021;22:bbaa401.
23. Lewis RA, Gagnon JA, Mowry KL. PTB/hnRNP I is required for RNP remodeling during RNA localization in Xenopus oocytes. Mol Cell Biol 2008;28:678–86.
24. Buskila AA, Kannaiah S, Amster-Choder O. RNA localization in bacteria. RNA Biol 2014;11:1051–60.
25. Wang D, Zhang Z, Jiang Y, et al. DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism. Nucleic Acids Res 2021;49:e46.
26. Wang H, Ding Y, Tang J, et al. Identify RNA-associated subcellular localizations based on multi-label learning using Chou's 5-steps rule. BMC Genomics 2021;22:56.
27. Zhang M-L, Zhou Z-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 2014;26:1819–37.
28. Boutell MR, Luo J, Shen X, et al. Learning multi-label scene classification. Pattern Recognition 2004;37:1757–71.
29. Tsoumakas G, Vlahavas I. Random k-labelsets: An ensemble method for multilabel classification. In: European conference on machine learning. 2007, p. 406–17. Springer.
30. Read J, Pfahringer B, Holmes G, et al. Classifier chains for multi-label classification. Machine Learning 2011;85:333–59.
31. Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020;2:56–67.
32. Cui T, Dou Y, Tan P, et al. RNALocate v2.0: an updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic Acids Res 2022;50:D333–9.
33. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9.
34. Chen Z, Liu X, Zhao P, et al. iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Res 2022;50(W1):W434–47.
35. Jiang P, Luo J, Wang Y, et al. kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers. Bioinformatics 2019;35:4871–8.
36. Manavalan B, Basith S, Shin TH, et al. Computational prediction of species-specific yeast DNA replication origin via iterative feature representation. Brief Bioinform 2021;22(4):bbaa304.
37. Yan K, Lv H, Guo Y, et al. TPpred-ATMV: therapeutic peptide prediction by adaptive multi-view tensor learning model. Bioinformatics 2022;38:2712–8.
38. Wei L, Su R, Luan S, et al. Iterative feature representations improve N4-methylcytosine site prediction. Bioinformatics 2019;35:4930–7.
39. Ghamrawi N, McCallum A. Collective multi-label classification. In: Proceedings of the 14th ACM international conference on Information and knowledge management. 2005, p. 195–200.
40. Gopal S, Yang Y. Multilabel classification with meta-level features. In: Proceedings of the 33rd International ACM SIGIR conference on Research and development in information retrieval. 2010, p. 315–22.
41. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 2011;12:2825–30.
42. Wang X, Li F, Xu J, et al. ASPIRER: a new computational approach for identifying non-classical secreted proteins based on deep learning. Brief Bioinform 2022;23(2):bbac031.
43. Bi Y, Xiang D, Ge Z, et al. An interpretable prediction model for identifying N(7)-methylguanosine sites based on XGBoost and SHAP. Mol Ther Nucleic Acids 2020;22:362–72.
44. Li F, Chen J, Ge Z, et al. Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework. Brief Bioinform 2021;22:2126–40.
45. Li F, Guo X, Jin P, et al. Porpoise: a new approach for accurate prediction of RNA pseudouridine sites. Brief Bioinform 2021;22(6):bbab245.
46. Li F, Guo X, Xiang D, et al. Computational analysis and prediction of PE_PGRS proteins using machine learning. Comput Struct Biotechnol J 2022;20:662–74.
47. Kwon S, Abramson T, Munro TP, et al. UUCAC-and vera-dependent localization of VegT RNA in Xenopus oocytes. Curr Biol 2002;12:558–64.
48. Gautreau D, Cote CA, Mowry KL. Two copies of a subelement from the Vg1 RNA localization sequence are sufficient to direct vegetal localization in Xenopus oocytes. Development 1997;124:5013–20.
49. Bubunenko M, Kress TL, Vempati UD, et al. A consensus RNA signal that directs germ layer determinants to the vegetal cortex of Xenopus oocytes. Dev Biol 2002;248:82–92.
