Briefings in Bioinformatics. 2022 Nov 5;23(6):bbac467. doi: 10.1093/bib/bbac467

Clarion is a multi-label problem transformation method for identifying mRNA subcellular localizations

Yue Bi 1,2, Fuyi Li 3,4,5, Xudong Guo 6, Zhikang Wang 7, Tong Pan 8, Yuming Guo 9, Geoffrey I Webb 10, Jianhua Yao 11, Cangzhi Jia 12, Jiangning Song 13,14
PMCID: PMC10148739  PMID: 36341591

Abstract

Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing the superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and achieved accuracies of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at http://monash.bioweb.cloud.edu.au/Clarion/.

Keywords: mRNA, subcellular localization, sequence analysis, machine learning, multi-class classification, multi-label prediction

Introduction

Asymmetric mRNA distribution was first reported in the early 1980s, when actin mRNA was found to be localized in the cytoplasm of ascidian embryos [1]. Later, Lawrence and Singer also observed this phenomenon in chicken fibroblasts via in situ hybridization [2]. In the past decade, an increasing number of studies have suggested that transcript localization is a common and efficient way to target gene products to specific intracellular regions [3–6]. A number of different regulatory processes, including cell polarity, cell motility, embryo development and asymmetric cell division, have been shown to be related to mRNA subcellular localization [7–9]. Abnormal mRNA subcellular localization has been found to be associated with numerous diseases, such as fragile X syndrome, embryonal disorders, Alzheimer’s disease and cancer [10–14]. However, there is a significant gap in the understanding of the underlying mechanisms by which mRNAs are transported within cells. Characterization of mRNA subcellular localizations can help elucidate RNA localization patterns and human disease mechanisms. With the development of high-throughput RNA sequencing technologies, an increasing number of mRNA transcripts have recently been identified. Several popular online databases have been developed to provide annotations of RNA cellular localization data for public use, such as RNALocate [15], lncATLAS [16] and lncSLdb [17]. LncATLAS and lncSLdb mainly store localization information of long non-coding RNAs, while RNALocate collects subcellular localization data for almost all kinds of RNAs. With the advances in data curation, recent years have witnessed a proliferation of computational methods that predict RNA localizations at lower cost and with higher efficiency than wet-lab experimental methods. Such computational methods are an important complement to wet-lab methods for the identification of RNA localizations.

In the present study, we comprehensively surveyed the state-of-the-art computational approaches for mRNA subcellular localization prediction across a wide range of aspects, including their data sources, benchmark datasets, sequence encoding schemes, feature selection methods, machine learning algorithms and performance evaluation strategies, which are listed in Table 1. The mRNA subcellular localization prediction problem can generally be regarded as a multi-class classification task. We categorized the existing computational predictors into two major types according to the machine learning scheme: (1) single-label multi-class predictors and (2) multi-label multi-class predictors. For the first type, there are five predictors that address mRNA subcellular localization prediction as a single-label classification task: RNATracker [18], iLoc-mRNA [19], mRNALoc [20], mRNALocater [21] and SubLocEP [22]. In contrast to single-label multi-class classification, multi-label learning (MLL) has been attracting increasing attention in recent years. This is relevant because real-world objects can often have multiple semantic meanings simultaneously, so that an instance may be associated with multiple labels. For example, in the transcriptome, the mRNA Vg1 formed in the nucleus can be re-modelled in the cytoplasm during vegetal localization [23]; the mRNA bglG is localized in the membrane to form a pre-complex when co-transcribed with its sensor, while bglG is localized in the cell poles when expressed on its own [24]. This widespread multiple-localization phenomenon motivates treating mRNA subcellular localization identification as a multi-label task. Recently, a predictor called DM3Loc has been developed based on the multi-head self-attention mechanism, representing the first multi-label mRNA subcellular localization prediction model [25]. In another recent work, Wang et al. constructed multi-label prediction models with multiple kernel support vector machines for mRNA, lncRNA, miRNA and snoRNA, respectively [26].

Table 1.

Summary of the reviewed predictors for mRNA subcellular localization

Type | Year | Tool | Sources of mRNA subcellular localization/sequence | Subcellular localizations | Benchmark dataset size | Encoding scheme | Feature selection | Algorithm | Evaluation strategy | Web server/GitHub availability | Reference
Single-label | 2019 | RNATracker | CeFra-Seq & APEX-RIP / Ensembl | Cytosol; Endoplasmic reticulum; Insoluble; Membranes; Mitochondrial; Nuclear | 11 373 (Dataset 1); 13 860 (Dataset 2) | One-hot; RNA secondary structure | None | CNN; LSTM; Attention | Tenfold cross-validation | https://www.github.com/HarveyYan/RNATracker | [18]
Single-label | 2020 | iLoc-mRNA | RNALocate / GenBank | Cytoplasm; Cytosol; Dendrite; Endoplasmic reticulum; Exosome; Mitochondrion; Nucleus; Ribosome | 4901 | K-mer | Binomial distribution; ANOVA; IFS | SVM | Fivefold cross-validation | http://lin-group.cn/server/iLoc-mRNA/ | [19]
Single-label | 2020 | mRNALoc | RNALocate | Cytoplasm; Endoplasmic reticulum; Extracellular region; Mitochondria; Nucleus | 14 909 | PseKNC | None | SVM | Fivefold cross-validation; Independent test | http://proteininformatics.org/mkumar/mrnaloc | [20]
Single-label | 2021 | mRNALocater | RNALocate | Cytoplasm; Endoplasmic reticulum; Extracellular; Mitochondria; Nucleus | 14 909 | PseEIIP; PseKNC | Remove collinear features; SFS | CatBoost; XGBoost; LightGBM | Fivefold cross-validation; Independent test | http://bio-bigdata.cn/mRNALocater | [21]
Single-label | 2021 | SubLocEP | RNALocate | Cytoplasm; Endoplasmic reticulum; Extracellular; Mitochondria; Nucleus | 14 909 | PseEIIP; TNC; DNC; CKSNAP; PCPseDNC; PCPseTNC; SCPseDNC; SCPseTNC; DACC | None | LightGBM | Fivefold cross-validation; Independent test | http://lab.malab.cn/~lijing/SubLocEP.html | [22]
Multi-label | 2021 | DM3Loc | RNALocate / GenBank & NCBI | Cytosol; Endoplasmic reticulum; Exosome; Membrane; Nucleus; Ribosome | 17 870 | One-hot | None | CNN; Attention | Fivefold cross-validation; Independent test | http://dm3loc.lin-group.cn/ | [25]
Multi-label | 2021 | Wang’s | RNALocate | Cytoplasm; Cytosol; Endoplasmic reticulum; Exosome; Mitochondrion; Nucleus; Posterior; Pseudopodium; Ribosome | 13 475 | K-mer; RCKmer; NAC; DNC; TNC; CKSNAP | None | SVM | Tenfold cross-validation | http://lbci.tju.edu.cn/Online_services.htm | [26]

Abbreviations: PseKNC, pseudo k-tuple nucleotide composition; PseEIIP, electron-ion interaction pseudopotential values of trinucleotide; RCKmer, reverse complement k-mer; NAC, nucleic acid composition; DNC, dinucleotide composition; TNC, trinucleotide composition; CKSNAP, composition of k-spaced nucleic acid pairs; PCPseDNC, parallel correlation pseudo dinucleotide composition; PCPseTNC, parallel correlation pseudo trinucleotide composition; SCPseDNC, series correlation pseudo dinucleotide composition; SCPseTNC, series correlation pseudo trinucleotide composition; DACC, dinucleotide-based auto-cross covariance; ANOVA, analysis of variance; IFS, incremental feature selection; SFS, sequential forward search.

To deal with MLL tasks, problem transformation algorithms such as binary relevance (BR), label powerset (LP) and classifier chains (CC) [27] are popular strategies, which essentially transform a multi-label problem into one or more single-label tasks. Amongst these, BR is the most widely used problem transformation algorithm and transforms a multi-label problem into one binary problem per label [28]. However, a weakness of BR is that it ignores the correlation between the labels, thereby limiting its utility. In contrast, LP and CC are designed with the awareness of correlations between different labels. The basic idea of LP is that each unique combination of labels present in a multi-label training set is regarded as one class of a new single-label multi-class task [29]. CC generates a chain of binary classifiers, one for each label. Each subsequent binary classifier in the chain is augmented by the predictions of all preceding binary classifiers in the chain [30]. However, these two methods also have certain limitations. For LP, infrequent label combinations lead to sample imbalance, and LP cannot predict label combinations that do not appear in the training set. In the case of CC, the order of the input labels can affect the model quality and prediction performance [27]: if the first model of the chain predicts inaccurately, the error may propagate along the chain. To overcome this problem, Read et al. [30] proposed the ensembles of classifier chains (ECC) method. The core idea of ECC is to average the predictions of CC models over a group of random chain orderings; however, this increases the computational and time cost.
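The three transformations described above can be made concrete by looking at what each one does to the training targets. The following is a minimal sketch in plain Python, not any tool's implementation; the function names and the toy label matrix are hypothetical. For CC, the sketch builds the training inputs from the true labels of earlier chain positions, which is the usual training convention, whereas at prediction time the earlier classifiers' predictions are used instead.

```python
from typing import Dict, List, Sequence, Tuple

Labels = List[List[int]]  # n samples x q binary labels


def binary_relevance_targets(Y: Labels) -> List[List[int]]:
    """BR: one independent binary target vector per label (one per column)."""
    return [list(column) for column in zip(*Y)]


def label_powerset_targets(Y: Labels) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """LP: every distinct label combination becomes one class of a multi-class task."""
    classes: Dict[Tuple[int, ...], int] = {}
    targets = []
    for row in Y:
        key = tuple(row)
        if key not in classes:
            classes[key] = len(classes)  # unseen combination -> new class id
        targets.append(classes[key])
    return targets, classes


def classifier_chain_inputs(X: Sequence[Sequence[float]], Y: Labels, order: Sequence[int]):
    """CC: the classifier at chain position p sees the features plus the
    labels of all earlier positions in `order` (true labels during training)."""
    chain = []
    for position, label in enumerate(order):
        previous = order[:position]
        inputs = [list(x) + [y[k] for k in previous] for x, y in zip(X, Y)]
        targets = [y[label] for y in Y]
        chain.append((label, inputs, targets))
    return chain


# Hypothetical toy data: 3 samples, 1 feature, 3 labels.
X_toy = [[0.5], [0.1], [0.9]]
Y_toy = [[1, 0, 1], [0, 1, 1], [1, 0, 1]]
print(binary_relevance_targets(Y_toy))
print(label_powerset_targets(Y_toy)[0])
```

The LP output shows why rare combinations are a problem: each one becomes its own class with very few samples, and a combination absent from training has no class id at all.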

Herein, we introduced a novel computational method termed Clarion (subcellular localization predictor), which is capable of identifying multiple subcellular localizations of mRNAs simultaneously. Firstly, we established a multi-label benchmark dataset extracted from RNALocate, consisting of nine different compartments of mRNA subcellular localizations. The k-mer nucleotide composition scheme was used to encode mRNA sequences. Next, we applied the weighted series approach as the ensemble framework of Clarion, which is a problem transformation method proposed to tackle multi-label tasks. The weighted series algorithm incorporates the prior information of the labels during model training to improve the prediction performance. Then, after comparing the performance of several machine learning algorithms, we selected XGBoost as the base classifier of Clarion. We optimized the weight of the weighted series and the key parameters of XGBoost through 10-fold cross-validation. Additional independent tests illustrated that Clarion outperformed the existing state-of-the-art tools for identifying mRNA subcellular localizations. In addition, we employed the SHAP (SHapley Additive exPlanations) algorithm [31] to identify and interpret the k-mer features that made the most important contributions to the model predictions for each type of mRNA subcellular localization.

Material and methods

Benchmark dataset

In this study, all subcellular localization annotations and mRNA sequences were collected from the RNALocate database (version 2.0) [32]. The latest version of RNALocate (updated in June 2021) contained more than 210 000 RNA subcellular localization entries, encompassing more than 110 000 RNAs with 171 subcellular localizations across 104 different species. Version 2.0 provides more accurate localization annotations than the first version, facilitating the construction of a reliable benchmark dataset. More specifically, the benchmark dataset was constructed according to the following five major steps:

  • 1) We downloaded all RNA subcellular localization annotation entries from RNALocate (version 2.0) and accordingly collected 84 792 mRNA subcellular localization entries as the initial dataset.

  • 2) According to the statistics of the initial dataset, there were 150 different types of annotated subcellular localizations. However, some had minimal and incomplete entries and as such, we removed those subcellular localization types whose corresponding entry numbers were less than 3000. As a result, we obtained nine types of subcellular localizations with 152 887 unique transcripts including exosome, nucleus, cytosol, chromatin, nucleoplasm, ribosome, nucleolus, cytoplasm and membrane.

  • 3) Next, we redefined the mapping relationships between mRNAs and subcellular localizations based on multiple localizations of mRNAs in the transcriptome. In particular, an mRNA can be labelled with multiple subcellular localizations instead of being only labelled with one subcellular localization.

  • 4) To reduce the effect of sequence redundancy on the performance of the classifier, we applied CD-HIT-EST [33] to remove the redundant sequences with the 80% sequence identity threshold to ensure the similarity between any two nucleotide sequences was less than 80%. Finally, 36 971 mRNAs were obtained and used as the benchmark dataset.

  • 5) We analyzed the distribution of sequence length of these 36 971 mRNAs, which varied from 119 nt to 12 000 nt. In view of the computing complexity and limitation of the feature engineering algorithms, the mRNA lengths were adjusted to no longer than 6000 nt. Specifically, for those mRNAs with more than 6000 nt, the first 3000 nt and the last 3000 nt were extracted and merged.
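Steps 2 and 5 above are simple enough to sketch directly. The snippet below is an illustrative sketch only: the function names, the `(transcript_id, localization)` entry format and the toy inputs are assumptions, not the authors' pipeline. Step 4 (CD-HIT-EST clustering) is an external tool and is not reproduced here.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def frequent_localizations(entries: Iterable[Tuple[str, str]], min_entries: int = 3000) -> Set[str]:
    """Step 2: keep localization types with at least `min_entries` annotation entries.
    `entries` is assumed to be (transcript_id, localization) pairs."""
    counts = Counter(localization for _, localization in entries)
    return {loc for loc, count in counts.items() if count >= min_entries}


def truncate_mrna(seq: str, max_len: int = 6000) -> str:
    """Step 5: sequences longer than `max_len` keep only their first and last
    max_len/2 nucleotides (3000 + 3000 nt for the default), merged together."""
    if len(seq) <= max_len:
        return seq
    half = max_len // 2
    return seq[:half] + seq[-half:]


# Hypothetical 7000-nt sequence: the 1000 central 'C's are dropped.
long_seq = "A" * 3000 + "C" * 1000 + "G" * 3000
print(len(truncate_mrna(long_seq)))
```

Keeping both ends rather than only the 5' end retains part of the 3' region of long transcripts, at the cost of discarding the middle of the sequence.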

Sequence vectorization

mRNA sequences need to be encoded as numeric vectors prior to the training of machine learning models. The $k$-mer nucleotide composition is one of the most widely used sequence encoding methods and has been successfully applied in a variety of bioinformatics studies [34–38]. Given an mRNA sequence $S$ with length $L$ nt, $S = s_1 s_2 \cdots s_L$, where $i = 1, 2, \ldots, L$ and $s_i$ represents the nucleotide at position $i$, which is one of adenine (A), cytosine (C), guanine (G) and uracil/thymine (U/T). Accordingly, when using the $k$-mer ($k = 1, 2, \ldots, 6$) method to encode features, the feature vector can be calculated as:

$$F_k = \left( \frac{N_1}{L-k+1}, \frac{N_2}{L-k+1}, \ldots, \frac{N_{4^k}}{L-k+1} \right)$$

where $N_j$ represents the number of occurrences of the $j$-th $k$-mer type along the sequence. From the equation, we can observe that the $k$-mer vector dimension ($4^k$) increases exponentially with the value of $k$. Such numerous features may contain a great deal of redundancy and noise, which may increase training time and even degrade model quality. In view of the dimension restriction and training effectiveness, we used 1-mer, 2-mer, 3-mer, 4-mer, 5-mer and 6-mer in this study.
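To illustrate the encoding, here is a minimal sketch of k-mer composition for k = 1 to 6 in plain Python. It assumes that each count is normalized by the number of k-length windows in the sequence (L − k + 1 for a sequence without ambiguous bases) and that U is mapped to T; both choices, and the function names, are assumptions of this sketch rather than details confirmed by the text.

```python
from itertools import product
from typing import Dict, Iterable, List

ALPHABET = "ACGT"


def kmer_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Frequency of each of the 4**k possible k-mers along the sequence."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    windows = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing ambiguous bases (e.g. N) are skipped
            counts[kmer] += 1
            windows += 1
    return {kmer: (c / windows if windows else 0.0) for kmer, c in counts.items()}


def encode_sequence(seq: str, ks: Iterable[int] = range(1, 7)) -> List[float]:
    """Concatenate the k-mer frequency vectors for every k in `ks`."""
    features: List[float] = []
    for k in ks:
        features.extend(kmer_frequencies(seq, k).values())
    return features


vector = encode_sequence("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
print(len(vector))  # 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 features
```

The length of the combined vector makes the dimensionality argument concrete: going from k = 6 to k = 7 alone would add 16 384 further features.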

Weighted series

In this study, we proposed a novel problem transformation method named weighted series (WS) to tackle multi-label learning problems, designed specifically for the subcellular localization identification of mRNAs. The weighted series method involves two modules of binary classifiers: a non-label module and a fusion-label module. The non-label module trains its models only on the pre-extracted features, whereas the fusion-label module incorporates prior information about the labels into model training, so that the learned prior label distributions can contribute to the model predictions. The final predictions of the two modules are combined by a user-specified weight w (ranging from 0 to 1), which reflects the relevance between the labels. When there is a strong correlation between the labels, a smaller value of w will improve the prediction performance.

Suppose $\mathcal{X} = \mathbb{R}^d$ represents the $d$-dimensional instance space and $\mathcal{Y} = \{0, 1\}^q$ represents the $q$-dimensional label space. Consider a multi-label dataset $D = \{(x_i, Y_i) \mid 1 \le i \le n\}$ containing $n$ samples, where $x_i \in \mathcal{X}$ and $Y_i = (y_{i1}, \ldots, y_{iq}) \in \mathcal{Y}$. Both the non-label module and the fusion-label module require training $q$ binary classifiers. When training the $j$-th model $g^{\mathrm{non}}_j$ ($1 \le j \le q$) of the non-label module, the training label is $y_{ij}$ and the training input is the feature vector of each of the $n$ training samples, i.e. $x_i$. When training the $j$-th model $g^{\mathrm{fus}}_j$ ($1 \le j \le q$) of the fusion-label module, the prior information on the other labels is added to the training process by combining it with the features $x_i$ as input. That is to say, the label is still $y_{ij}$, whereas the training input is the fusion of $x_i$ and $(y_{i1}, \ldots, y_{i,j-1}, y_{i,j+1}, \ldots, y_{iq})$. After model training, the $2q$ binary classifiers of the weighted series, $g^{\mathrm{non}}_1, \ldots, g^{\mathrm{non}}_q$ and $g^{\mathrm{fus}}_1, \ldots, g^{\mathrm{fus}}_q$, can be used to conduct predictions. The prediction process of the weighted series includes three steps: non-label module prediction, fusion-label module prediction and integration. Given a query mRNA sequence with feature vector $x$, firstly, the non-label models $g^{\mathrm{non}}_j$ are used to conduct the prediction and generate the prediction probabilities $P^{\mathrm{non}}$; $P^{\mathrm{non}}$ is then fused with $x$ as the input to the fusion-label models $g^{\mathrm{fus}}_j$, which output the probabilities $P^{\mathrm{fus}}$; finally, $P^{\mathrm{non}}$ and $P^{\mathrm{fus}}$ are integrated with a user-defined weight $w$ that ranges from 0 to 1. The detailed training and prediction procedures of the weighted series are illustrated in Figure 1 and outlined in Algorithm 1.

Figure 1.


The workflow of the methodology of Clarion.

graphic file with name bbac467fx1.jpg
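The training and prediction flow of the weighted series can be sketched as follows. This is a simplified illustration under several assumptions, not the Clarion implementation: the base learner is a toy centroid-distance classifier standing in for XGBoost so that the example needs no third-party packages; the fusion-label model for a label receives the other q − 1 labels (true labels in training, non-label probabilities in prediction); and the two modules are integrated as w · P_non + (1 − w) · P_fus, which is one reading consistent with the statement that a smaller w helps when labels are strongly correlated. All class and method names are hypothetical.

```python
from typing import List, Sequence


class CentroidClassifier:
    """Toy binary learner (a stand-in for XGBoost): the positive-class score
    is the relative distance of a sample to the two class centroids."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "CentroidClassifier":
        pos = [x for x, t in zip(X, y) if t == 1]
        neg = [x for x, t in zip(X, y) if t == 0]
        self.c1 = [sum(col) / len(pos) for col in zip(*pos)]
        self.c0 = [sum(col) / len(neg) for col in zip(*neg)]
        return self

    def predict_proba(self, x: Sequence[float]) -> float:
        d1 = sum((a - b) ** 2 for a, b in zip(x, self.c1)) ** 0.5
        d0 = sum((a - b) ** 2 for a, b in zip(x, self.c0)) ** 0.5
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


class WeightedSeriesSketch:
    def __init__(self, w: float, make_learner=CentroidClassifier):
        self.w = w  # assumed to weight the non-label module
        self.make_learner = make_learner

    def fit(self, X: List[List[float]], Y: List[List[int]]) -> "WeightedSeriesSketch":
        q = len(Y[0])
        # Non-label module: q binary models trained on the features only.
        self.non_label = [self.make_learner().fit(X, [y[j] for y in Y]) for j in range(q)]
        # Fusion-label module: features fused with the other labels as prior information.
        self.fusion_label = []
        for j in range(q):
            fused = [x + [float(v) for k, v in enumerate(y) if k != j] for x, y in zip(X, Y)]
            self.fusion_label.append(self.make_learner().fit(fused, [y[j] for y in Y]))
        return self

    def predict_proba(self, x: List[float]) -> List[float]:
        p_non = [m.predict_proba(x) for m in self.non_label]
        p_fus = [
            m.predict_proba(x + [p for k, p in enumerate(p_non) if k != j])
            for j, m in enumerate(self.fusion_label)
        ]
        # Integration step (assumed form of the weighting).
        return [self.w * a + (1 - self.w) * b for a, b in zip(p_non, p_fus)]


# Hypothetical toy data: two features, two mutually exclusive labels.
X_toy = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
Y_toy = [[0, 1], [0, 1], [1, 0], [1, 0]]
model = WeightedSeriesSketch(w=0.5).fit(X_toy, Y_toy)
print(model.predict_proba([0.95, 1.0]))
```

Because the fusion-label models consume the non-label outputs at prediction time, the two modules must be evaluated in series, which is where the method's name comes from.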

Performance evaluation metrics

The model evaluation for MLL tasks is more complicated than for binary classification problems because the predictive performance for all labels should be taken into account. In this study, we employed six widely used MLL evaluation metrics [29, 39, 40] to evaluate the performance of Clarion, including example-based accuracy ($Acc_{exam}$), average precision, coverage, one-error, ranking loss and Hamming loss. Let $h$ be the learned multi-label classifier with real-valued scoring function $f(x, y)$, and let $x_i$ ($1 \le i \le n$) be a multi-label instance; $Y_i \subseteq \mathcal{L}$ and $h(x_i)$ represent the true and predicted label sets for the instance, and $\bar{Y}_i$ represents the complementary set of $Y_i$ in the label set $\mathcal{L}$. Accordingly, the above metrics can be formulated as follows:

$$Acc_{exam} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i \cap h(x_i)|}{|Y_i \cup h(x_i)|}$$

$$\text{Average precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{|\{y' \in Y_i : rank(x_i, y') \le rank(x_i, y)\}|}{rank(x_i, y)}$$

$$\text{Coverage} = \frac{1}{n} \sum_{i=1}^{n} \max_{y \in Y_i} rank(x_i, y) - 1$$

$$\text{One-error} = \frac{1}{n} \sum_{i=1}^{n} [\![ \arg\max_{y \in \mathcal{L}} f(x_i, y) \notin Y_i ]\!]$$

$$\text{Ranking loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|Y_i| |\bar{Y}_i|} \left| \{ (y', y'') \in Y_i \times \bar{Y}_i : f(x_i, y') \le f(x_i, y'') \} \right|$$

$$\text{Hamming loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{q} |h(x_i) \, \Delta \, Y_i|$$

where $rank(x_i, y)$ represents the rank of $y$ in $\mathcal{L}$ based on the descending order of $f(x_i, \cdot)$, $|\cdot|$ represents the cardinality of a set while $q$ is the cardinality of $\mathcal{L}$, $\Delta$ stands for the symmetric difference of two sets, and $[\![\cdot]\!]$ counts the number of times that the condition is met.
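As a worked illustration, the six metrics can be computed in plain Python using their standard multi-label definitions. The sketch assumes that predicted label sets are obtained by thresholding the scores at 0.5, that ties in rank are broken by label index, and that every instance has at least one true label (as in the benchmark dataset); the function names and toy inputs are hypothetical.

```python
from typing import Dict, List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank 1 = highest score (descending order); ties broken by label index."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    result = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        result[j] = position
    return result


def multilabel_metrics(Y_true: List[List[int]], scores: List[List[float]],
                       threshold: float = 0.5) -> Dict[str, float]:
    n, q = len(Y_true), len(Y_true[0])
    acc = ap = cov = one_err = rank_loss = hamming = 0.0
    for y, s in zip(Y_true, scores):
        true = {j for j in range(q) if y[j] == 1}
        pred = {j for j in range(q) if s[j] >= threshold}  # assumed decision rule
        false = set(range(q)) - true
        r = ranks(s)
        union = true | pred
        acc += len(true & pred) / len(union) if union else 1.0
        hamming += len(true ^ pred) / q  # symmetric difference
        ap += sum(sum(1 for t in true if r[t] <= r[j]) / r[j] for j in true) / len(true)
        cov += max(r[j] for j in true) - 1
        top = min(range(q), key=lambda j: r[j])  # top-ranked label
        one_err += 0.0 if top in true else 1.0
        if false:
            wrong = sum(1 for t in true for f in false if s[t] <= s[f])
            rank_loss += wrong / (len(true) * len(false))
    return {
        "acc_exam": acc / n, "average_precision": ap / n, "coverage": cov / n,
        "one_error": one_err / n, "ranking_loss": rank_loss / n, "hamming_loss": hamming / n,
    }


# Hypothetical toy example: 2 instances, 3 labels.
Y_toy = [[1, 0, 1], [0, 1, 0]]
S_toy = [[0.9, 0.2, 0.6], [0.7, 0.4, 0.1]]
print(multilabel_metrics(Y_toy, S_toy))
```

For accuracy and average precision higher values are better, while for coverage, one-error, ranking loss and Hamming loss lower values are better; this distinction matters whenever models are ranked across metrics.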

Results and discussion

Statistical analysis of the dataset

The benchmark dataset used in this study contained a total of 36 971 mRNA sequences, each of which might be localized to a single or multiple subcellular compartments. 12 884 mRNAs had only one localization/compartment, 4060 mRNAs had two localizations, 3442 had three localizations, 3165 mRNAs had four localizations, 3518 mRNAs had five localizations, 4258 mRNAs had six localizations, 4079 mRNAs had seven localizations, 1443 mRNAs had eight localizations and 122 mRNAs had nine localizations, as illustrated in Figure 2A. In addition, we also plotted the distribution of the positive and negative samples in the nine compartments, as shown in Figure 2B. Compared with the nucleus, nucleoplasm, chromatin, nucleolus, cytosol, membrane and ribosome, the data of exosome and cytoplasm were unevenly distributed. For the exosome, the number of the positive samples was much larger than that of the negatives, while for cytoplasm, the number of the negative samples was much larger than that of the positives. In addition, there were 31 448 mRNAs in exosome, 21 439 in nucleus, 14 237 in nucleoplasm, 14 328 in chromatin, 4016 in cytoplasm, 11 124 in nucleolus, 16 312 in cytosol, 6739 in membrane and 8680 in ribosome, respectively. The distribution of the label (i.e., subcellular localization) number in these nine compartments is shown in Figure 2C.

Figure 2.


Statistical distributions of mRNA entries in the benchmark dataset curated in this study. (A) The relative percentages of mRNAs with different labels in the benchmark dataset. (B) The distribution of the positive and negative samples in the nine subcellular compartments. (C) The distribution of mRNAs with different labels in the nine compartments.

We noticed that mRNAs in the cytoplasm were mostly single-localized, mRNAs in the nucleoplasm, chromatin, nucleolus, cytosol, membrane and ribosome were mostly multi-localized, while mRNAs in exosome and nucleus appeared to be uniformly localized. In this study, we separated the benchmark dataset into the training_validation dataset and the independent test dataset by random sampling. Accordingly, 33 274 mRNAs were included in the training_validation dataset (90% of the total) and 3697 mRNAs in the independent test set (10% of the total). The former was used to compare the algorithms and optimize the parameters by tenfold cross-validation, while the latter was used to evaluate and validate the model performance. Detailed localization distributions of the training_validation and independent test datasets can be found in Supplementary Table S1.
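The 90/10 random split and the tenfold partition of the training_validation set can be sketched with the standard library alone. The seed, the function names and the use of integer identifiers are assumptions for illustration; the source does not state how the random sampling was implemented or whether it was stratified.

```python
import random
from typing import Iterable, List, Tuple


def split_benchmark(ids: Iterable[int], test_fraction: float = 0.1,
                    seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the identifiers and hold out `test_fraction` as the independent test set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def tenfold(ids: List[int], k: int = 10) -> List[List[int]]:
    """Partition the (already shuffled) identifiers into k disjoint folds."""
    return [ids[i::k] for i in range(k)]


train_val, independent = split_benchmark(range(36971))
folds = tenfold(train_val)
print(len(train_val), len(independent))  # 33274 and 3697, matching the reported sizes
```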

Selection of the base classifier

In this section, we performed a preliminary analysis to select a suitable base classifier. To do this, we evaluated seven popular machine learning algorithms, including k-nearest neighbour (KNN), logistic regression (LR), random forest (RF), LightGBM, XGBoost, CatBoost and multilayer perceptron (MLP) [41, 42], based on the WS framework to determine the optimal base classifier. For the sake of comparison, the weight w of WS was set to one-quarter, two-quarters and three-quarters of its range (0.25, 0.5 and 0.75), respectively. Restricting the comparison to three weights reduces the running time and accelerates the selection of the base classifier. We employed the default parameters of each algorithm in the scikit-learn package and conducted a 10-fold cross-validation test on the training_validation dataset for performance comparison. As can be seen from Figure 3A, for w = 0.25, 0.5 and 0.75, XGBoost secured the best predictive performance among the seven algorithms in terms of accuracy. For w = 0.25, XGBoost achieved a slightly inferior performance to RF in terms of coverage and ranking loss but attained the best performance in terms of the other metrics (refer to Supplementary Table S2 for detailed results). In addition, XGBoost was the best-performing algorithm for w = 0.5 and 0.75 in terms of all the evaluation metrics. Consequently, the XGBoost algorithm was adopted as the base classifier of the weighted series for developing Clarion. The relationship between the model performance and the weight values is discussed in more detail in the following section.

Figure 3.


(A) The example-based accuracies of seven different machine learning algorithms. (B) The performance comparison of the models trained with different weights in terms of Accexam, average precision, coverage, one-error, ranking loss and hamming loss. (C) The accuracies of the binary models in weighted series for predicting each mRNA subcellular localization.

Effect of weight w

In this section, we evaluated the effect of the weight w, which ranges from 0 to 1, in the weighted series structure on the model performance. In particular, we evaluated the model performance on the training data through a tenfold cross-validation test with 19 different candidate values of w ranging from 0.05 to 0.95 with a step size of 0.05. Figure 3B illustrates the model performance with the different candidate weights in terms of all six performance metrics. There is a clear peak or valley in each metric sub-figure, but the corresponding candidate weights are not consistent across metrics. As it is therefore difficult to determine the best weight for Clarion directly, we scored the candidate weights. According to the performance ranking on each evaluation metric, we assigned 5, 4, 3, 2 and 1 points to the top five candidate weights, respectively, and 0 points to the remaining candidates. For instance, one candidate weight was assigned 2 points on Accexam as its accuracy was the fourth highest out of the 19 candidates. Similarly, the same candidate was assigned 5 points for the average precision, 0 points for the coverage, 5 points for the one-error, 0 points for the ranking loss and 4 points for the Hamming loss, respectively. We then obtained an overall score of 16 points for this candidate by summing up all the above scores. Using this procedure, we calculated the overall scores of the other 18 candidate weights; the detailed results and score statistics are provided in Supplementary Table S3 and Supplementary Table S4. The top-scoring candidate reached the best score of 22 points out of the 19 candidates and accordingly, it was adopted as the fixed weight of the weighted series for Clarion.
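The point-based selection of the weight can be sketched as follows. The metric names, the toy results and the tie-handling (ties keep dictionary order) are assumptions of this illustration; the real cross-validation values are in Supplementary Tables S3 and S4 and are not reproduced here.

```python
from typing import Dict, List

POINTS = [5, 4, 3, 2, 1]  # points for ranks 1 to 5; all later ranks receive 0

# True = higher is better, False = lower is better.
METRIC_DIRECTIONS = {
    "acc_exam": True, "average_precision": True, "coverage": False,
    "one_error": False, "ranking_loss": False, "hamming_loss": False,
}


def candidate_weights() -> List[float]:
    """The 19 candidates 0.05, 0.10, ..., 0.95."""
    return [round(0.05 * i, 2) for i in range(1, 20)]


def score_weights(results: Dict[float, Dict[str, float]],
                  directions: Dict[str, bool] = METRIC_DIRECTIONS) -> Dict[float, int]:
    """Sum, over all metrics, the points each candidate earns from its rank."""
    totals = {w: 0 for w in results}
    for metric, higher_is_better in directions.items():
        ranked = sorted(results, key=lambda w: results[w][metric], reverse=higher_is_better)
        for points, w in zip(POINTS, ranked):
            totals[w] += points
    return totals


# Hypothetical toy results for three candidates and two metrics.
toy_results = {
    0.25: {"acc": 0.60, "loss": 0.30},
    0.50: {"acc": 0.70, "loss": 0.20},
    0.75: {"acc": 0.65, "loss": 0.25},
}
toy_totals = score_weights(toy_results, {"acc": True, "loss": False})
best_w = max(toy_totals, key=toy_totals.get)
print(toy_totals, best_w)
```

With six metrics and at most 5 points per metric, the maximum attainable score is 30, which puts the reported best score of 22 in context.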

Performance comparison with other problem transformation strategies

To demonstrate the capacity of the weighted series strategy in dealing with multi-label multi-class mRNA subcellular localization prediction tasks, we used XGBoost with the default parameters as the base classifier to benchmark our proposed method against three other well-known problem transformation methods, BR, CC and LP, in 10-fold cross-validation tests on the training_validation set. Specifically, we ran 10 experiments with 10 groups of randomly generated label orders for CC because the input label order can directly affect the quality of the CC model. Among these 10 experiments, the best result was used for the comparison with BR, LP and WS. From the performance comparison results provided in Table 2, we found that the WS strategy displayed the best performance in terms of Accexam, average precision, coverage, ranking loss and Hamming loss. Although WS achieved a slightly worse one-error than LP, it showed a clear overall superiority in predicting mRNA subcellular localizations. To further improve the model performance, we optimized the key parameters of XGBoost, including learning_rate, n_estimators and max_depth, based on the 10-fold cross-validation test. The specific hyperparameters are provided in Supplementary Table S5. Subsequently, we retrained the final model, named Clarion, on the whole training_validation set using the weighted series strategy with the optimized hyperparameters of XGBoost.

Table 2.

Performance comparison of weighted series with binary relevance, classifier chains and label powerset

Strategy Accexam Average precision Coverage One-error Ranking loss Hamming loss
BR 0.600 0.651 6.121 0.706 0.345 0.194
CC 0.456 0.580 7.384 0.838 0.598 0.299
LP 0.558 0.626 6.271 0.601 0.443 0.229
WS (selected w) 0.627 0.670 6.029 0.629 0.344 0.182

Performance comparison with existing state-of-the-art tools

In this section, Clarion’s performance was benchmarked and compared with several state-of-the-art approaches for predicting mRNA subcellular localizations. Firstly, we compared Clarion with DM3Loc and Wang’s method, the only two multi-label predictors. Clarion has only five overlapping predictable compartments with these two predictors, including cytosol, exosome, membrane (cytoplasm for Wang’s method), nucleus and ribosome. Therefore, we compared their five-label prediction performance via an independent test. Clarion and DM3Loc achieved the Accexam of 0.722/0.441, average precision of 0.769/0.618, coverage of 3.019/3.957, one-error of 0.463/0.869, ranking loss of 0.204/0.533 and hamming loss of 0.150/0.330 on the independent dataset. Similarly, Clarion and Wang’s method achieved Accexam of 0.745/0.281, average precision of 0.767/0.679, coverage of 3.127/4.516, one-error of 0.445/0.882, ranking loss of 0.241/0.716 and hamming loss of 0.146/0.375. The above results indicated that Clarion outperformed DM3Loc and Wang’s method in multi-label prediction tasks.

Afterwards, we also compared Clarion’s single-label prediction performance with that of the other state-of-the-art methods. To facilitate the performance comparison, only the methods with accessible webservers were used for prediction and comparison, including iLoc-mRNA [19], mRNALoc [20], mRNALocater [21], DM3Loc [25] and Wang’s method [26]. In particular, the RNA sequences of the independent set were uploaded to their webservers, which then returned the predicted labels. As shown in Table 3, Clarion clearly outperformed the other methods in predicting cytoplasm, cytosol, exosome, membrane, nucleus and ribosome. We also found that Clarion secured accuracies of over 80% for all compartments except cytosol and nucleus. Notably, the mRNAs of the independent set may appear in the training sets of the other methods, which may account for Clarion’s slightly lower F1 scores than Wang’s method on the prediction of cytoplasm and ribosome (more details can be found in Supplementary Table S6 of the supplementary file). These comparison results further demonstrated Clarion’s superior prediction power on single-label tasks.

Table 3.

Performance comparison between Clarion and other state-of-the-art tools in terms of the prediction accuracy on the independent test dataset

Localization iLoc-mRNA mRNALoc mRNALocater Wang’s DM3Loc Clarion
Chromatin N.A. N.A. N.A. N.A. N.A. 81.47%
Cytoplasm N.A. 54.88% 38.90% 87.10% N.A. 91.29%
Cytosol N.A. N.A. N.A. 67.81% 57.37% 79.77%
Exosome N.A. N.A. N.A. 16.18% 70.00% 92.10%
Membrane N.A. N.A. N.A. N.A. 70.92% 89.15%
Nucleolus N.A. N.A. N.A. N.A. N.A. 83.74%
Nucleoplasm N.A. N.A. N.A. N.A. N.A. 80.74%
Nucleus N.A. 55.18% 57.42% 60.13% 69.52% 79.23%
Ribosome 73.41% N.A. N.A. 81.42% 69.03% 84.74%

N.A.: not applicable

Effect of non-label and fusion-label modules in the weighted series structure

In this section, we further examined the effect of the non-label and fusion-label modules used in the WS framework. A total of 18 binary classifiers were trained in Clarion for predicting nine mRNA subcellular localizations respectively, including nine non-label models and nine fusion-label models. Here, we employed these binary models to predict the RNA sequences in the independent test dataset and accordingly evaluated the performance for each localization. Figure 3C illustrates the predictive performance of non-label models and fusion-label models in terms of accuracy. As a result, we found that the fusion-label models performed better than the non-label counterparts for all nine locations, highlighting the necessity and effectiveness of fusion-label models in the WS framework. However, it is noteworthy that the better performance of the fusion-label models originated from the outputs of non-label models because the fusion-label models used the prediction results of the non-label models as input features for the model training. Therefore, we conclude that the non-label and fusion-label modules are complementary and essential to the WS framework of Clarion.

Model interpretation

Shapley additive explanations (SHAP) is a method based on cooperative game theory for interpreting machine learning models [31]. It uses the Shapley value to rank and evaluate the importance of each feature and to explain the predictions. SHAP has been successfully applied in a variety of bioinformatics tasks [43–46]. In this study, we used the Shapley value to assess which k-mer fragments of mRNAs are important for subcellular localization prediction. With the nine binary models of the non-label module, the Shapley value of each k-mer feature was calculated and ranked using the SHAP Python package (https://shap.readthedocs.io/en/latest/index.html). Figure 4 and Supplementary Figures S1-S3 show the top 15 important k-mer features of Clarion for predicting mRNA subcellular localizations. We found that some features are important for only one localization, such as ‘TTG’ and ‘GGGCGC’ in the nucleus, ‘GACGC’ and ‘GCGGCA’ in chromatin, and ‘GGATCT’ and ‘GGCCG’ in the ribosome. For example, ‘TTG’ was identified as the most important feature for nucleus prediction in terms of the SHAP value but did not appear in the top 15 features for any of the other eight localizations. For ‘TTG’ in Figure 4A, each point represents an instance, and the colour runs from blue to red as its value goes from small to large. When ‘TTG’ took high Shapley values, it pushed the model towards a positive nucleus prediction, and vice versa. Therefore, the effect of ‘TTG’ can be summarized as follows: a larger value promotes the nucleus localization prediction. In contrast, a larger value of ‘GGGCGC’ promotes the non-nucleus localization prediction. These k-mer segments may be part of, or related to, protein recognition motifs for mRNA-specific localization. Interestingly, the repeat motif ‘UUCAC’ has been found to be crucial for localization via binding with Vg1PBP [47–49]; it corresponds to ‘TTCACC’, which was ranked thirteenth in exosome prediction.
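The correspondence between an RNA motif such as ‘UUCAC’ and a DNA-alphabet k-mer feature such as ‘TTCACC’ follows from replacing U with T before k-mers are counted. The sketch below illustrates this with a simple k-mer frequency encoder; normalising counts to frequencies and the helper names `kmer_frequencies` and `contains_motif` are assumptions made for illustration, not a description of Clarion's exact encoding.

```python
from collections import Counter
from typing import Dict


def kmer_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Return the frequency of every overlapping k-mer found in seq.

    The sequence is upper-cased and U is mapped to T so that RNA and
    cDNA inputs yield the same features.
    """
    seq = seq.upper().replace("U", "T")
    total = len(seq) - k + 1
    if k <= 0 or total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


def contains_motif(kmer: str, rna_motif: str) -> bool:
    """Check whether a DNA-alphabet k-mer contains an RNA motif."""
    return rna_motif.upper().replace("U", "T") in kmer.upper()


if __name__ == "__main__":
    print(kmer_frequencies("AUGAUG", 3))
    print(contains_motif("TTCACC", "UUCAC"))
```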

Figure 4.

Feature importance ranking based on the Shapley values. (A) Top 15 features for the nucleus, (B) top 15 features for the chromatin, (C) top 15 features for the ribosome and (D) top 15 features among all nine subcellular localizations.

In addition, certain k-mer features were important for the prediction of several subcellular localizations. For instance, ‘GCGGC’ was ranked second in nucleus prediction, tenth in chromatin prediction and tenth in ribosome prediction, as shown in Figure 4A–C. This can also be observed in Figure 4D, which shows the top 15 features ranked according to the average of the absolute SHAP values, together with the proportion of importance contributed by each mRNA subcellular localization. The feature ‘GCGGC’ shown in Figure 4D is more important for the prediction of the nucleus, ribosome, chromatin and nucleoplasm. Moreover, the feature ‘AAAAAA’ was important for almost all compartments: its larger values drove the positive prediction of cytoplasm, but the negative predictions of exosome, chromatin, nucleoplasm, nucleolus, cytosol and ribosome.
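A ranking by mean absolute SHAP value, with the share contributed by each localization, can be sketched as follows. The SHAP matrices are assumed to have been computed beforehand (one matrix per localization, rows being instances and columns being features); summing the per-localization means to obtain an overall score, the function names and the toy numbers are illustrative assumptions and do not reproduce the values behind Figure 4D.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def mean_abs_shap(shap_values: Matrix) -> List[float]:
    """Average |SHAP| over instances for each feature column."""
    n = len(shap_values)
    n_features = len(shap_values[0])
    return [sum(abs(row[j]) for row in shap_values) / n
            for j in range(n_features)]


def rank_features(per_label_shap: Dict[str, Matrix],
                  feature_names: Sequence[str],
                  top: int = 15) -> List[Tuple[str, float, Dict[str, float]]]:
    """Rank features by total mean |SHAP| and report per-label shares."""
    per_label = {label: mean_abs_shap(m) for label, m in per_label_shap.items()}
    totals = [sum(values[j] for values in per_label.values())
              for j in range(len(feature_names))]
    order = sorted(range(len(feature_names)),
                   key=lambda j: totals[j], reverse=True)[:top]
    ranking = []
    for j in order:
        # Share of this feature's importance attributed to each label.
        shares = {label: (values[j] / totals[j] if totals[j] else 0.0)
                  for label, values in per_label.items()}
        ranking.append((feature_names[j], totals[j], shares))
    return ranking


if __name__ == "__main__":
    # Made-up SHAP values: two instances, two features, two labels.
    toy = {
        "nucleus": [[1.0, -0.5], [-1.0, 0.5]],
        "ribosome": [[0.0, 2.0], [0.0, -2.0]],
    }
    for name, total, shares in rank_features(toy, ["kmer_1", "kmer_2"]):
        print(name, total, shares)
```

Using the absolute value before averaging means that a feature which pushes some instances towards a localization and others away from it still counts as important.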

Webserver implementation

To help the wider research community make subcellular localization predictions, we developed a web server for Clarion that is freely available at http://monash.bioweb.cloud.edu.au/Clarion/. The web server is maintained by Nectar Research Cloud and configured on a Linux server equipped with a 4-core CPU, 8-GB memory and 30-GB hard disk. The web page was implemented using PHP and has been tested on several popular web browsers, including Google Chrome, Microsoft Edge, Internet Explorer, Mozilla Firefox and Safari. Users can either copy and paste query mRNA sequences into the textbox or upload a sequence file in the FASTA format via the file-selection dialogue box. Note that query sequences are truncated so that no mRNA is longer than 6000 nt. All the generated prediction results are saved in a table format containing detailed information regarding the sequences and predicted subcellular localization types. The web server also provides a probability score in the range of 0–1 for each predicted subcellular localization type. The prediction results can be easily exported to widely used file formats, including CSV, Excel, PDF and plain text. More detailed instructions for using the Clarion webserver can be found on the help page of the webserver.
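Users preparing input locally may wish to check how the 6000-nt limit would affect their sequences. The sketch below parses FASTA text and truncates each record; it is not the server's code, and keeping the first 6000 nt (rather than, for example, the 3′ end) is an assumption, as the truncation rule is not specified here. The names `read_fasta`, `truncate` and `MAX_LEN` are hypothetical.

```python
from typing import List, Tuple

MAX_LEN = 6000  # length limit stated for the webserver, in nucleotides


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def truncate(records: List[Tuple[str, str]],
             max_len: int = MAX_LEN) -> List[Tuple[str, str]]:
    """Keep at most max_len leading nucleotides of each sequence (assumed rule)."""
    return [(header, seq[:max_len]) for header, seq in records]


if __name__ == "__main__":
    example = ">seq1\nACGU\nAC\n>seq2\n" + "A" * 7000
    for header, seq in truncate(read_fasta(example)):
        print(header, len(seq))
```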

Conclusion

In this study, we have introduced a novel ensemble model termed Clarion for the simultaneous prediction of nine compartments of mRNA subcellular localizations, including exosome, nucleus, nucleoplasm, chromatin, cytoplasm, nucleolus, cytosol, membrane and ribosome. Specifically, Clarion used the k-mer nucleotide composition scheme to encode mRNA sequences and employed our proposed strategy, namely weighted series, as the ensemble framework. We selected XGBoost as the base classifier for the weighted series and optimized its weight parameter. The cross-validation and independent tests illustrate the superiority of Clarion for predicting mRNA subcellular localizations: it outperformed several existing methods, including the single-label predictors iLoc-mRNA, mRNALoc and mRNALocator, and the multi-label predictors DM3Loc and Wang’s method. The performance improvement of Clarion can be attributed to two key factors: the first is the collection of more data than the existing methods to construct the benchmark dataset, which facilitates high-quality model fitting; the second is the use of the XGBoost-based weighted series method to incorporate multi-label prior information into model training and leverage such information to improve the predictive power. Moreover, we analyzed the most important k-mer features for each localization prediction using the model interpretation algorithm SHAP and developed a user-friendly web server for Clarion. Clarion is expected to be a promising tool for multi-label mRNA subcellular localization prediction in the field of bioinformatics.

Nevertheless, the model performance can be further improved, especially for compartments such as the cytosol and nucleus. Only nine cell compartments were considered by Clarion; it is desirable to expand the number of predictable compartments. In our future work, we plan to develop advanced technologies for the prediction of mRNA subcellular localization. Apart from mRNAs, we will also make efforts to develop methods for predicting localizations of other types of RNAs, such as microRNAs and long non-coding RNAs. In addition, the co-localization between different types of RNAs will be another research direction in future work.

Key Points

  • Characterization of mRNA subcellular localization can help elucidate gene regulatory networks and human disease mechanisms.

  • We proposed a novel ensemble method, termed Clarion, to predict nine subcellular localizations of mRNAs simultaneously.

  • Clarion achieved significantly better predictive performance in both single-label and multi-label predictions compared to state-of-the-art predictors.

  • The online webserver and local stand-alone tool of Clarion are publicly available at: http://monash.bioweb.cloud.edu.au/Clarion/.

Supplementary Material

Clarion_Supplementary_bbac467

Author Biographies

Yue Bi is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her interests include bioinformatics, computational biology, sequence analysis and machine learning.

Fuyi Li received his PhD in Bioinformatics from Monash University, Australia. He is currently a professor in the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics, computational biology, machine learning, and data mining.

Xudong Guo received his MEng degree from Ningxia University, China. He is currently a research assistant at the College of Information Engineering, Northwest A&F University, China. His research interests are bioinformatics and data mining.

Zhikang Wang is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. His interests are bioinformatics, computational pathology, pattern recognition and deep learning.

Tong Pan is currently a PhD student in the Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia. Her interests are bioinformatics, protein function analysis, deep learning and pattern recognition.

Yuming Guo is a professor of Global Environmental Health and Biostatistics & Head of the Monash Climate, Air Quality Research (CARE) Unit. His research focuses on environmental epidemiology, biostatistics, global environmental change, air pollution, climate change, urban design, residential environment, remote sensing modelling and infectious disease modelling.

Geoffrey I. Webb is a professor in the Faculty of Information Technology and a research director of the Monash Data Futures Institute at Monash University. His research interests include machine learning, data mining, computational biology and user modelling.

Jianhua Yao is a group leader in Tencent AI Lab, China. He received his PhD degree from Johns Hopkins University. His research interests include bioinformatics, computational biology, medical imaging and pattern recognition.

Cangzhi Jia is a professor in the School of Science, Dalian Maritime University, China. She obtained her PhD degree from the School of Mathematical Sciences at the Dalian University of Technology in 2007. Her major research interests include mathematical modelling in bioinformatics and machine learning.

Jiangning Song is an associate professor and a group leader in the Monash Biomedicine Discovery Institute, Monash University. He is also affiliated with the Monash Data Futures Institute, Monash University. His research interests include bioinformatics, computational biomedicine, machine learning, data mining and pattern recognition.

Contributor Information

Yue Bi, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia.

Fuyi Li, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia; College of Information Engineering, Northwest A&F University, Yangling, 712100, China; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia.

Xudong Guo, College of Information Engineering, Northwest A&F University, Yangling, 712100, China.

Zhikang Wang, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.

Tong Pan, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia.

Yuming Guo, Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria 3004, Australia.

Geoffrey I Webb, Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia.

Jianhua Yao, Tencent AI Lab, Shenzhen, China.

Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China.

Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia.

Code and data availability

The source code and datasets of Clarion are publicly available at http://monash.bioweb.cloud.edu.au/Clarion/.

Funding

National Health and Medical Research Council of Australia (NHMRC) (APP1127948, APP1144652), Australian Research Council (ARC) (LP110200333, DP120104460), National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01 AI111965), Major and Seed Inter-Disciplinary Research (IDR) projects awarded by Monash University.

References

  • 1. Jeffery WR, Tomlinson CR, Brodeur RD. Localization of actin messenger RNA during early ascidian development. Dev Biol 1983;99:408–17.
  • 2. Lawrence JB, Singer RH. Intracellular localization of messenger RNAs for cytoskeletal proteins. Cell 1986;45:407–15.
  • 3. Meyer C, Garzia A, Tuschl T. Simultaneous detection of the subcellular localization of RNAs and proteins in cultured cells by combined multicolor RNA-FISH and IF. Methods 2017;118-119:101–10.
  • 4. Chin A, Lécuyer E. RNA localization: Making its way to the center stage. Biochimica et Biophysica Acta (BBA)-General Subjects 2017;1861:2956–70.
  • 5. Kloc M, Zearfoss NR, Etkin LD. Mechanisms of subcellular mRNA localization. Cell 2002;108:533–44.
  • 6. Li X, Franceschi VR, Okita TW. Segregation of storage protein mRNAs on the rough endoplasmic reticulum membranes of rice endosperm cells. Cell 1993;72:869–79.
  • 7. Katz ZB, Wells AL, Park HY, et al. beta-Actin mRNA compartmentalization enhances focal adhesion stability and directs cell migration. Genes Dev 2012;26:1885–90.
  • 8. Kejiou NS, Palazzo AF. mRNA localization as a rheostat to regulate subcellular gene expression. Wiley Interdiscip Rev RNA 2017;8:e1416.
  • 9. Liu D, Li G, Zuo Y. Function determinants of TET proteins: the arrangements of sequence motifs with specific codes. Brief Bioinform 2019;20:1826–35.
  • 10. Cooper TA, Wan L, Dreyfuss G. RNA and disease. Cell 2009;136:777–93.
  • 11. Liu-Yesucevitz L, Bassell GJ, Gitler AD, et al. Local RNA translation at the synapse and in disease. J Neurosci 2011;31:16086–93.
  • 12. Sprenkle NT, Sims SG, Sanchez CL, et al. Endoplasmic reticulum stress and inflammation in the central nervous system. Mol Neurodegener 2017;12:42.
  • 13. Dolezal JM, Dash AP, Prochownik EV. Diagnostic and prognostic implications of ribosomal protein transcript expression patterns in human cancers. BMC Cancer 2018;18:275.
  • 14. Engel KL, Arora A, Goering R, et al. Mechanisms and consequences of subcellular RNA localization across diverse cell types. Traffic 2020;21:404–18.
  • 15. Zhang T, Tan P, Wang L, et al. RNALocate: a resource for RNA subcellular localizations. Nucleic Acids Res 2017;45:D135–8.
  • 16. Mas-Ponte D, Carlevaro-Fita J, Palumbo E, et al. LncATLAS database for subcellular localization of long noncoding RNAs. RNA 2017;23:1080–7.
  • 17. Wen X, Gao L, Guo X, et al. lncSLdb: a resource for long non-coding RNA subcellular localization. Database (Oxford) 2018;2018:1–6.
  • 18. Yan Z, Lecuyer E, Blanchette M. Prediction of mRNA subcellular localization using deep recurrent neural networks. Bioinformatics 2019;35:i333–42.
  • 19. Zhang ZY, Yang YH, Ding H, et al. Design powerful predictor for mRNA subcellular location prediction in Homo sapiens. Brief Bioinform 2021;22:526–35.
  • 20. Garg A, Singhal N, Kumar R, et al. mRNALoc: a novel machine-learning based in-silico tool to predict mRNA subcellular localization. Nucleic Acids Res 2020;48:W239–43.
  • 21. Tang Q, Nie F, Kang J, et al. mRNALocater: Enhance the prediction accuracy of eukaryotic mRNA subcellular localization by using model fusion strategy. Mol Ther 2021;29:2617–23.
  • 22. Li J, Zhang L, He S, et al. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021;22:bbaa401.
  • 23. Lewis RA, Gagnon JA, Mowry KL. PTB/hnRNP I is required for RNP remodeling during RNA localization in Xenopus oocytes. Mol Cell Biol 2008;28:678–86.
  • 24. Buskila AA, Kannaiah S, Amster-Choder O. RNA localization in bacteria. RNA Biol 2014;11:1051–60.
  • 25. Wang D, Zhang Z, Jiang Y, et al. DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism. Nucleic Acids Res 2021;49:e46.
  • 26. Wang H, Ding Y, Tang J, et al. Identify RNA-associated subcellular localizations based on multi-label learning using Chou's 5-steps rule. BMC Genomics 2021;22:56.
  • 27. Zhang M-L, Zhou Z-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 2014;26:1819–37.
  • 28. Boutell MR, Luo J, Shen X, et al. Learning multi-label scene classification. Pattern Recognition 2004;37:1757–71.
  • 29. Tsoumakas G, Vlahavas I. Random k-labelsets: An ensemble method for multilabel classification. In: European conference on machine learning. 2007, p. 406–17. Springer.
  • 30. Read J, Pfahringer B, Holmes G, et al. Classifier chains for multi-label classification. Machine Learning 2011;85:333–59.
  • 31. Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020;2:56–67.
  • 32. Cui T, Dou Y, Tan P, et al. RNALocate v2.0: an updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic Acids Res 2022;50:D333–9.
  • 33. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9.
  • 34. Chen Z, Liu X, Zhao P, et al. iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Res 2022;50(W1):W434–47.
  • 35. Jiang P, Luo J, Wang Y, et al. kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers. Bioinformatics 2019;35:4871–8.
  • 36. Manavalan B, Basith S, Shin TH, et al. Computational prediction of species-specific yeast DNA replication origin via iterative feature representation. Brief Bioinform 2021;22(4):bbaa304.
  • 37. Yan K, Lv H, Guo Y, et al. TPpred-ATMV: therapeutic peptide prediction by adaptive multi-view tensor learning model. Bioinformatics 2022;38:2712–8.
  • 38. Wei L, Su R, Luan S, et al. Iterative feature representations improve N4-methylcytosine site prediction. Bioinformatics 2019;35:4930–7.
  • 39. Ghamrawi N, McCallum A. Collective multi-label classification. In: Proceedings of the 14th ACM international conference on Information and knowledge management. 2005, p. 195–200.
  • 40. Gopal S, Yang Y. Multilabel classification with meta-level features. In: Proceedings of the 33rd International ACM SIGIR conference on Research and development in information retrieval. 2010, p. 315–22.
  • 41. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 2011;12:2825–30.
  • 42. Wang X, Li F, Xu J, et al. ASPIRER: a new computational approach for identifying non-classical secreted proteins based on deep learning. Brief Bioinform 2022;23(2):bbac031.
  • 43. Bi Y, Xiang D, Ge Z, et al. An interpretable prediction model for identifying N(7)-methylguanosine sites based on XGBoost and SHAP. Mol Ther Nucleic Acids 2020;22:362–72.
  • 44. Li F, Chen J, Ge Z, et al. Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework. Brief Bioinform 2021;22:2126–40.
  • 45. Li F, Guo X, Jin P, et al. Porpoise: a new approach for accurate prediction of RNA pseudouridine sites. Brief Bioinform 2021;22(6):bbab245.
  • 46. Li F, Guo X, Xiang D, et al. Computational analysis and prediction of PE_PGRS proteins using machine learning. Comput Struct Biotechnol J 2022;20:662–74.
  • 47. Kwon S, Abramson T, Munro TP, et al. UUCAC-and vera-dependent localization of VegT RNA in Xenopus oocytes. Curr Biol 2002;12:558–64.
  • 48. Gautreau D, Cote CA, Mowry KL. Two copies of a subelement from the Vg1 RNA localization sequence are sufficient to direct vegetal localization in Xenopus oocytes. Development 1997;124:5013–20.
  • 49. Bubunenko M, Kress TL, Vempati UD, et al. A consensus RNA signal that directs germ layer determinants to the vegetal cortex of Xenopus oocytes. Dev Biol 2002;248:82–92.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
