Bayesian hidden Markov models to identify RNA-protein interaction sites in PAR-CLIP. (English) Zbl 1419.62486

Summary: Photoactivatable-ribonucleoside-enhanced cross-linking and immunoprecipitation (PAR-CLIP) is increasingly used for the global mapping of RNA-protein interaction sites. PAR-CLIP experiments have two key features: the sequence read tags are likely to form an enriched peak around each RNA-protein interaction site, and the cross-linking procedure is likely to introduce a specific mutation in each sequence read tag at the interaction site. Several ad hoc methods have been developed to identify RNA-protein interaction sites using either sequence read counts or mutation counts alone; however, rigorous statistical methods for analyzing PAR-CLIP data are still lacking. In this article, we propose an integrative model that establishes a joint distribution of the observed read and mutation counts. To pinpoint the interaction sites at single base-pair resolution, we develop a novel modeling approach that adopts non-homogeneous hidden Markov models to incorporate the nucleotide sequence at each genomic location. Both simulation studies and a data application show that our method outperforms the ad hoc methods and provides reliable inferences for the RNA-protein binding sites from PAR-CLIP data.
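
Illustrative sketch (not the authors' model): the idea of a non-homogeneous hidden Markov model over joint read and mutation counts can be pictured with a toy two-state chain (background versus bound) whose transition probabilities depend on the nucleotide at each position. Everything specific below is an assumption made for illustration only: the Poisson read-count and binomial mutation-count emissions, all parameter values, the choice to favor entry into the bound state at "T" positions, and the function names. The posterior probability of the bound state at each position is obtained by a scaled forward-backward pass.

```python
import math

# Hypothetical parameters, chosen only for illustration.
READ_RATE = (2.0, 20.0)   # assumed Poisson mean read count: (background, bound)
MUT_PROB = (0.01, 0.30)   # assumed per-read mutation probability: (background, bound)
STAY_BOUND = 0.80         # assumed probability of remaining in the bound state
ENTER_AT_T = 0.10         # assumed probability of entering the bound state at a "T"
ENTER_OTHER = 0.01        # assumed probability of entering it at any other base


def log_poisson(k, lam):
    """Log of the Poisson probability mass at k with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_binomial(m, n, p):
    """Log of the binomial probability of m successes in n trials."""
    log_choose = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
    return log_choose + m * math.log(p) + (n - m) * math.log(1.0 - p)


def emission(state, reads, muts):
    """Joint probability of (read count, mutation count) given the hidden state."""
    return math.exp(log_poisson(reads, READ_RATE[state])
                    + log_binomial(muts, reads, MUT_PROB[state]))


def transition(prev_state, base):
    """Transition row (to background, to bound); it depends on the current base."""
    enter = ENTER_AT_T if base == "T" else ENTER_OTHER
    p_bound = STAY_BOUND if prev_state == 1 else enter
    return (1.0 - p_bound, p_bound)


def posterior_bound(seq, reads, muts, init=(0.95, 0.05)):
    """Posterior probability of the bound state at each position."""
    n = len(seq)
    states = (0, 1)
    # Forward pass, renormalized at every step to avoid underflow.
    fwd = []
    for t in range(n):
        e = [emission(s, reads[t], muts[t]) for s in states]
        if t == 0:
            a = [init[s] * e[s] for s in states]
        else:
            a = [sum(fwd[-1][r] * transition(r, seq[t])[s] for r in states) * e[s]
                 for s in states]
        z = sum(a)
        fwd.append([x / z for x in a])
    # Backward pass, also renormalized at every step.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        e = [emission(s, reads[t + 1], muts[t + 1]) for s in states]
        b = [sum(transition(r, seq[t + 1])[s] * e[s] * bwd[t + 1][s] for s in states)
             for r in states]
        z = sum(b)
        bwd[t] = [x / z for x in b]
    # Combine both passes into per-position posteriors.
    post = []
    for t in range(n):
        w = [fwd[t][s] * bwd[t][s] for s in states]
        post.append(w[1] / sum(w))
    return post


if __name__ == "__main__":
    # Made-up toy data: a read peak with mutations around the two "T" positions.
    seq = "ACGTTGCA"
    reads = [1, 2, 3, 22, 25, 18, 2, 1]
    muts = [0, 0, 0, 7, 9, 0, 0, 0]
    for base, p in zip(seq, posterior_bound(seq, reads, muts)):
        print(base, round(p, 3))
```

In this toy setup both data sources enter a single emission term, so a position needs neither a read peak nor a mutation excess alone to be called; the sequence-dependent transition row is what makes the chain non-homogeneous.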

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62M05 Markov processes: estimation; hidden Markov models
62F15 Bayesian inference
92D20 Protein sequences, DNA sequences

Software:

CLIPZ; ChIPDiff; DESeq; HPeak

References:

[1] Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology 11, R106.
[2] Ascano, M., Jr., Mukherjee, N., Bandaru, P., Miller, J. B., Nusbaum, J. D., Corcoran, D. L., et al. (2012). FMRP targets distinct mRNA sequence elements to regulate protein expression. Nature 492, 382-386.
[3] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B 57, 289-300. · Zbl 0809.62014
[4] Corcoran, D., Georgiev, S., Mukherjee, N., Gottwein, E., Skalsky, R., Keene, J., et al. (2011). PARalyzer: Definition of RNA binding sites from PAR‐CLIP short‐read sequence data. Genome Biology 12, R79.
[5] Crozat, A., Aman, P., Mandahl, N., and Ron, D. (1993). Fusion of CHOP to a novel RNA‐binding protein in human myxoid liposarcoma. Nature 363, 640-644.
[6] Gelfond, J. A. L., Gupta, M., and Ibrahim, J. G. (2009). A Bayesian hidden Markov model for motif discovery through joint modeling of genomic sequence and ChIP‐chip data. Biometrics 65, 1087-1095. · Zbl 1180.62169
[7] Gelman, A. and Rubin, D. (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7, 457-551. · Zbl 1386.65060
[8] Gelman, A., Meng, X. L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica 6, 733-760. · Zbl 0859.62028
[9] Guha, S., Li, Y., and Neuberg, D. (2008). Bayesian hidden Markov modeling of array CGH data. Journal of the American Statistical Association 103, 485-497. · Zbl 1469.62368
[10] Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., et al. (2010). Transcriptome‐wide identification of RNA‐binding protein and microRNA target sites by PAR‐CLIP. Cell 141, 129-141.
[11] Hall, D. B. (2000). Zero‐inflated Poisson and binomial regression with random effects: A case study. Biometrics 56, 1030-1039. · Zbl 1060.62535
[12] Han, T. W., Kato, M., Xie, S., Wu, L. C., Mirzaei, H., Pei, J., et al. (2012). Cell‐free formation of RNA granules: Bound RNAs identify features and components of cellular assemblies. Cell 149, 768-779.
[13] Hoell, J. I., Larsson, E., Runge, S., Nusbaum, J. D., Duggimpudi, S., Farazi, T. A., et al. (2011). RNA targets of wild‐type and mutant FET family proteins. Nature Structural & Molecular Biology 18, 1428-1431.
[14] Jaskiewicz, L., Bilen, B., Hausser, J., and Zavolan, M. (2012). Argonaute CLIP: A method to identify in vivo targets of miRNAs. Methods 58, 106-112.
[15] Keles, S. (2007). Mixture modeling for genome‐wide localization of transcription factors. Biometrics 63, 10-21. · Zbl 1206.62170
[16] Khorshid, M., Rodak, C., and Zavolan, M. (2011). CLIPZ: A database and analysis environment for experimentally determined binding sites of RNA‐binding proteins. Nucleic Acids Research 39, D245-D252.
[17] Kishore, S., Jaskiewicz, L., Burger, L., Hausser, J., Khorshid, M., and Zavolan, M. (2011). A quantitative analysis of CLIP methods for identifying binding sites of RNA‐binding proteins. Nature Methods 8, 559-564.
[18] Licatalosi, D. D. and Darnell, R. B. (2010). Applications of next‐generation sequencing RNA processing and its regulation: Global insights into biological networks. Nature Reviews Genetics 11, 75-87.
[19] Licatalosi, D. D., Mele, A., Fak, J. J., Ule, J., Kayikci, M., Chi, S. W., et al. (2008). HITS‐CLIP yields genome‐wide insights into brain alternative RNA processing. Nature 456, 464-469.
[20] Mo, Q. (2011). A fully Bayesian hidden Ising model for ChIP‐seq data analysis. Biostatistics 13, 113-128. · Zbl 1241.62162
[21] Mo, Q. and Liang, F. (2010). Bayesian modeling of ChIP‐chip data through a high‐order Ising model. Biometrics 66, 1284-1294. · Zbl 1208.62173
[22] Neumann, M., Bentmann, E., Dormann, D., Jawaid, A., DeJesus‐Hernandez, M., Ansorge, O., et al. (2011). FET proteins TAF15 and EWS are selective markers that distinguish FTLD with FUS pathology from amyotrophic lateral sclerosis with FUS mutations. Brain 134, 2595-2609.
[23] Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5, 155-176. · Zbl 1096.62124
[24] Powers, C. A., Mathur, M., Raaka, B. M., Ron, D., and Samuels, H. H. (1998). TLS (translocated‐in‐liposarcoma) is a high‐affinity interactor for steroid, thyroid hormone, and retinoid receptors. Molecular Endocrinology 12, 4-18.
[25] Qin, Z. S., Yu, J., Shen, J., Maher, C. A., Hu, M., Kalyana‐Sundaram, S., et al. (2010). HPeak: An HMM‐based algorithm for defining read‐enriched regions in ChIP‐seq data. BMC Bioinformatics 11, 369.
[26] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257-286.
[27] Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association 97, 337-351. · Zbl 1073.65503
[28] Sharp, P. A. (2009). The centrality of RNA. Cell 136, 577-580.
[29] Sievers, C., Schlumpf, T., Sawarkar, R., Comoglio, F., and Paro, R. (2012). Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR‐CLIP data. Nucleic Acids Research 40, e160.
[30] Uniacke, J., Holterman, C. E., Lachance, G., Franovic, A., Jacob, M. D., Fabian, M. R., et al. (2012). An oxygen‐regulated switch in the protein synthesis machinery. Nature 486, 126-129.
[31] Uren, P. J., Bahrami‐Samani, E., Burns, S. C., Qiao, M., Karginov, F. V., Hodges, E., et al. (2012). Site identification in high‐throughput RNA-protein interaction data. Bioinformatics 28, 3013-3020.
[32] Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13, 260-269. · Zbl 0148.40501
[33] Wang, Z. (2011). One mixed negative binomial distribution with application. Journal of Statistical Planning and Inference 141, 1153-1160. · Zbl 1206.62014
[34] Wang, X., Arai, S., Song, X., Reichart, D., Du, K., Pascual, G., et al. (2008). Induced ncRNAs allosterically modify RNA‐binding proteins in cis to inhibit transcription. Nature 454, 126-130.
[35] Weinberg, C. R. and Gladen, B. C. (1986). The beta‐geometric distribution applied to comparative fecundability studies. Biometrics 42, 547-560.
[36] Wen, J., Parker, B. J., Jacobsen, A., and Krogh, A. (2011). MicroRNA transfection and AGO‐bound CLIP‐seq data sets reveal distinct determinants of miRNA action. RNA 17, 820-834.
[37] Xie, Y., Pan, W., Jeong, K. S., Xiao, G., and Khodursky, A. B. (2010). A Bayesian approach to joint modeling of protein-DNA binding, gene expression and sequence data. Statistics in Medicine 29, 489-503.
[38] Xu, H., Wei, C. L., Lin, F., and Sung, W. K. (2008). An HMM approach to genome‐wide identification of differential histone modification sites from ChIP‐seq data. Bioinformatics 24, 2344-2349.
[39] Zagordi, O., Klein, R., Daumer, M., and Beerenwinkel, N. (2010). Error correction of next‐generation sequencing data and reliable estimation of HIV quasispecies. Nucleic Acids Research 38, 7400-7409.
[40] Zhang, C. L. and Darnell, R. B. (2011). Mapping in vivo protein-RNA interactions at single‐nucleotide resolution from HITS‐CLIP data. Nature Biotechnology 29, 607-614.
[41] Zhang, C., Frias, M. A., Mele, A., Ruggiu, M., Eom, T., Marney, C. B., et al. (2010). Integrative modeling defines the Nova splicing‐regulatory network and its combinatorial controls. Science 329, 439-443.