
HmmSeq: a hidden Markov model for detecting differentially expressed genes from RNA-seq data. (English) Zbl 1454.62323

Summary: We introduce hmmSeq, a model-based hierarchical Bayesian technique for detecting differentially expressed genes from RNA-seq data. Our novel hmmSeq methodology uses hidden Markov models to account for potential co-expression of neighboring genes. In addition, hmmSeq employs an integrated approach to studies with technical or biological replicates, automatically adjusting for any extra-Poisson variability. Moreover, for cases when paired data are available, hmmSeq includes a paired structure between treatments that incorporates subject-specific effects. To perform parameter estimation for the hmmSeq model, we develop an efficient Markov chain Monte Carlo algorithm. Further, we develop a procedure for detection of differentially expressed genes that automatically controls the false discovery rate. A simulation study shows that the hmmSeq methodology performs better than competitors in terms of receiver operating characteristic curves. Finally, the analyses of three publicly available RNA-seq data sets demonstrate the power and flexibility of the hmmSeq methodology. An R package implementing the hmmSeq framework will be submitted to CRAN upon publication of the manuscript.
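
The summary does not spell out the detection rule. As a rough illustration only, one common Bayesian approach (in the spirit of Newton et al. [33] and Müller et al. [32]) ranks genes by their posterior probability of differential expression and flags the longest prefix of that ranking whose average posterior probability of a false call stays at or below a target level. The function name, its inputs, and the 0.05 level below are hypothetical and are not taken from hmmSeq; the posterior probabilities would in practice come from MCMC output.

```python
import numpy as np


def bayesian_fdr_select(post_prob_de, alpha=0.05):
    """Flag the largest set of genes whose estimated Bayesian FDR is <= alpha.

    post_prob_de: posterior probabilities that each gene is differentially
    expressed (hypothetical input, e.g. averaged MCMC state indicators).
    """
    p = np.asarray(post_prob_de, dtype=float)
    # Rank genes from most to least likely to be differentially expressed.
    order = np.argsort(-p, kind="stable")
    # For a flagged gene, 1 - p is the posterior probability of a false call.
    false_call_prob = 1.0 - p[order]
    # Estimated FDR of flagging the top k genes, for k = 1..n.
    # This running mean is nondecreasing because false_call_prob is sorted.
    running_fdr = np.cumsum(false_call_prob) / np.arange(1, p.size + 1)
    n_selected = int(np.sum(running_fdr <= alpha))
    selected = np.zeros(p.size, dtype=bool)
    selected[order[:n_selected]] = True
    return selected


# Hypothetical posterior probabilities for five genes.
example_flags = bayesian_fdr_select([0.99, 0.20, 0.97, 0.60, 0.95], alpha=0.05)
print(example_flags)
```

Because the threshold is applied to the running average rather than to each gene separately, a gene with a posterior probability slightly below 1 - alpha can still be flagged when the genes ranked above it are very confidently called.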

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference
62M05 Markov processes: estimation; hidden Markov models

References:

[1] Agrawal, R. and Gomez-Pinilla, F. (2012). ‘Metabolic syndrome’ in the brain: Deficiency in omega-3 fatty acid exacerbates dysfunctions in insulin receptor signalling and cognition. J. Physiol. 590 2485-2499.
[2] Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K. and Watson, J. D. (1994). Molecular Biology of the Cell. Garland Science, New York.
[3] Auer, P. L. and Doerge, R. W. (2011). A two-stage Poisson model for testing RNA-Seq data. Stat. Appl. Genet. Mol. Biol. 10 Art. 26, 28. · Zbl 1296.92139 · doi:10.2202/1544-6115.1627
[4] Auer, P. L., Srivastava, S. and Doerge, R. W. (2012). Differential expression-the next generation and beyond. Brief. Funct. Genomics 11 57-62.
[5] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289-300. · Zbl 0809.62014
[6] Blekhman, R., Marioni, J. C., Zumbo, P., Stephens, M. and Gilad, Y. (2010). Sex-specific and lineage-specific alternative splicing in primates. Genome Res. 20 180-189.
[7] Bullard, J. H., Purdom, E., Hansen, K. D. and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 11 94.
[8] Caron, H., van Schaik, B., van der Mee, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M.-C., van Asperen, R., Boon, K., Voute, P. A. et al. (2001). The human transcriptome map: Clustering of highly expressed genes in chromosomal domains. Science 291 1289-1292.
[9] Chib, S. and Greenberg, E. (1994). Bayes inference in regression models with ARMA\((p,q)\) errors. J. Econometrics 64 183-206. · Zbl 0807.62065 · doi:10.1016/0304-4076(94)90063-9
[10] Cui, S., Guha, S., Ferreira, M. and Tegge, A. N. (2015). Supplement to “hmmSeq: A hidden Markov model for detecting differentially expressed genes from RNA-seq data.” · Zbl 1454.62323
[11] Edelman, L. B. and Fraser, P. (2012). Transcription factories: Genetic programming in three dimensions. Curr. Opin. Genet. Dev. 22 110-114.
[12] Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, New York. · Zbl 1108.62002 · doi:10.1007/978-0-387-35768-3
[13] Gamerman, D. and Lopes, H. F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2nd ed. Chapman & Hall/CRC, Boca Raton, FL. · Zbl 1137.62011
[14] Gogolla, N., Galimberti, I., Deguchi, Y. and Caroni, P. (2009). Wnt signaling mediates experience-related regulation of synapse numbers and mossy fiber connectivities in the adult hippocampus. Neuron 62 510-525.
[15] Guha, S., Li, Y. and Neuberg, D. (2008). Bayesian hidden Markov modeling of array CGH data. J. Amer. Statist. Assoc. 103 485-497. · Zbl 1469.62368 · doi:10.1198/016214507000000923
[16] Hardcastle, T. J. (2009). baySeq: Empirical Bayesian analysis of patterns of differential expression in count data. R package version 1.10.0.
[17] Hardcastle, T. J. and Kelly, K. A. (2010). baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11 422.
[18] Henn, A. D., Wu, S., Qiu, X., Ruda, M., Stover, M., Yang, H., Liu, Z., Welle, S. L., Holden-Wiltse, J., Wu, H. and Zand, M. S. (2013). High-resolution temporal response patterns to influenza vaccine reveal a distinct human plasma cell gene signature. Sci. Rep. 3 2327.
[19] Huang, D. W., Sherman, B. T. and Lempicki, R. A. (2009a). Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37 1-13.
[20] Huang, D. W., Sherman, B. T. and Lempicki, R. A. (2009b). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4 44-57.
[21] Hurst, L. D., Pál, C. and Lercher, M. J. (2004). The evolutionary dynamics of eukaryotic gene order. Nat. Rev. Genet. 5 299-310.
[22] Kalita, A., Gupta, S., Singh, P., Surolia, A. and Banerjee, K. (2013). IGF-1 stimulated upregulation of cyclin D1 is mediated via STAT5 signaling pathway in neuronal cells. IUBMB Life 65 462-471.
[23] Karlebach, G. and Shamir, R. (2008). Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol. 9 770-780.
[24] Kvam, V. M., Liu, P. and Si, Y. (2012). A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am. J. Bot. 99 248-256.
[25] Langmead, B., Hansen, K. D., Leek, J. T. et al. (2010). Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 11 R83.
[26] Lee, J., Ji, Y., Liang, S., Cai, G. and Müller, P. (2011). On differential gene expression using RNA-seq data. Cancer Inform. 10 205-215.
[27] Louhimo, R. and Hautaniemi, S. (2011). CNAmet: An R package for integrating copy number, methylation and expression data. Bioinformatics 27 887-888.
[28] MacDonald, I. L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-Valued Time Series. Monographs on Statistics and Applied Probability 70. Chapman & Hall, London. · Zbl 0868.60036
[29] Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. and Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18 1509-1517.
[30] Mercer, T. R. and Mattick, J. S. (2013). Understanding the regulatory and transcriptional complexity of the genome through structure. Genome Res. 23 1081-1088.
[31] Michalak, P. (2008). Coexpression, coregulation, and cofunctionality of neighboring genes in eukaryotic genomes. Genomics 91 243-248.
[32] Müller, P., Parmigiani, G. and Rice, K. (2007). FDR and Bayesian multiple comparisons rules. In Bayesian Statistics 8 (J. M. Bernardo, S. Bayarri, J. O. Berger, A. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.) 349-370. Oxford Univ. Press, Oxford. · Zbl 1252.62025
[33] Newton, M. A., Noueiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5 155-176. · Zbl 1096.62124 · doi:10.1093/biostatistics/5.2.155
[34] Pe’er, D. and Hacohen, N. (2011). Principles and strategies for developing network models in cancer. Cell 144 864-873.
[35] Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 257-286.
[36] Robertson, S. D., Matthies, H. J., Owens, W. A., Sathananthan, V., Christianson, N. S. B., Kennedy, J. P., Lindsley, C. W., Daws, L. C. and Galli, A. (2010). Insulin reveals Akt signaling as a novel regulator of norepinephrine transporter trafficking and norepinephrine homeostasis. J. Neurosci. 30 11305-11316.
[37] Robinson, M. D., McCarthy, D. J. and Smyth, G. K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26 139-140.
[38] Robinson, M. D. and Smyth, G. K. (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23 2881-2887.
[39] Robinson, M. D. and Smyth, G. K. (2008). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9 321-332. · Zbl 1143.62312 · doi:10.1093/biostatistics/kxm030
[40] Scott, S. L. (2002). Bayesian methods for hidden Markov models. J. Amer. Statist. Assoc. 97 337-351. · Zbl 1073.65503 · doi:10.1198/016214502753479464
[41] Si, Y. and Liu, P. (2013). An optimal test with maximum average power while controlling FDR with application to RNA-Seq data. Biometrics 69 594-605. · Zbl 1418.62066 · doi:10.1111/biom.12036
[42] Singer, G. A., Lloyd, A. T., Huminiecki, L. B. and Wolfe, K. H. (2005). Clusters of co-expressed genes in mammalian genomes are conserved by natural selection. Mol. Biol. Evol. 22 767-775.
[43] Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. R. Stat. Soc. Ser. B. Stat. Methodol. 64 583-639. · Zbl 1067.62010 · doi:10.1111/1467-9868.00353
[44] Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA 100 9440-9445. · Zbl 1130.62385 · doi:10.1073/pnas.1530509100
[45] Tegge, A. N., Caldwell, C. W. and Xu, D. (2012). Pathway correlation profile of gene-gene co-expression for identifying pathway perturbation. PLoS ONE 7 e52127.
[46] Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester. · Zbl 0646.62013
[47] van Arensbergen, J., van Steensel, B. and Bussemaker, H. J. (2014). In search of the determinants of enhancer-promoter interaction specificity. Trends Cell Biol. 24 695-702.
[48] Wilhelm, S. and Manjunath, B. G. (2013). tmvtnorm: Truncated multivariate normal and student t distribution. R package version 1.4-8.
[49] Zeger, S. L. and Karim, M. R. (1991). Generalized linear models with random effects; a Gibbs sampling approach. J. Amer. Statist. Assoc. 86 79-86. · doi:10.1080/01621459.1991.10475006
[50] Zeng, J., Konopka, G., Hunt, B. G., Preuss, T. M., Geschwind, D. and Yi, S. V. (2012). Divergent whole-genome methylation maps of human and chimpanzee brains reveal epigenetic basis of human regulatory evolution. Am. J. Hum. Genet. 91 455-465.
[51] Zhao, S., Fung-Leung, W.-P., Bittner, A., Ngo, K. and Liu, X. (2014). Comparison of RNA-seq and microarray in transcriptome profiling of activated T cells. PLoS ONE 9 e78644.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.