Multiscale Poisson process approaches for detecting and estimating differences from high-throughput sequencing assays. (English) Zbl 07923906

Summary: Estimating and testing for differences in molecular phenotypes (e.g., gene expression, chromatin accessibility, transcription factor binding) across conditions is an important part of understanding the molecular basis of gene regulation. These phenotypes are commonly measured using high-throughput sequencing assays (e.g., RNA-seq, ATAC-seq, ChIP-seq), which provide high-resolution count data that reflect how the phenotypes vary along the genome. Multiple methods have been proposed to help exploit these high-resolution measurements for differential expression analysis. However, they ignore the count nature of the data, instead using normal distributions that work well only for data with large sample sizes or high counts. Here we develop count-based methods to address this problem. We model the data for each sample using an inhomogeneous Poisson process with a spatially structured underlying intensity function and then, building on multiscale models for the Poisson process, estimate and test for differences in the underlying intensity function across samples (or groups of samples). Using both simulation and real ATAC-seq data, we show that our method outperforms previous normal-based methods, especially in situations with small sample sizes or low counts.
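
The multiscale view of Poisson counts mentioned in the summary can be illustrated with a minimal sketch. It rests on a standard fact rather than on anything specific to the paper: if two adjacent bins hold independent Poisson counts with intensities λ_left and λ_right, then, given their sum n, the left count is Binomial(n, p) with p = λ_left / (λ_left + λ_right). Applying this recursively turns a count vector into one total plus a tree of (left count, parent count) pairs. The function names, the power-of-two length requirement, and the pseudocount smoothing are assumptions made for illustration; this is not the authors' implementation, and a real method would replace the pseudocount with a proper shrinkage estimator.

```python
def multiscale_decompose(counts):
    """Aggregate a count vector of length 2^J into a binary tree.

    Returns (total, levels), where levels[s] lists the
    (left_count, parent_count) pairs at scale s, coarsest scale first.
    Hypothetical helper, written for illustration only.
    """
    n = len(counts)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = []
    current = list(counts)
    while len(current) > 1:
        # Pair up neighbours: keep the left count and the pair's sum.
        pairs = [(current[i], current[i] + current[i + 1])
                 for i in range(0, len(current), 2)]
        levels.append(pairs)
        current = [parent for _, parent in pairs]
    levels.reverse()  # built finest-first, returned coarsest-first
    return current[0], levels


def left_proportions(levels, pseudocount=0.5):
    """Crude smoothed estimates of p = lambda_left / (lambda_left + lambda_right).

    The pseudocount is an illustrative stand-in for real shrinkage.
    """
    return [[(left + pseudocount) / (parent + 2 * pseudocount)
             for left, parent in pairs]
            for pairs in levels]


total, levels = multiscale_decompose([3, 1, 0, 4])
proportions = left_proportions(levels)
print(total)        # 8
print(levels)       # [[(4, 8)], [(3, 4), (0, 4)]]
```

Under these assumptions, comparing the per-node proportions between two groups of samples is one way to localize where, and at what scale, two intensity functions differ.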

MSC:

62Pxx Applications of statistics
