A semiparametric kernel independence test with application to mutational signatures. (English) Zbl 1506.62440

Summary: Cancers arise owing to somatic mutations, and the characteristic combinations of somatic mutations form mutational signatures. Although many mutational signatures have been identified, the mutational processes underlying a number of them remain unknown, which hinders the identification of interventions that may reduce somatic mutation burdens and prevent the development of cancer. We demonstrate that the unknown cause of a mutational signature can be inferred from associated signatures of known etiology. However, existing association tests lack statistical power because of excess zeros in mutational signatures data. To address this limitation, we propose a semiparametric kernel independence test (SKIT). The SKIT statistic is defined as the integrated squared distance between mixed probability distributions and is decomposed into four disjoint components to pinpoint the source of dependency. We derive the asymptotic null distribution and prove the asymptotic convergence of power. Because convergence to the asymptotic null distribution is slow, a bootstrap method is employed to compute \(p\)-values. Simulation studies demonstrate that when zeros are prevalent, SKIT is more resilient to power loss than existing tests and is robust to random errors. We applied SKIT to The Cancer Genome Atlas mutational signatures data for over 9000 tumors across 32 cancer types, and identified a novel association between signature 17 curated in the Catalogue of Somatic Mutations in Cancer and apolipoprotein B mRNA editing enzyme (APOBEC) signatures in gastrointestinal cancers. This association indicates that APOBEC activity is likely associated with the unknown cause of signature 17.
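The summary describes a statistic built as an integrated squared distance between distributions, with \(p\)-values obtained by resampling. The sketch below illustrates that general idea only and is not the SKIT procedure: it computes the classical kernel statistic \(\int (\hat f_{XY} - \hat f_X \hat f_Y)^2\) with Gaussian kernels (in the spirit of Rosenblatt's quadratic measure, refs. [1] and [32]) and swaps in a permutation test where the paper uses a bootstrap. It also smooths the point mass at zero as if it were continuous, which is exactly the limitation a mixed-distribution approach is meant to address. The function names, the bandwidth `h = 0.1`, and the simulated zero-inflated data are all assumptions made for illustration.

```python
import math
import random


def gram(values, h):
    """Gram matrix of the convolved Gaussian kernel (standard deviation sqrt(2)*h).

    Integrating a product of two Gaussian kernels of bandwidth h gives a
    Gaussian density with variance 2*h**2 evaluated at the pairwise difference.
    """
    c = 1.0 / (2.0 * h * math.sqrt(math.pi))
    return [[c * math.exp(-((a - b) ** 2) / (4.0 * h * h)) for b in values]
            for a in values]


def isd_statistic(kx, ky):
    """Closed form of the integrated squared distance between the joint kernel
    density estimate and the product of the marginal estimates."""
    n = len(kx)
    row_x = [sum(r) / n for r in kx]
    row_y = [sum(r) / n for r in ky]
    joint = sum(kx[i][j] * ky[i][j] for i in range(n) for j in range(n)) / n ** 2
    cross = sum(a * b for a, b in zip(row_x, row_y)) / n
    product = (sum(row_x) / n) * (sum(row_y) / n)
    return joint - 2.0 * cross + product


def permutation_pvalue(x, y, h=0.1, n_perm=199, seed=0):
    """Permutation p-value (an assumption here; the paper uses a bootstrap)."""
    rng = random.Random(seed)
    kx, ky = gram(x, h), gram(y, h)
    observed = isd_statistic(kx, ky)
    idx = list(range(len(y)))
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        # Permuting y is equivalent to permuting rows and columns of its Gram matrix.
        ky_perm = [[ky[i][j] for j in idx] for i in idx]
        if isd_statistic(kx, ky_perm) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)


# Hypothetical zero-inflated data: about half of the x values are exact zeros,
# and y is zero whenever x is zero, so the two variables are strongly dependent.
data_rng = random.Random(1)
x = [0.0 if data_rng.random() < 0.5 else data_rng.random() for _ in range(60)]
y_dep = [0.0 if v == 0.0 else min(1.0, v + 0.05 * data_rng.random()) for v in x]

stat_dep, p_dep = permutation_pvalue(x, y_dep)
print(stat_dep, p_dep)
```

Only the standard library is used so the sketch runs without extra packages; with larger samples a vectorized implementation would be the more natural choice.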

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62G10 Nonparametric hypothesis testing

Software:

R; zinbwave; pyuvdata; Python

References:

[1] Ahmad, I. A.; Li, Q., “Testing Independence by Nonparametric Kernel Method,” Statistics and Probability Letters, 34, 201-210 (1997) · Zbl 0899.62049 · doi:10.1016/S0167-7152(96)00183-6
[2] Alexandrov, L. B.; Jones, P. H.; Wedge, D. C.; Sale, J. E.; Campbell, P. J.; Nik-Zainal, S.; Stratton, M. R., “Clock-Like Mutational Processes in Human Somatic Cells,” Nature Genetics, 47, 1402 (2015) · doi:10.1038/ng.3441
[3] Alexandrov, L. B.; Kim, J.; Haradhvala, N. J.; Huang, M. N.; Ng, A. W. T.; Wu, Y.; Boot, A.; Covington, K. R.; Gordenin, D. A.; Bergstrom, E. N.; Islam, S. A., “The Repertoire of Mutational Signatures in Human Cancer,” Nature, 578, 94-101 (2020) · doi:10.1038/s41586-020-1943-3
[4] Alexandrov, L. B.; Nik-Zainal, S.; Wedge, D. C.; Campbell, P. J.; Stratton, M. R., “Deciphering Signatures of Mutational Processes Operative in Human Cancer,” Cell Reports, 3, 246-259 (2013) · doi:10.1016/j.celrep.2012.12.008
[5] Basler, H., “Equivalence Between Tie-Corrected Spearman Test and a Chi-Square Test in a Fourfold Contingency Table,” Metrika, 35, 203-209 (1988) · Zbl 0638.62040 · doi:10.1007/BF02613305
[6] Benjamini, Y.; Hochberg, Y., “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society, Series B, 57, 289-300 (1995) · Zbl 0809.62014 · doi:10.1111/j.2517-6161.1995.tb02031.x
[7] Blum, J. R.; Kiefer, J.; Rosenblatt, M., “Distribution Free Tests of Independence Based on the Sample Distribution Function,” The Annals of Mathematical Statistics, 32, 485-498 (1961) · Zbl 0139.36301 · doi:10.1214/aoms/1177705055
[8] Campbell, P.; Getz, G.; Korbel, J.; Stuart, J.; Jennings, J.; Stein, L.; Perry, M.; Nahal-Bose, H.; Ouellette, B.; Li, C.; Rheinbay, E., “Pan-Cancer Analysis of Whole Genomes,” Nature, 578, 82-93 (2020)
[9] Donoho, D. L., “High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality,” in Aide-Memoire of a Lecture at American Mathematical Society Conference on Mathematical Challenges of the 21st Century (2000), Los Angeles, CA
[10] Duong, T.; Hazelton, M. L., “Cross-Validation Bandwidth Matrices for Multivariate Kernel Density Estimation,” Scandinavian Journal of Statistics, 32, 485-506 (2005) · Zbl 1089.62035 · doi:10.1111/j.1467-9469.2005.00445.x
[11] Gonzalez-Perez, A.; Sabarinathan, R.; Lopez-Bigas, N., “Local Determinants of the Mutational Landscape of the Human Genome,” Cell, 177, 101-114 (2019) · doi:10.1016/j.cell.2019.02.051
[12] Gretton, A.; Fukumizu, K.; Teo, C. H.; Song, L.; Schölkopf, B.; Smola, A. J., “A Kernel Statistical Test of Independence,” Advances in Neural Information Processing Systems, 585-592 (2008)
[13] Gretton, A.; Györfi, L., “Consistent Nonparametric Tests of Independence,” The Journal of Machine Learning Research, 11, 1391-1423 (2010) · Zbl 1242.62033
[14] Hall, P., “Central Limit Theorem for Integrated Square Error of Multivariate Nonparametric Density Estimators,” Journal of Multivariate Analysis, 14, 1-16 (1984) · Zbl 0528.62028 · doi:10.1016/0047-259X(84)90044-7
[15] Helleday, T.; Eshtad, S.; Nik-Zainal, S., “Mechanisms Underlying Mutational Signatures in Human Cancers,” Nature Reviews Genetics, 15, 585-598 (2014) · doi:10.1038/nrg3729
[16] Henderson, D. J.; Parmeter, C. F., Applied Nonparametric Econometrics (2015), New York: Cambridge University Press · Zbl 1305.62004
[17] Hoeffding, W., “A Non-Parametric Test of Independence,” The Annals of Mathematical Statistics, 19, 546-557 (1948) · Zbl 0032.42001 · doi:10.1214/aoms/1177730150
[18] Kumar, M. S.; Slud, E. V.; Okrah, K.; Hicks, S. C.; Hannenhalli, S.; Bravo, H. C., “Analysis and Correction of Compositional Bias in Sparse Sequencing Count Data,” BMC Genomics, 19, 799 (2018) · doi:10.1186/s12864-018-5160-5
[19] Lee, D. D.; Seung, H. S., “Learning the Parts of Objects by Nonnegative Matrix Factorization,” Nature, 401, 788 (1999) · Zbl 1369.68285 · doi:10.1038/44565
[20] Li, C.-S.; Lu, J.-C.; Park, J.; Kim, K.; Brinkley, P. A.; Peterson, J. P., “Multivariate Zero-Inflated Poisson Models and Their Applications,” Technometrics, 41, 29-38 (1999) · doi:10.1080/00401706.1999.10485593
[21] Li, Q.; Maasoumi, E.; Racine, J. S., “A Nonparametric Test for Equality of Distributions With Mixed Categorical and Continuous Data,” Journal of Econometrics, 148, 186-200 (2009) · Zbl 1429.62157 · doi:10.1016/j.jeconom.2008.10.007
[22] Liu, B.; Mojirsheibani, M., “On a Weighted Bootstrap Approximation of the Lp Norms of Kernel Density Estimators,” Statistics and Probability Letters, 105, 65-73 (2015) · Zbl 1396.62074 · doi:10.1016/j.spl.2015.06.005
[23] O’Brien, T. A.; Kashinath, K.; Cavanaugh, N. R.; Collins, W. D.; O’Brien, J. P., “A Fast and Objective Multidimensional Kernel Density Estimation Method: fastKDE,” Computational Statistics and Data Analysis, 101, 148-160 (2016) · Zbl 1467.62015 · doi:10.1016/j.csda.2016.02.014
[24] Olkin, I.; Trikalinos, T. A., “Constructions for a Bivariate Beta Distribution,” Statistics and Probability Letters, 96, 54-60 (2015) · Zbl 1314.62043 · doi:10.1016/j.spl.2014.09.013
[25] Ospina, R.; Ferrari, S. L., “A General Class of Zero-or-One Inflated Beta Regression Models,” Computational Statistics and Data Analysis, 56, 1609-1623 (2012) · Zbl 1243.62099 · doi:10.1016/j.csda.2011.10.005
[26] Parzen, E., “On Estimation of a Probability Density Function and Mode,” The Annals of Mathematical Statistics, 33, 1065-1076 (1962) · Zbl 0116.11302 · doi:10.1214/aoms/1177704472
[27] Petljak, M.; Alexandrov, L. B.; Brammeld, J. S.; Price, S.; Wedge, D. C.; Grossmann, S.; Dawson, K. J.; Ju, Y. S.; Iorio, F.; Tubio, J. M.; Koh, C. C., “Characterizing Mutational Signatures in Human Cancer Cell Lines Reveals Episodic APOBEC Mutagenesis,” Cell, 176, 1282-1294 (2019) · doi:10.1016/j.cell.2019.02.012
[28] R Core Team, R: A Language and Environment for Statistical Computing (2019), Vienna, Austria: R Foundation for Statistical Computing
[29] Risso, D.; Perraudeau, F.; Gribkova, S.; Dudoit, S.; Vert, J.-P., “A General and Flexible Method for Signal Extraction From Single-Cell RNA-seq Data,” Nature Communications, 9, 284 (2018) · doi:10.1038/s41467-017-02554-5
[30] Roberts, S. A.; Lawrence, M. S.; Klimczak, L. J.; Grimm, S. A.; Fargo, D.; Stojanov, P.; Kiezun, A.; Kryukov, G. V.; Carter, S. L.; Saksena, G.; Harris, S., “An APOBEC Cytidine Deaminase Mutagenesis Pattern Is Widespread in Human Cancers,” Nature Genetics, 45, 970-976 (2013) · doi:10.1038/ng.2702
[31] Rosenblatt, M., “Remarks on Some Nonparametric Estimates of a Density Function,” The Annals of Mathematical Statistics, 27, 832-837 (1956) · Zbl 0073.14602 · doi:10.1214/aoms/1177728190
[32] Rosenblatt, M., “A Quadratic Measure of Deviation of Two-Dimensional Density Estimates and a Test of Independence,” The Annals of Statistics, 3, 1-14 (1975) · Zbl 0325.62030
[33] Rosenblatt, M.; Wahlen, B. E., “A Nonparametric Measure of Independence Under a Hypothesis of Independent Components,” Statistics and Probability Letters, 15, 245-252 (1992) · Zbl 0770.62039 · doi:10.1016/0167-7152(92)90197-D
[34] Scott, D. W., Multivariate Density Estimation: Theory, Practice, and Visualization (1992), New York: Wiley · Zbl 0850.62006
[35] Sejdinovic, D.; Sriperumbudur, B.; Gretton, A.; Fukumizu, K., “Equivalence of Distance-Based and RKHS-Based Statistics in Hypothesis Testing,” The Annals of Statistics, 41, 2263-2291 (2013) · Zbl 1281.62117 · doi:10.1214/13-AOS1140
[36] Sheather, S. J.; Jones, M. C., “A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation,” Journal of the Royal Statistical Society, Series B, 53, 683-690 (1991) · Zbl 0800.62219 · doi:10.1111/j.2517-6161.1991.tb01857.x
[37] Shen, C.; Priebe, C. E.; Vogelstein, J. T., “From Distance Correlation to Multiscale Graph Correlation,” Journal of the American Statistical Association, 115, 280-291 (2020) · Zbl 1437.62210 · doi:10.1080/01621459.2018.1543125
[38] Silverman, B. W., Density Estimation for Statistics and Data Analysis (1986), London: Chapman and Hall · Zbl 0617.62042
[39] Stratton, M. R.; Campbell, P. J.; Futreal, P. A., “The Cancer Genome,” Nature, 458, 719 (2009) · doi:10.1038/nature07943
[40] Székely, G. J.; Rizzo, M. L., “Brownian Distance Covariance,” The Annals of Applied Statistics, 3, 1236-1265 (2009) · Zbl 1196.62077 · doi:10.1214/09-AOAS312
[41] Székely, G. J.; Rizzo, M. L.; Bakirov, N. K., “Measuring and Testing Dependence by Correlation of Distances,” The Annals of Statistics, 35, 2769-2794 (2007) · Zbl 1129.62059 · doi:10.1214/009053607000000505
[42] van Rossum, G., “The Python Language Reference: Release 3.7” (2018)
[43] Xu, L.; Paterson, A. D.; Turpin, W.; Xu, W., “Assessment and Selection of Competing Models for Zero-Inflated Microbiome Data,” PLoS One, 10, e0129606 (2015) · doi:10.1371/journal.pone.0129606

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced with data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.