
A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. (English) Zbl 1447.92258

Summary: Detecting, characterizing, and interpreting gene-gene interactions or epistasis in studies of human disease susceptibility is both a mathematical and a computational challenge. To address this problem, we have previously developed a multifactor dimensionality reduction (MDR) method for collapsing high-dimensional genetic data into a single dimension (i.e., constructive induction), thus permitting interactions to be detected in relatively small sample sizes. In this paper, we describe a comprehensive and flexible framework for detecting and interpreting gene-gene interactions that utilizes advances in information theory for selecting interesting single-nucleotide polymorphisms (SNPs), MDR for constructive induction, machine learning methods for classification, and finally graphical models for interpretation. We illustrate the usefulness of this strategy using artificial datasets simulated from several different two-locus and three-locus epistasis models. We show that the accuracy, sensitivity, specificity, and precision of a naïve Bayes classifier are significantly improved when SNPs are selected based on their information gain (i.e., class entropy removed) and reduced to a single attribute using MDR. We then apply this strategy to detecting, characterizing, and interpreting epistatic models in a genetic study \((n=500)\) of atrial fibrillation and show that both classification and model interpretation are significantly improved.
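
The two central steps named in the summary, information gain for ranking attributes and MDR for collapsing a pair of SNPs into one attribute, can be illustrated with a minimal Python sketch. This is not the authors' software: the function names (`entropy`, `information_gain`, `mdr_attribute`), the 0/1 genotype coding, the toy XOR-style dataset, and the default risk threshold (the overall case-to-control ratio) are all assumptions made for illustration, and the classification and graphical interpretation steps are omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(attribute, labels):
    """Class entropy removed by knowing the attribute: H(C) - H(C | A)."""
    n = len(labels)
    groups = defaultdict(list)
    for a, y in zip(attribute, labels):
        groups[a].append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def mdr_attribute(snp_a, snp_b, labels, threshold=None):
    """Collapse two SNPs into one binary attribute (1 = high risk, 0 = low risk).

    A genotype combination is labelled high risk when its case:control ratio
    meets or exceeds the threshold. Labels are assumed to be 1 (case) or
    0 (control). The default threshold is an assumption of this sketch.
    """
    cases = sum(labels)
    controls = len(labels) - cases
    if threshold is None:
        threshold = cases / controls
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for a, b, y in zip(snp_a, snp_b, labels):
        counts[(a, b)][y] += 1
    # Compare case >= threshold * control to avoid dividing by zero controls.
    high_risk = {cell for cell, (ctrl, case) in counts.items()
                 if case >= threshold * ctrl}
    return [1 if (a, b) in high_risk else 0 for a, b in zip(snp_a, snp_b)]


# Hypothetical toy data: disease status depends only on the XOR of two SNPs,
# so neither SNP is informative on its own.
snp_a = [0, 0, 1, 1] * 2
snp_b = [0, 1, 0, 1] * 2
status = [0, 1, 1, 0] * 2

gain_a = information_gain(snp_a, status)
gain_b = information_gain(snp_b, status)
combined = mdr_attribute(snp_a, snp_b, status)
gain_mdr = information_gain(combined, status)
print(gain_a, gain_b, gain_mdr)
```

In this toy dataset each SNP alone removes no class entropy, while the single MDR-constructed attribute removes all of it, which is the kind of purely interactive effect that constructive induction is meant to expose to a downstream classifier such as naïve Bayes. On real data the attribute would be built on training folds only and evaluated by cross-validation.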

MSC:

92D10 Genetics and epigenetics
92C32 Pathology, pathophysiology
62P10 Applications of statistics to biology and medical sciences; meta analysis
62R07 Statistical aspects of big data and data science
