Can matching improve the performance of boosting for identifying important genes in observational studies? (English) Zbl 1305.65072

Summary: When two groups of individuals are to be compared with respect to gene expression, there will often be some potentially confounding variables that differ between the groups. Matching is an established approach for obtaining comparable groups and enabling subsequent univariate tests for each gene. Alternatively, the confounders might be incorporated directly into a multivariable regression model for adjustment. In contrast to univariate tests, such models can consider all genes simultaneously. Aiming to combine the advantages of both approaches, matching and multivariable modeling, we consider a matching-based boosting procedure for fitting risk prediction models in two-group settings. This potentially makes it possible to identify and automatically remove problematic observations that might negatively affect the regression model. Therefore, in a simulation study with different covariate correlation structures, we compare how well this combination of matching and boosting identifies important covariates relative to boosting alone. Furthermore, we analyze the prediction performance of these approaches on two gene expression microarray studies. The first study comprises patients with B-cell and T-cell type acute lymphoblastic leukemia, and the second comprises patients with acute megakaryoblastic leukemia. While the matching component can in principle guard against problematic observations, the combined approach is seen to improve neither the identification of important covariates nor prediction performance. Therefore, a combination of the two approaches cannot be recommended. Adjustment for potential confounders is seen to provide the best performance, i.e., a pure multivariable regression modeling strategy seems to be promising even in the presence of considerable heterogeneity.
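
To make the two ingredients named in the summary concrete, the following is a minimal sketch of matching followed by componentwise boosting. It is not the authors' procedure: the function names `greedy_match` and `componentwise_boost`, the single-confounder caliper matching, and the least-squares loss with step size `nu` are all assumptions chosen for illustration, and the paper's actual matching scheme and loss function may differ.

```python
def greedy_match(conf_a, conf_b, caliper):
    """Greedy 1:1 nearest-neighbour matching on one confounder (illustrative).

    Returns (i, j) index pairs; observations without a partner within
    `caliper` are left out, which is how matching can drop outlying cases.
    """
    pairs = []
    free_b = list(range(len(conf_b)))
    for i, a in enumerate(conf_a):
        if not free_b:
            break
        j = min(free_b, key=lambda k: abs(conf_b[k] - a))
        if abs(conf_b[j] - a) <= caliper:
            pairs.append((i, j))
            free_b.remove(j)
    return pairs


def componentwise_boost(X, y, steps=50, nu=0.1):
    """Componentwise least-squares boosting (illustrative simplification).

    In each step only the single covariate that best fits the current
    residuals is updated, so many coefficients stay exactly zero.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centred covariates
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    beta = [0.0] * p
    ss = [sum(Xc[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(steps):
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j in range(p):
            if ss[j] == 0:
                continue  # constant covariate carries no information
            xr = sum(Xc[i][j] * resid[i] for i in range(n))
            gain = xr * xr / ss[j]  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_b, best_gain = j, xr / ss[j], gain
        if best_j is None:
            break  # no covariate improves the fit
        beta[best_j] += nu * best_b
        for i in range(n):
            resid[i] -= nu * best_b * Xc[i][best_j]
    return intercept, beta
```

In a combined approach of the kind the summary describes, one would first call `greedy_match` on the confounder values of the two groups, keep only the matched observations, and then pass their expression values and group labels to `componentwise_boost`. The confounder-adjustment alternative would instead keep all observations and include the confounders as additional columns of `X`.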

MSC:

62-08 Computational methods for problems pertaining to statistics

Software:

GlobalAncova