Incorporating predictor network in penalized regression with application to microarray data. (English) Zbl 1192.62235

Summary: We consider penalized linear regression, especially for “large \(p\), small \(n\)” problems, in which the relationships among predictors are described a priori by a network. A class of motivating examples includes modeling a phenotype through gene expression profiles while accounting for the coordinated functioning of genes in the form of biological pathways or networks. To incorporate the prior knowledge that neighboring predictors in a network have similar effect sizes, we propose a grouped penalty based on the \(L_{\gamma}\)-norm that smooths the regression coefficients of the predictors over the network. The main feature of the proposed method is its ability to automatically realize grouped variable selection and exploit grouping effects. We also discuss the effects of the choice of \(\gamma\) and of the weights inside the \(L_{\gamma}\)-norm. Simulation studies demonstrate the superior finite-sample performance of the proposed method compared with Lasso, elastic net, and a recently proposed network-based method. The new method performs best in variable selection across all simulation set-ups considered. For illustration, the method is applied to a microarray data set to predict survival times for glioblastoma patients using gene expression data and a gene network compiled from Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
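
To make the idea of a network-based grouped penalty concrete, the following is a minimal sketch, not the authors' implementation. It assumes the penalty sums, over the edges \(i \sim j\) of the network, the term \((|\beta_i|^{\gamma}/w_i + |\beta_j|^{\gamma}/w_j)^{1/\gamma}\), with node degrees used as the weights \(w_i\); the summary does not state the exact form, scaling constants, or weights, so these are assumptions. The function name `network_lgamma_penalty` and the toy network are hypothetical.

```python
import numpy as np


def network_lgamma_penalty(beta, edges, weights, gamma=2.0):
    """Sum over edges (i, j) of (|b_i|^gamma / w_i + |b_j|^gamma / w_j)^(1/gamma).

    Assumed form of the penalty (tuning parameter and constants omitted).
    """
    beta = np.asarray(beta, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = 0.0
    for i, j in edges:
        term = (abs(beta[i]) ** gamma / weights[i]
                + abs(beta[j]) ** gamma / weights[j])
        total += term ** (1.0 / gamma)
    return total


# Hypothetical toy network: a path 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]

# Assumed weights: the degree of each node in the network.
degree = np.bincount(np.ravel(edges), minlength=4)

beta = np.array([1.0, 1.0, 0.0, 0.0])

penalty_l2 = network_lgamma_penalty(beta, edges, degree, gamma=2.0)
penalty_l1 = network_lgamma_penalty(beta, edges, degree, gamma=1.0)
```

Under these assumptions, an edge contributes nothing only when both of its coefficients are exactly zero, which is one way to see why such a penalty with \(\gamma > 1\) tends to select or drop neighboring predictors together. With \(\gamma = 1\) and degree weights, the sum reduces to the ordinary Lasso penalty \(\sum_i |\beta_i|\), so the grouping effect disappears in that case.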

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
92C50 Medical applications (general)
92C40 Biochemistry, molecular biology
62J05 Linear regression; mixed models
65C60 Computational problems in statistics (MSC2010)
92C42 Systems biology, networks

Software:

KEGG; OSCAR
