Sequential model selection-based segmentation to detect DNA copy number variation. (English) Zbl 1390.62266

Summary: Array-based CGH experiments are designed to detect genomic aberrations or regions of DNA copy-number variation that are associated with an outcome, typically a disease state. Most existing statistical methods target the detection of DNA copy-number variations in a single sample or array. We focus on the detection of group-effect variation through the simultaneous study of multiple samples from multiple groups. Rather than using direct segmentation or smoothing techniques, as is common in existing detection methods, we develop a sequential model selection procedure that is guided by a modified Bayesian information criterion. This approach improves detection accuracy by cumulatively using information across contiguous clones, and it has a computational advantage over popular existing detection methods. Our empirical investigation suggests that the proposed method outperforms existing detection methods, in particular in detecting small segments and in separating neighboring segments with different degrees of copy-number variation.

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62L10 Sequential statistical analysis
62K05 Optimal statistical designs
