×

On optimal multiple changepoint algorithms for large data. (English) Zbl 1505.62269

Summary: Many common approaches to detecting changepoints, for example based on statistical criteria such as penalised likelihood or minimum description length, can be formulated in terms of minimising a cost over segmentations. We focus on a class of dynamic programming algorithms that can solve the resulting minimisation problem exactly, and thus find the optimal segmentation under the given statistical criteria. The standard implementation of these dynamic programming methods have a computational cost that scales at least quadratically in the length of the time-series. Recently pruning ideas have been suggested that can speed up the dynamic programming algorithms, whilst still being guaranteed to be optimal, in that they find the true minimum of the cost function. Here we extend these pruning methods, and introduce two new algorithms for segmenting data: FPOP and SNIP. Empirical results show that FPOP is substantially faster than existing dynamic programming methods, and unlike the existing methods its computational efficiency is robust to the number of changepoints in the data. We evaluate the method for detecting copy number variations and observe that FPOP has a computational cost that is even competitive with that of binary segmentation, but can give much more accurate segmentations.

MSC:

62-08 Computational methods for problems pertaining to statistics
62M10 Time series, auto-correlation, regression, etc. in statistics (GARCH)
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

cghseg; wbs; SegAnnDB

References:

[1] Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716-723 (1974) · Zbl 0314.62039 · doi:10.1109/TAC.1974.1100705
[2] Aue, A., Horvth, L.: Structural breaks in time series. J. Time Ser. Anal. 34(1), 1-16 (2013) · Zbl 1274.62553 · doi:10.1111/j.1467-9892.2012.00819.x
[3] Auger, I.E., Lawrence, C.E.: Algorithms for the optimal identification of segment neighborhoods. Bull. Math. Biol. 51, 39-54 (1989) · Zbl 0658.92010 · doi:10.1007/BF02458835
[4] Braun, J.V., Braun, R.K., Muller, H.G.: Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika 87, 301-314 (2000) · Zbl 0963.62067 · doi:10.1093/biomet/87.2.301
[5] Braun, J.V., Müller, H.-G.: Statistical methods for DNA sequence segmentation. Stat. Sci. 13(2), 142-162 (1998) · Zbl 0960.62121 · doi:10.1214/ss/1028905933
[6] Cleynen, A., Koskas, M., Rigaill, G.: A generic implementation of the pruned dynamic programing algorithm. ArXiv e-prints (2012)
[7] Davis, R.A., Lee, T.C.M., Rodriguez-Yam, G.A.: Structural break estimation for nonstationary time series models. J. Am. Stat. Assoc. 101, 223-239 (2006) · Zbl 1118.62359 · doi:10.1198/016214505000000745
[8] Frick, K., Munk, A., Sieling, H.: Multiscale change point inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 76(3), 495-580 (2014) · Zbl 1411.62065 · doi:10.1111/rssb.12047
[9] Fryzlewicz, P.: Wild binary segmentation for multiple change-point detection. Ann. Stat. (2012) (to appear) · Zbl 1302.62075
[10] Futschik, A., Hotz, T., Munk, A., Sieling, H.: Multiscale DNA partitioning: statistical evidence for segments. Bioinformatics 30(16), 2255-2262 (2014) · doi:10.1093/bioinformatics/btu180
[11] Haynes, K., Eckley, I. A., Fearnhead, P.: Efficient penalty search for multiple changepoint problems. ArXiv e-prints (2014) · Zbl 1505.62181
[12] Hocking, T.D., Boeva, V., Rigaill, G., Schleiermacher, G., Janoueix-Lerosey, I., Delattre, O., Richer, W., Bourdeaut, F., Suguro, M., Seto, M., Bach, F., Vert, J.-P.: SegAnnDB: interactive web-based genomic segmentation. Bioinformatics 30, 1539-1546 (2014) · doi:10.1093/bioinformatics/btu072
[13] Hocking, T.D., Schleiermacher, G., Janoueix-lerosey, I., Boeva, V., Cappo, J., Delattre, O., Bach, F., Vert, J.-P.: Learning smoothing models of copy number profiles using breakpoint annotations. BNC Bioinform. 14, 164 (2013) · doi:10.1186/1471-2105-14-164
[14] Jackson, B., Scargle, J.D., Barnes, D., Arabhi, S., Alt, A., Gioumousis, P., Gwin, E., Sangtrakulcharoen, P., Tan, L., Tsai, T.T.: An algorithm for optimal partitioning of data on an interval. IEE Signal Process. Lett. 12, 105-108 (2005) · doi:10.1109/LSP.2001.838216
[15] Killick, R., Eckley, I.A., Ewans, K., Jonathan, P.: Detection of changes in variance of oceanographic time-series using changepoint analysis. Ocean Eng. 37(13), 1120-1126 (2010) · doi:10.1016/j.oceaneng.2010.04.009
[16] Killick, R., Fearnhead, P., Eckley, I.A.: Optimal detection of changepoints with a linear computational cost. J. Am. Stat. Assoc. 107, 1590-1598 (2012) · Zbl 1258.62091 · doi:10.1080/01621459.2012.737745
[17] Lavielle, M.: Using penalized contrasts for the change-point problem. Signal Process. 85, 1501-1510 (2005) · Zbl 1160.94341 · doi:10.1016/j.sigpro.2005.01.012
[18] Lee, C.-B.: Estimating the number of change points in a sequence of independent normal random variables. Stat. Prob. Lett. 25(3), 241-248 (1995) · Zbl 0839.62015 · doi:10.1016/0167-7152(94)00227-Y
[19] Olshen, A.B., Venkatraman, E.S., Lucito, R., Wigler, M.: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557-572 (2004) · Zbl 1155.62478 · doi:10.1093/biostatistics/kxh008
[20] Picard, F., Lebarbier, E., Hoebeke, M., Rigaill, G., Thiam, B., Robin, S.: Joint segmentation, calling, and normalization of multiple CGH profiles. Biostatistics 12, 413-428 (2011) · doi:10.1093/biostatistics/kxq076
[21] Reeves, J., Chen, J., Wang, X.L., Lund, R., Lu, Q.Q.: A review and comparison of changepoint detection techniques for climate data. J. Appl. Meteorol. Climatol. 46, 900-915 (2007) · doi:10.1175/JAM2493.1
[22] Rigaill, G.: Pruned dynamic programming for optimal multiple change-point detection. ArXiv e-prints (2010) · Zbl 0284.62044
[23] Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461-464 (1978) · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[24] Scott, A.J., Knott, M.: A cluster analysis method for grouping means in the analysis of variance. Biometrics 30, 507-512 (1974) · Zbl 0284.62044 · doi:10.2307/2529204
[25] Yao, Y.C.: Estimating the number of change-points via Schwarz’ criterion. Stat. Prob. Lett. 6(2), 181-189 (1988) · Zbl 0642.62016 · doi:10.1016/0167-7152(88)90118-6
[26] Yao, Y.-C., Au, S.T.: Least-squares estimation of a step function. Indian J. Stat. 51(3), 370-381 (1989) · Zbl 0711.62031
[27] Zhang, N.R., Siegmund, D.O.: A modified bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics 63, 22-32 (2007) · Zbl 1206.62174 · doi:10.1111/j.1541-0420.2006.00662.x
[28] Zhang, N.R., Siegmund, D.O., Ji, H., Li, J.Z.: Detecting simultaneous changepoints in multiple sequences. Biometrika 97(3), 631-645 (2010) · Zbl 1195.62168 · doi:10.1093/biomet/asq025
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.