×

Malware family discovery using reversible jump MCMC sampling of regimes. (English) Zbl 1409.62168

Summary: Malware is computer software that has either been designed or modified with malicious intent. Hundreds of thousands of new malware threats appear on the internet each day. This is made possible through reuse of known exploits in computer systems that have not been fully eradicated; existing pieces of malware can be trivially modified and combined to create new malware, which is unknown to anti-virus programs. Finding new software with similarities to known malware is therefore an important goal in cyber-security. A dynamic instruction trace of a piece of software is the sequence of machine language instructions it generates when executed. Statistical analysis of a dynamic instruction trace can help reverse engineers infer the purpose and origin of the software that generated it. Instruction traces have been successfully modeled as simple Markov chains, but empirically there are change points in the structure of the traces, with recurring regimes of transition patterns. Here, reversible jump Markov chain Monte Carlo for change point detection is extended to incorporate regime-switching, allowing regimes to be inferred from malware instruction traces. A similarity measure for malware programs based on regime matching is then used to infer the originating families, leading to compelling performance results.

MSC:

62M10 Time series, auto-correlation, regression, etc. in statistics (GARCH)
62P99 Applications of statistics

Software:

changepoint

References:

[1] Anderson, B.; Lane, T.; Hash, C., Malware Phylogenetics Based on the Multiview Graphical Lasso, Advances in Intelligent Data Analysis XIII, 1-12 (2014)
[2] Beal, M. J.; Ghahramani, Z.; Rasmussen, C. E., The Infinite Hidden Markov Model, Neural Information Processing Systems 14, eds. T. G. Dietterich, S. Becker, and Z. Ghahramani, Cambridge, MA: MIT Press, 577-585 (2002)
[3] Bolton, A.; Heard, N., Application of a Linear Time Method for Change Point Detection to the Classification of Software, Proceedings of the IEEE Joint Intelligence and Security Informatics Conference (JISIC), The Hague: IEEE, 292-295 (2014)
[4] Brooks, S. P.; Gelman, A., General Methods for Monitoring Convergence of Iterative Simulations, Journal of Computational and Graphical Statistics, 7, 434-455 (1998)
[5] Chib, S., Calculating Posterior Distributions and Modal Estimates in Markov Mixture Models, Journal of Econometrics, 75, 79-97 (1996) · Zbl 0864.62010
[6] Contagio, Mandiant APT1 Samples Categorized by Malware Families,” available at: . (2013)
[7] Dai, J.; Guha, R.; Lee, J., Efficient Virus Detection Using Dynamic Instruction Sequences, Journal of Computers, 4, 405-414 (2009)
[8] Denison, D. G. T.; Holmes, C. C.; Mallick, B. K.; Smith, A. F. M., Bayesian Methods for Nonlinear Classification and Regression (2002), Wiley Series in Probability and Statistics, Chichester: Wiley · Zbl 0994.62019
[9] Domingos, P.; Provost, F., Well-Trained PETs: Improving Probability Estimation Trees (2000)
[10] Fisher, R. A., Statistical Methods for Research Workers (1929), Edinburgh: Oliver & Boyd · JFM 60.1162.01
[11] Fitzpatrick, M.; Marchev, D., Efficient Bayesian Estimation of the Multivariate Double Chain Markov Model, Statistics and Computing, 23, 467-480 (2013) · Zbl 1325.62164
[12] Goldfeld, S. M.; Quandt, R. E., A Markov Model for Switching Regressions, Journal of Econometrics, 1, 3-15 (1973) · Zbl 0294.62087
[13] Green, P. J., Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination, Biometrika, 82, 711-732 (1995) · Zbl 0861.62023
[14] Han, D.; Tsung, F., The Optimal Stopping Time for Detecting Changes in Discrete Time Markov Processes, Sequential Analysis, 28, 115-135 (2009) · Zbl 1165.62054
[15] Kass, R. E.; Carlin, B. P.; Gelman, A.; Neal, R. M., Markov Chain Monte Carlo in Practice: A Roundtable Discussion, The American Statistician, 52, 93-100 (1998)
[16] Killick, R.; Fearnhead, P.; Eckley, I. A., Optimal Detection of Changepoints with a Linear Computational Cost, Journal of the American Statistical Association, 107, 1590-1598 (2012) · Zbl 1258.62091
[17] McAfee, McAfee Labs Threats Report June 2017,” available at: . (2017)
[18] Page, E. S., Continuous Inspection Schemes, Biometrika, 41, 100-115 (1954) · Zbl 0056.38002
[19] Polansky, A. M., Detecting Change-Points in Markov Chains, Computational Statistics and Data Analysis, 51, 6013-6026 (2007) · Zbl 1445.62214
[20] Royal, P.; Halpin, M.; Dagon, D.; Edmonds, R.; Lee, W., PolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware, Proceedings of the 22nd Annual Computer Security Applications Conference (ACSAC ’06), Miami, FL: IEEE, 289-300 (2006)
[21] Storlie, C.; Anderson, B.; Wiel, S.; Quist, D.; Hash, C.; Brown, N., Stochastic Identification of Malware with Dynamic Traces, The Annals of Applied Statistics, 8, 1-18 (2014) · Zbl 1429.62713
[22] Tartakovsky, A. G.; Rozovskii, B. L.; Blažek, R. B.; Kim, H., Detection of Intrusions in Information Systems by Sequential Change-Point Methods, Statistical Methodology, 3, 252-293 (2006) · Zbl 1248.94032
[23] Van Gael, J.; Saatci, Y.; Teh, Y. W.; Ghahramani, Z., Beam Sampling for the Infinite Hidden Markov Model, Proceedings of the 25th International Conference on Machine Learning (ICML ’08), 1088-1095 (2008), New York: ACM, New York
[24] Vasas, K.; Elek, P.; Márkus, L., A Two-State Regime Switching Autoregressive Model with an Application to River Flow Analysis, Journal of Statistical Planning and Inference, 137, 3113-3126 (2007) · Zbl 1114.62093
[25] Xian, J.-G.; Han, D.; Yu, J.-Q., Online Change Detection of Markov Chains With Unknown Post-Change Transition Probabilities, Communications in Statistics—Theory and Methods, 45, 597-611 (2016) · Zbl 1343.62046
[26] Xing, H.; Sun, N.; Chen, Y., Credit Rating Dynamics in the Presence of Unknown Structural Breaks, Journal of Banking and Finance, 36, 78-89 (2012)
[27] Yakir, B., Optimal Detection of a Change in Distribution When the Observations Form a Markov Chain with a Finite State Space, Lecture Notes - Monograph Series, 23, 346-358 (1994) · Zbl 1163.62340
[28] Zhao, H.; Marriott, P., Variational Bayes for Regime-Switching Log-Normal Models, Entropy, 16, 3832 (2014) · Zbl 1338.62027
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.