Abstract
High-performance computing systems extensively log resource usage data and system events, and analyzing these data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems using three regression algorithms: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models for diagnosing major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource use counters that are correlated with major page faults and the system events that are correlated with page fault events, and (c) provide insights into major page faults and page fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.
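To make the comparison of the three regression techniques concrete, the following is a minimal sketch and not the implementation used in the study. It assumes a single coordinate-descent solver for the Elastic Net objective, of which LASSO (`l1_ratio=1`) and Ridge (`l1_ratio=0`) are special cases. The synthetic "counters", the penalty strength `alpha=0.1`, and all function names are hypothetical choices made for illustration; no real resource use data is involved, and no intercept is fitted because the synthetic data is zero-mean.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_elastic_net(X, y, alpha, l1_ratio, n_iter=200):
    """Coordinate descent for the objective
    (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    l1_ratio=1 corresponds to LASSO, l1_ratio=0 to Ridge."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (col_sq[j] + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w

def mean_squared_error(X, y, w):
    """Average squared prediction error of the linear model w on (X, y)."""
    errors = [(yi - sum(xj * wj for xj, wj in zip(xi, w))) ** 2 for xi, yi in zip(X, y)]
    return sum(errors) / len(errors)

def make_synthetic_data(n_samples, seed):
    """Hypothetical stand-in for resource use counters:
    five counters, of which only the first two drive the target."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n_samples)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]
    return X, y

X_train, y_train = make_synthetic_data(200, seed=0)
X_test, y_test = make_synthetic_data(100, seed=1)

# Same solver, three penalty mixes (an assumed, illustrative setting).
models = {"LASSO": 1.0, "Ridge": 0.0, "Elastic Net": 0.5}
results = {}
for name, l1_ratio in models.items():
    w = fit_elastic_net(X_train, y_train, alpha=0.1, l1_ratio=l1_ratio)
    results[name] = (w, mean_squared_error(X_test, y_test, w))
    print(name, [round(c, 3) for c in w], round(results[name][1], 4))
```

In a sketch like this, the held-out error gives one way to compare accuracy across the three models, while the fitted coefficients indicate which inputs each model treats as related to the target. The L1 penalty can drive coefficients of unrelated inputs exactly to zero, whereas Ridge only shrinks them.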
Data availability
The datasets analyzed during this study are available from the corresponding author on request.
Acknowledgements
We would like to thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the resource use data and system logs from their HPC systems. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve our paper significantly.
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
EC prepared the manuscript and conducted the experiments. AJ and SN reviewed and edited the manuscript.
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Ethical approval
Not applicable.
About this article
Cite this article
Chuah, E., Jhumka, A. & Narasimhamurthy, S. An empirical study of major page faults for failure diagnosis in cluster systems. J Supercomput 79, 18445–18479 (2023). https://doi.org/10.1007/s11227-023-05366-1
DOI: https://doi.org/10.1007/s11227-023-05366-1