
An empirical study of major page faults for failure diagnosis in cluster systems

Published in: The Journal of Supercomputing

Abstract

High-performance computing systems log resource-usage data and system events extensively, and analyzing these data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems using three regression techniques: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models for diagnosing major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource-use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource-use counters that are correlated with major page faults and the system events that are correlated with page-fault events, and (c) provide insights into major page faults and page-fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.
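The abstract describes fitting regularized linear models that relate resource-use counters to major page faults. As a minimal sketch of one of the three techniques, the following shows Ridge regression solved in closed form, w = (XᵀX + αI)⁻¹Xᵀy; the counter values and the two-feature setup are invented for illustration and are not taken from the paper's data, and alpha denotes the L2 penalty strength:

```python
# Sketch of Ridge regression, one of the three techniques the paper
# evaluates, in closed form: w = (X^T X + alpha*I)^-1 X^T y.
# The features stand in for hypothetical resource-use counters; the
# actual counters are identified empirically in the paper.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def ridge_fit(X, y, alpha):
    # Normal equations with an L2 penalty added to the Gram matrix.
    Xt = transpose(X)
    gram = matmul(Xt, X)
    for i in range(len(gram)):
        gram[i][i] += alpha
    Xty = matmul(Xt, [[v] for v in y])
    return [row[0] for row in matmul(inverse_2x2(gram), Xty)]

# Two hypothetical counters against a major-page-fault rate;
# here y = 2*x1 + 3*x2 exactly.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 3.0, 5.0, 7.0]
w = ridge_fit(X, y, alpha=0.0)  # alpha=0 reduces to ordinary least squares
```

With alpha = 0 the exact coefficients [2.0, 3.0] are recovered; alpha > 0 shrinks the coefficient norm, trading bias for variance. LASSO swaps the L2 penalty for an L1 penalty (which drives some coefficients exactly to zero, performing the counter selection the paper exploits), and Elastic Net mixes the two.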



Data availability

The datasets analyzed during this study are available from the corresponding author on request.


Acknowledgements

We would like to thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the resource use data and system logs from their HPC systems. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve our paper significantly.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Authors and Affiliations

Authors

Contributions

EC prepared the manuscript and conducted the experiments. AJ and SN reviewed and edited the manuscript.

Corresponding author

Correspondence to Edward Chuah.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chuah, E., Jhumka, A. & Narasimhamurthy, S. An empirical study of major page faults for failure diagnosis in cluster systems. J Supercomput 79, 18445–18479 (2023). https://doi.org/10.1007/s11227-023-05366-1

