
An empirical study of major page faults for failure diagnosis in cluster systems

Published in: The Journal of Supercomputing

Abstract

High-performance computing systems log resource-usage data and system events extensively, and analyzing these data is an often-advocated basis for failure diagnosis. Major page faults are known to be one of the most common causes of performance problems in large cluster systems. We conduct an empirical study of major page faults on two large cluster systems using three regression techniques: LASSO, Ridge and Elastic Net. To the best of our knowledge, no prior work has studied different regression models for diagnosing major page faults in a large cluster system. In this paper, we (a) propose an approach for diagnosing major page faults, and (b) evaluate the LASSO, Ridge and Elastic Net regression algorithms on real resource-use data and system logs. As part of our contributions, we (a) compare the accuracy of the three regression algorithms, (b) identify the resource-use counters that are correlated with major page faults and the system events that are correlated with page-fault events, and (c) provide insights into major page faults and page-fault events. Our work highlights empirical observations that could facilitate better handling of node failures in cluster systems.
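The abstract describes fitting regularized linear models that relate resource-use counters to major page faults. As a minimal sketch of one of the three techniques, the following shows Ridge regression solved in closed form, w = (XᵀX + αI)⁻¹Xᵀy; the counter values and the two-feature setup are invented for illustration and are not taken from the paper's data, and alpha denotes the L2 penalty strength:

```python
# Sketch of Ridge regression, one of the three techniques the paper
# evaluates, in closed form: w = (X^T X + alpha*I)^-1 X^T y.
# The features stand in for hypothetical resource-use counters; the
# actual counters are identified empirically in the paper.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def ridge_fit(X, y, alpha):
    # Normal equations with an L2 penalty added to the Gram matrix.
    Xt = transpose(X)
    gram = matmul(Xt, X)
    for i in range(len(gram)):
        gram[i][i] += alpha
    Xty = matmul(Xt, [[v] for v in y])
    return [row[0] for row in matmul(inverse_2x2(gram), Xty)]

# Two hypothetical counters against a major-page-fault rate;
# here y = 2*x1 + 3*x2 exactly.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 3.0, 5.0, 7.0]
w = ridge_fit(X, y, alpha=0.0)  # alpha=0 reduces to ordinary least squares
```

With alpha = 0 the exact coefficients [2.0, 3.0] are recovered; alpha > 0 shrinks the coefficient norm, trading bias for variance. LASSO swaps the L2 penalty for an L1 penalty (which drives some coefficients exactly to zero, performing the counter selection the paper exploits), and Elastic Net mixes the two.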



Data availability

The datasets analyzed during this study are available from the corresponding author on request.


Acknowledgements

We would like to thank the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing the resource use data and system logs from their HPC systems. We would also like to thank the anonymous reviewers for their constructive feedback, which helped improve our paper significantly.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Authors and Affiliations

Authors

Contributions

EC prepared the manuscript and conducted the experiments. AJ and SN reviewed and edited the manuscript.

Corresponding author

Correspondence to Edward Chuah.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chuah, E., Jhumka, A. & Narasimhamurthy, S. An empirical study of major page faults for failure diagnosis in cluster systems. J Supercomput 79, 18445–18479 (2023). https://doi.org/10.1007/s11227-023-05366-1

