Abstract
In this paper, we analyze the scheduler log of a production supercomputer that contains complete job information, in contrast to many existing (publicly available) HPC logs that carry only largely limited job information. We not only provide an in-depth statistical analysis of failed jobs from the scheduler log, but also demonstrate how such a detailed scheduler log can be leveraged to predict job failures. For the latter, we first conduct a feature analysis based on the framework of ‘weight of evidence’ and ‘information value’ to uncover the impact of each workload attribute (feature) on the failure or success of a job, thereby enabling us to identify key features. We then conduct a comparative performance study of six data-driven machine learning models for predicting job failures in an HPC system based on the scheduler log. Our experimental results show that tree-based models exhibit superior performance in terms of both prediction accuracy and computational cost. We also demonstrate that our feature analysis improves the computational efficiency of each machine learning model without degrading its prediction performance.
Data availability
No additional data or materials available.
Notes
Tachyon is the fourth supercomputer at the National Supercomputing Center of the Korea Institute of Science and Technology Information; it provided computing resources to support large-scale national research until 2017. While the fifth supercomputer, Nurion, is currently in operation, its logs are not yet open to the public for security reasons. Thus, we focus on the log of Tachyon in this work.
Note that a job can be running on multiple nodes simultaneously, so it can be associated with multiple hostnames. Thus, when computing the IV value of hostname, we use the hostname of the first node associated with each job.
In general, the values of IV have the following implications [4]: \(\text {IV} < 0.03\) (poor predictor), \(0.03< \text {IV} < 0.1\) (weak predictor), \(0.1< \text {IV} < 0.3\) (average predictor), \(0.3< \text {IV} < 0.5\) (strong predictor), and \(0.5 < \text {IV}\) (very strong predictor).
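The weight-of-evidence and information-value computation described in these notes can be sketched in a few lines. The following is a minimal illustration (not the paper's implementation): for each category of a feature, WoE is the log ratio of that category's share of successful jobs to its share of failed jobs, and IV sums the share differences weighted by WoE.

```python
import math

def information_value(bins):
    """Compute the IV of a categorical feature.

    bins: list of (n_success, n_fail) pairs, one per category
    (e.g., per hostname or per job queue).
    """
    total_s = sum(s for s, f in bins)
    total_f = sum(f for s, f in bins)
    iv = 0.0
    for s, f in bins:
        ps, pf = s / total_s, f / total_f  # shares of successes / failures
        woe = math.log(ps / pf)            # weight of evidence of this category
        iv += (ps - pf) * woe
    return iv
```

For example, a feature whose categories split successes and failures identically yields IV = 0 (a poor predictor), while a feature that separates them sharply yields IV well above 0.5 (a very strong predictor). In practice, categories with zero counts require smoothing before taking the logarithm.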
References
Abdou HA, Pointon J (2011) Credit scoring, statistical techniques and evaluation criteria: a review of the literature. Intell Syst Accounting Financ Manag 18(2–3):59–88
Abeyratne N, Chen HM, Oh B, et al (2016) Checkpointing exascale memory systems with existing memory technologies. In: International Symposium on Memory Systems (MEMSYS’16), ACM, pp 18–29
Alharthi KA, Jhumka A, Di S, et al (2022) Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems. In: Proceedings of the 36th ACM International Conference on Supercomputing, pp 1–14
Bailey M (2001) Credit scoring: the principles and practicalities. White Box Publishing, Bristol
Benoit A, Le Fèvre V, Raghavan P, et al (2020) Design and comparison of resilient scheduling heuristics for parallel jobs. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp 567–576
Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning. Springer, Berlin
Borges G, David M, Gomes J, et al (2007) Sun Grid Engine, a new scheduler for EGEE middleware. In: IBERGRID–Iberian Grid Infrastructure Conference
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Burkov A (2019) The hundred-page machine learning book. Andriy Burkov, Quebec City
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 785–794
Cirne W, Berman F (2001) A comprehensive model of the supercomputer workload. In: IEEE International Workshop on Workload Characterization, pp 140–148
Das A, Mueller F, Rountree B (2020) Aarohi: making real-time node failure prediction feasible. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, pp 1092–1101
Di S, Gupta R, Snir M, et al (2017) LogAider: a tool for mining potential correlations of HPC log events. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’17), IEEE, pp 442–451
Dongarra J, Herault T, Robert Y (2015) Fault tolerance techniques for high-performance computing. Springer, Cham, pp 3–85
Egwutuoha IP, Levy D, Selic B et al (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326
Feitelson D (2022) Parallel workloads archive and standard workload format. http://www.cs.huji.ac.il/labs/parallel/workload, Accessed Nov. 25, 2022
Feitelson DG, Tsafrir D, Krakov D (2014) Experience with using the parallel workloads archive. J Parallel Distrib Comput 74(10):2967–2982
Foss S, Korshunov D, Zachary S (2013) An introduction to heavy-tailed and subexponential distributions. Springer series in operations research and financial engineering, 2nd edn. Springer, New York
Gainaru A, Cappello F, Snir M, et al (2012) Fault prediction under the microscope: a closer look into HPC systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12), IEEE, pp 1–11
Gotoda S, Ito M, Shibata N (2012) Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 260–267
Gupta S, Tiwari D, Jantzi C, et al (2015) Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’15), IEEE, pp 37–44
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Heien E, LaPine D, Kondo D, et al (2011) Modeling and tolerating heterogeneous failures in large parallel systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11)
Hothorn T, Zeileis A (2015) partykit: a modular toolkit for recursive partytioning in R. J Mach Learn Res 16(1):3905–3909
Huang S, Liu Y, Fung C et al (2020) HitAnomaly: hierarchical transformers for anomaly detection in system log. IEEE Trans Netw Serv Manage 17(4):2064–2076
Jin H, Ke T, Chen Y, et al (2012) Checkpointing orchestration: toward a scalable HPC fault-tolerant environment. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 276–283
Lai CD, Xie M, Barlow RE (2006) Stochastic ageing and dependence for reliability. Springer-Verlag, New York
León B, Franco D, Rexachs D et al (2021) Analysis of parallel application checkpoint storage for system configuration. J Supercomput 77(5):4582–4617
León B, Méndez S, Franco D et al (2022) A model of checkpoint behavior for applications that have i/o. J Supercomput 78(13):15404–15436
Li H, Groep D, Wolters L (2004) Workload characteristics of a multi-cluster supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 176–193
Li H, Groep D, Wolters L, et al (2006) Job failure analysis and its implications in a large-scale production grid. In: IEEE International Conference on E-Science and Grid Computing (E-Science’06), IEEE, pp 27–27
Loh WY (2011) Classification and regression trees. Wiley Interdiscip Rev Data Min Knowl Discov 1(1):14–23
Meng W, Liu Y, Zhang S et al (2021) LogClass: anomalous log identification and classification with partial labels. IEEE Trans Netw Serv Manage 18(2):1870–1884
Min JH, Lee YC (2008) A practical approach to credit scoring. Expert Syst Appl 35(4):1762–1770
Naksinehaboon N, Liu Y, Leangsuksun C, et al (2008) Reliability-aware approach: An incremental checkpoint/restart model in HPC environments. In: IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid’08), IEEE, pp 783–788
Nanni L, Lumini A (2009) An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Syst Appl 36(2):3028–3033
Nguyen AT, Reiter S, Rigo P (2014) A review on simulation-based optimization methods applied to building performance analysis. Appl Energy 113:1043–1058
Oliner A, Stearley J (2007) What supercomputers say: a study of five system logs. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), pp 575–584
Parasyris K, Keller K, Bautista-Gomez L, et al (2020) Checkpoint restart support for heterogeneous HPC applications. In: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp 242–251
Park JW (2019) Queue waiting time prediction for large-scale high-performance computing system. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), IEEE, pp 850–855
Park JW, Kim E (2017) Runtime prediction of parallel applications with workload-aware clustering. J Supercomput 73(11):4635–4651
Park JW, Kim E (2018) Exploiting the behavior of the failed job in high performance computing system. In: 2018 18th International Conference on Computational Science and Applications (ICCSA), IEEE, pp 1–3
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Rodrigo Álvarez GP, Östberg PO, Elmroth E, et al (2015) HPC system lifetime story: Workload characterization and evolutionary analyses on NERSC systems. In: ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15), pp 57–60
Roux NL, Schmidt M, Bach F (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In: NIPS’12 - 26th Annual Conference on Neural Information Processing Systems
Schneider D (2022) The exascale era is upon us: the frontier supercomputer may be the first to reach 1,000,000,000,000,000,000 operations per second. IEEE Spectr 59(1):34–35
Schroeder B, Gibson G (2010) A large-scale study of failures in high-performance computing systems. IEEE Trans Depend Secur Comput 7(4):337–350
Tiwari D, Gupta S, Vazhkudai SS (2014) Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In: Proceedings of IEEE/IFIP DSN, pp 25–36
Wu M, Sun XH, Jin H (2007) Performance under failures of high-end computing. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’07), ACM, p 48
Yoon J, Hong T, Park C et al (2015) Stable HPC cluster management scheme through performance evaluation. In: Park JJJH, Stojmenovic I, Jeong HY et al (eds) Computer science and its applications. Springer, Berlin, pp 1017–1023
You H, Zhang H (2012) Comprehensive workload analysis and modeling of a petascale supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 253–271
Yuan Y, Wu Y, Wang Q et al (2012) Job failures in high performance computing systems: a large-scale empirical study. Comput Math Appl 63(2):365–377
Zheng Z, Yu L, Tang W, et al (2011) Co-analysis of RAS log and job log on Blue Gene/P. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS’11), pp 840–851
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
J.P. and C.L. wrote the main manuscript. J.P. and C.L. conducted statistical analysis of failed jobs from the scheduler log. J.P. and X.H. conducted the performance comparison of machine learning models. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
We provide in Fig. 8 the results of MLP models with different structures, i.e., different numbers of neurons per layer and/or numbers of layers. As can be seen from Fig. 8, when we increase the model complexity by using more neurons per layer and/or more MLP layers, the accuracy of the MLP models shows no clear improvement (and sometimes even degrades), while their running time can increase significantly. Therefore, we have used a three-layer MLP model with one hidden layer of 100 neurons for the performance comparison with the other machine learning models. Note that different MLP models may converge after different numbers of training iterations, as we stop training when the training loss fails to improve by at least a tolerance of 0.0001 for ten consecutive iterations. This is why the running time of an MLP model does not necessarily increase with the number of neurons per layer and/or the number of layers. Nonetheless, the three-layer model used in this work achieves the minimum running time in each case.
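The stopping rule above matches the default behavior of scikit-learn's MLPClassifier (the library cited in [54]), where `tol=1e-4` and `n_iter_no_change=10` halt training once the loss stops improving. A minimal sketch of the chosen configuration, using synthetic data as a hypothetical stand-in for the scheduler-log features, might look as follows:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the scheduler-log feature matrix (hypothetical data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Three-layer MLP: input layer, one hidden layer of 100 neurons, output layer.
# Training stops when the loss improves by less than tol=1e-4 for
# n_iter_no_change=10 consecutive iterations (the stopping rule above).
clf = MLPClassifier(hidden_layer_sizes=(100,),
                    tol=1e-4, n_iter_no_change=10,
                    max_iter=500, random_state=0)
clf.fit(X, y)
```

Because the stopping rule is loss-based rather than iteration-based, `clf.n_iter_` varies with the architecture, which is why a larger model can converge sooner and occasionally run faster.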
In addition, we provide in Table 7 the detailed numeric values of the results shown in Fig. 7.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Park, JW., Huang, X. & Lee, CH. Analyzing and predicting job failures from HPC system log. J Supercomput 80, 435–462 (2024). https://doi.org/10.1007/s11227-023-05482-y