Abstract
In this paper, we analyze the scheduler log of a production supercomputer that contains complete job information, in contrast to many existing (publicly available) HPC logs that carry only largely limited job information. We not only provide an in-depth statistical analysis of failed jobs from the scheduler log, but also demonstrate how such a detailed scheduler log can be leveraged to predict job failures. For the latter, we first conduct a feature analysis based on the framework of ‘weight of evidence’ and ‘information value’ to uncover the impact of each workload attribute (feature) on the failure or success of a job, thereby enabling us to identify key features. We then conduct a comparative performance study of six data-driven machine learning models for predicting job failures in an HPC system based on the scheduler log. Our experimental results show that tree-based models exhibit superior performance in terms of both prediction accuracy and computational cost. We also demonstrate that our feature analysis improves the computational efficiency of each machine learning model without degrading its prediction performance.
Data availability
No additional data or materials available.
Notes
Tachyon is the fourth supercomputer at the National Supercomputing Center of the Korea Institute of Science and Technology Information; it provided computing resources to support large-scale national research until 2017. While the fifth supercomputer, Nurion, is currently in operation, its logs are not yet open to the public for security reasons. Thus, we focus on the log of Tachyon in this work.
Note that a job can be running on multiple nodes simultaneously, so it can be associated with multiple hostnames. Thus, when computing the IV value of hostname, we use the hostname of the first node associated with each job.
In general, the values of IV have the following implications [4]: \(\text {IV} < 0.03\) (poor predictor), \(0.03< \text {IV} < 0.1\) (weak predictor), \(0.1< \text {IV} < 0.3\) (average predictor), \(0.3< \text {IV} < 0.5\) (strong predictor), and \(0.5 < \text {IV}\) (very strong predictor).
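The weight-of-evidence and information-value computation described in these notes can be sketched in a few lines. The following is a minimal illustration (not the paper's implementation): for each category of a feature, WoE is the log ratio of that category's share of successful jobs to its share of failed jobs, and IV sums the share differences weighted by WoE.

```python
import math

def information_value(bins):
    """Compute the IV of a categorical feature.

    bins: list of (n_success, n_fail) pairs, one per category
    (e.g., per hostname or per job queue).
    """
    total_s = sum(s for s, f in bins)
    total_f = sum(f for s, f in bins)
    iv = 0.0
    for s, f in bins:
        ps, pf = s / total_s, f / total_f  # shares of successes / failures
        woe = math.log(ps / pf)            # weight of evidence of this category
        iv += (ps - pf) * woe
    return iv
```

For example, a feature whose categories split successes and failures identically yields IV = 0 (a poor predictor), while a feature that separates them sharply yields IV well above 0.5 (a very strong predictor). In practice, categories with zero counts require smoothing before taking the logarithm.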
References
Abdou HA, Pointon J (2011) Credit scoring, statistical techniques and evaluation criteria: a review of the literature. Intell Syst Accounting Financ Manag 18(2–3):59–88
Abeyratne N, Chen HM, Oh B, et al (2016) Checkpointing exascale memory systems with existing memory technologies. In: International Symposium on Memory Systems (MEMSYS’16), ACM, pp 18–29
Alharthi KA, Jhumka A, Di S, et al (2022) Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems. In: Proceedings of the 36th ACM International Conference on Supercomputing, pp 1–14
Bailey M (2001) Credit scoring: the principles and practicalities. White Box Publishing, Bristol
Benoit A, Le Fèvre V, Raghavan P, et al (2020) Design and comparison of resilient scheduling heuristics for parallel jobs. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp 567–576
Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning. Springer, Berlin
Borges G, David M, Gomes J, et al (2007) Sun Grid Engine, a new scheduler for EGEE middleware. In: IBERGRID–Iberian Grid Infrastructure Conference
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Burkov A (2019) The hundred-page machine learning book. Andriy Burkov, Quebec City
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 785–794
Cirne W, Berman F (2001) A comprehensive model of the supercomputer workload. In: IEEE International Workshop on Workload Characterization, pp 140–148
Das A, Mueller F, Rountree B (2020) Aarohi: making real-time node failure prediction feasible. In: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, pp 1092–1101
Di S, Gupta R, Snir M, et al (2017) LogAider: a tool for mining potential correlations of HPC log events. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’17), IEEE, pp 442–451
Dongarra J, Herault T, Robert Y (2015) Fault tolerance techniques for high-performance computing. Springer, Cham, pp 3–85
Egwutuoha IP, Levy D, Selic B et al (2013) A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65(3):1302–1326
Feitelson D (2022) Parallel workloads archive and standard workload format. http://www.cs.huji.ac.il/labs/parallel/workload, Accessed Nov. 25, 2022
Feitelson DG, Tsafrir D, Krakov D (2014) Experience with using the parallel workloads archive. J Parallel Distrib Comput 74(10):2967–2982
Foss S, Korshunov D, Zachary S (2013) An introduction to heavy-tailed and subexponential distributions. Springer series in operations research and financial engineering, 2nd edn. Springer, New York
Gainaru A, Cappello F, Snir M, et al (2012) Fault prediction under the microscope: a closer look into HPC systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12), IEEE, pp 1–11
Gotoda S, Ito M, Shibata N (2012) Task scheduling algorithm for multicore processor system for minimizing recovery time in case of single node fault. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 260–267
Gupta S, Tiwari D, Jantzi C, et al (2015) Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’15), IEEE, pp 37–44
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284
Heien E, LaPine D, Kondo D, et al (2011) Modeling and tolerating heterogeneous failures in large parallel systems. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11)
Hothorn T, Zeileis A (2015) partykit: a modular toolkit for recursive partytioning in R. J Mach Learn Res 16(1):3905–3909
Huang S, Liu Y, Fung C et al (2020) HitAnomaly: hierarchical transformers for anomaly detection in system log. IEEE Trans Netw Serv Manage 17(4):2064–2076
Jin H, Ke T, Chen Y, et al (2012) Checkpointing orchestration: toward a scalable HPC fault-tolerant environment. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’12), IEEE, pp 276–283
Lai CD, Xie M, Barlow RE (2006) Stochastic ageing and dependence for reliability. Springer-Verlag, New York
León B, Franco D, Rexachs D et al (2021) Analysis of parallel application checkpoint storage for system configuration. J Supercomput 77(5):4582–4617
León B, Méndez S, Franco D et al (2022) A model of checkpoint behavior for applications that have i/o. J Supercomput 78(13):15404–15436
Li H, Groep D, Wolters L (2004) Workload characteristics of a multi-cluster supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 176–193
Li H, Groep D, Wolters L, et al (2006) Job failure analysis and its implications in a large-scale production grid. In: IEEE International Conference on E-Science and Grid Computing (E-Science’06), IEEE, pp 27–27
Loh WY (2011) Classification and regression trees. Wiley Interdiscip Rev Data Min Knowl Discov 1(1):14–23
Meng W, Liu Y, Zhang S et al (2021) LogClass: anomalous log identification and classification with partial labels. IEEE Trans Netw Serv Manage 18(2):1870–1884
Min JH, Lee YC (2008) A practical approach to credit scoring. Expert Syst Appl 35(4):1762–1770
Naksinehaboon N, Liu Y, Leangsuksun C, et al (2008) Reliability-aware approach: An incremental checkpoint/restart model in HPC environments. In: IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid’08), IEEE, pp 783–788
Nanni L, Lumini A (2009) An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Syst Appl 36(2):3028–3033
Nguyen AT, Reiter S, Rigo P (2014) A review on simulation-based optimization methods applied to building performance analysis. Appl Energy 113:1043–1058
Oliner A, Stearley J (2007) What supercomputers say: a study of five system logs. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), pp 575–584
Parasyris K, Keller K, Bautista-Gomez L, et al (2020) Checkpoint restart support for heterogeneous HPC applications. In: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp 242–251
Park JW (2019) Queue waiting time prediction for large-scale high-performance computing system. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), IEEE, pp 850–855
Park JW, Kim E (2017) Runtime prediction of parallel applications with workload-aware clustering. J Supercomput 73(11):4635–4651
Park JW, Kim E (2018) Exploiting the behavior of the failed job in high performance computing system. In: 2018 18th International Conference on Computational Science and Applications (ICCSA), IEEE, pp 1–3
Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Rodrigo Álvarez GP, Östberg PO, Elmroth E, et al (2015) HPC system lifetime story: Workload characterization and evolutionary analyses on NERSC systems. In: ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15), pp 57–60
Roux NL, Schmidt M, Bach F (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. In: NIPS’12 - 26th Annual Conference on Neural Information Processing Systems
Schneider D (2022) The exascale era is upon us: the frontier supercomputer may be the first to reach 1,000,000,000,000,000,000 operations per second. IEEE Spectr 59(1):34–35
Schroeder B, Gibson G (2010) A large-scale study of failures in high-performance computing systems. IEEE Trans Depend Secur Comput 7(4):337–350
Tiwari D, Gupta S, Vazhkudai SS (2014) Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In: Proceedings of IEEE/IFIP DSN, pp 25–36
Wu M, Sun XH, Jin H (2007) Performance under failures of high-end computing. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC’07), ACM, p 48
Yoon J, Hong T, Park C et al (2015) Stable HPC cluster management scheme through performance evaluation. In: Park JJJH, Stojmenovic I, Jeong HY et al (eds) Computer science and its applications. Springer, Berlin, pp 1017–1023
You H, Zhang H (2012) Comprehensive workload analysis and modeling of a petascale supercomputer. In: Workshop on Job Scheduling Strategies for Parallel Processing, pp 253–271
Yuan Y, Wu Y, Wang Q et al (2012) Job failures in high performance computing systems: a large-scale empirical study. Comput Math Appl 63(2):365–377
Zheng Z, Yu L, Tang W, et al (2011) Co-analysis of RAS log and job log on Blue Gene/P. In: IEEE International Parallel & Distributed Processing Symposium (IPDPS’11), pp 840–851
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
J.P. and C.L. wrote the main manuscript. J.P. and C.L. conducted statistical analysis of failed jobs from the scheduler log. J.P. and X.H. conducted the performance comparison of machine learning models. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
We provide in Fig. 8 the results of MLP models with different structures, i.e., different numbers of neurons per layer and/or numbers of layers. As can be seen from Fig. 8, when we increase the model complexity by using more neurons per layer and/or more MLP layers, the accuracy of the MLP models shows no clear improvement (and sometimes even degrades), while their running time can increase significantly. Therefore, we have used a three-layer MLP model with one hidden layer of 100 neurons for the performance comparison with the other machine learning models. Note that different MLP models may converge after different numbers of training iterations, as we stop training when the training loss fails to improve by at least a tolerance of 0.0001 for ten consecutive iterations. This is why the running time of an MLP model does not necessarily increase with the number of neurons per layer and/or the number of layers. Nonetheless, the three-layer model used in this work achieves the minimum running time in each case.
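The stopping rule above matches the default behavior of scikit-learn's MLPClassifier (the library cited in [54]), where `tol=1e-4` and `n_iter_no_change=10` halt training once the loss stops improving. A minimal sketch of the chosen configuration, using synthetic data as a hypothetical stand-in for the scheduler-log features, might look as follows:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the scheduler-log feature matrix (hypothetical data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Three-layer MLP: input layer, one hidden layer of 100 neurons, output layer.
# Training stops when the loss improves by less than tol=1e-4 for
# n_iter_no_change=10 consecutive iterations (the stopping rule above).
clf = MLPClassifier(hidden_layer_sizes=(100,),
                    tol=1e-4, n_iter_no_change=10,
                    max_iter=500, random_state=0)
clf.fit(X, y)
```

Because the stopping rule is loss-based rather than iteration-based, `clf.n_iter_` varies with the architecture, which is why a larger model can converge sooner and occasionally run faster.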
In addition, we provide in Table 7 the detailed numeric values of the results shown in Fig. 7.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Park, JW., Huang, X. & Lee, CH. Analyzing and predicting job failures from HPC system log. J Supercomput 80, 435–462 (2024). https://doi.org/10.1007/s11227-023-05482-y