research-article

Modeling and tolerating heterogeneous failures in large parallel systems

Published: 12 November 2011

Abstract

As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time.
To this end, we study 5 years of system logs from a production high-performance computing system and model hardware failures involving processors, memory, storage and network components. We model each component and construct integrated failure models given the component usage of common supercomputing applications. We show that these application-centric models provide more accurate reliability estimates compared to general models, which improves the efficacy of fault-tolerant algorithms. In particular, we demonstrate how applications can tune their checkpointing strategies to the tailored model.
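
To illustrate how an application-specific failure model could change a checkpointing decision, here is a minimal sketch. It is not the authors' method: it assumes independent, exponentially distributed component failures (so that rates add) and uses Young's first-order approximation for the checkpoint interval. The failure rates, the usage weights, and the function names `combined_failure_rate` and `young_interval` are all hypothetical and chosen only for illustration.

```python
import math


def combined_failure_rate(component_rates, usage):
    """Return the aggregate failures-per-hour seen by one application.

    Assumption: component failures are independent and exponential, so
    their rates add. `usage` maps a component name to a weight in [0, 1];
    components the application does not use contribute nothing.
    """
    return sum(rate * usage.get(name, 0.0)
               for name, rate in component_rates.items())


def young_interval(checkpoint_cost_h, mtbf_h):
    """Young's first-order approximation: sqrt(2 * C * MTBF), in hours."""
    return math.sqrt(2.0 * checkpoint_cost_h * mtbf_h)


# Hypothetical per-hour failure rates (NOT taken from the paper).
rates = {
    "cpu": 1 / 2000,
    "memory": 1 / 1000,
    "storage": 1 / 500,
    "network": 1 / 4000,
}

# A general model charges every component failure to every application.
general = {name: 1.0 for name in rates}
# A hypothetical compute-bound application that never touches storage.
compute_bound = {"cpu": 1.0, "memory": 1.0, "network": 1.0}

mtbf_general = 1.0 / combined_failure_rate(rates, general)
mtbf_app = 1.0 / combined_failure_rate(rates, compute_bound)

checkpoint_cost_h = 0.1  # hypothetical cost of writing one checkpoint
interval_general = young_interval(checkpoint_cost_h, mtbf_general)
interval_app = young_interval(checkpoint_cost_h, mtbf_app)

print(f"general model:  MTBF {mtbf_general:.1f} h, interval {interval_general:.2f} h")
print(f"tailored model: MTBF {mtbf_app:.1f} h, interval {interval_app:.2f} h")
```

Under these assumed numbers the tailored model yields a longer mean time between failures than the general one, so the application can checkpoint less often. Real component failures are often not exponential, in which case the simple rate sum and Young's formula would need to be replaced by a fitted distribution.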

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 7
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 22 Oct 2024

Cited By

  • (2024) Checkpointing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms. ACM Transactions on Parallel Computing 11:1 (1-26). DOI: 10.1145/3624560. Online publication date: 11-Mar-2024
  • (2024) A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly? Future Generation Computer Systems 161 (315-328). DOI: 10.1016/j.future.2024.07.022. Online publication date: Dec-2024
  • (2023) Analyzing and predicting job failures from HPC system log. The Journal of Supercomputing 80:1 (435-462). DOI: 10.1007/s11227-023-05482-y. Online publication date: 24-Jun-2023
  • (2022) Checkpointing à la Young/Daly: An Overview. Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing (701-710). DOI: 10.1145/3549206.3549328. Online publication date: 4-Aug-2022
  • (2022) Intelligent failure localization and maintenance of network based on reliability. The Journal of Supercomputing 79:1 (389-418). DOI: 10.1007/s11227-022-04653-7. Online publication date: 11-Jul-2022
  • (2022) Exploring the Impact of Node Failures on the Resource Allocation for Parallel Jobs. Euro-Par 2021: Parallel Processing Workshops (298-309). DOI: 10.1007/978-3-031-06156-1_24. Online publication date: 9-Jun-2022
  • (2021) Using VDMS to index and search 100M images. Proceedings of the VLDB Endowment 14:12 (3240-3252). DOI: 10.14778/3476311.3476381. Online publication date: 28-Oct-2021
  • (2021) Realizing Best Checkpointing Control in Computing Systems. IEEE Transactions on Parallel and Distributed Systems 32:2 (315-329). DOI: 10.1109/TPDS.2020.3015805. Online publication date: 1-Feb-2021
  • (2021) Improving checkpointing intervals by considering individual job failure probabilities. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (299-309). DOI: 10.1109/IPDPS49936.2021.00038. Online publication date: May-2021
  • (2020) PACEMAKER. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (369-385). DOI: 10.5555/3488766.3488787. Online publication date: 4-Nov-2020
