Modeling and tolerating heterogeneous failures in large parallel systems

Authors:

Franck Cappello

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 45, Pages 1 - 11

https://doi.org/10.1145/2063384.2063444

Published: 12 November 2011

Abstract

As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components of such systems (such as memory, disk, or network) can have different failure rates. Prior work assumes that all failures affect an application equally, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time.

To this end, we study five years of system logs from a production high-performance computing system and model hardware failures involving processor, memory, storage, and network components. We model each component and construct integrated failure models given the component usage of common supercomputing applications. We show that these application-centric models provide more accurate reliability estimates than general models, which improves the efficacy of fault-tolerant algorithms. In particular, we demonstrate how applications can tune their checkpointing strategies to the tailored model.
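
The idea of tuning a checkpoint interval to an application-specific failure model can be illustrated with a minimal sketch. It is not the paper's method: it assumes independent, exponentially distributed component failures (so rates simply add) and uses Young's well-known first-order approximation for the checkpoint interval. The failure rates, usage weights, checkpoint cost, and all function and variable names below are hypothetical values chosen for illustration.

```python
import math


def application_failure_rate(component_rates, usage):
    """Combine per-component failure rates into one application-level rate.

    Assumption: components fail independently with constant rates, so the
    rates add. `usage` weights each component by the assumed fraction of its
    failures that actually affect the application (0.0 = never, 1.0 = always).
    """
    return sum(rate * usage.get(name, 0.0) for name, rate in component_rates.items())


def young_interval(checkpoint_cost_h, mtbf_h):
    """Young's first-order approximation: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_h * mtbf_h)


# Hypothetical failure rates in failures per hour (not taken from the paper).
rates = {"cpu": 0.001, "memory": 0.004, "storage": 0.003, "network": 0.002}

# General model: every component failure is assumed to hit the application.
general_usage = {name: 1.0 for name in rates}

# Tailored model: a hypothetical application that never touches storage
# and is exposed to only half of the network failures.
tailored_usage = {"cpu": 1.0, "memory": 1.0, "storage": 0.0, "network": 0.5}

checkpoint_cost_h = 0.1  # hypothetical cost of writing one checkpoint, in hours

general_mtbf_h = 1.0 / application_failure_rate(rates, general_usage)
tailored_mtbf_h = 1.0 / application_failure_rate(rates, tailored_usage)

general_interval_h = young_interval(checkpoint_cost_h, general_mtbf_h)
tailored_interval_h = young_interval(checkpoint_cost_h, tailored_mtbf_h)

print(f"general model:  MTBF {general_mtbf_h:.1f} h, checkpoint every {general_interval_h:.2f} h")
print(f"tailored model: MTBF {tailored_mtbf_h:.1f} h, checkpoint every {tailored_interval_h:.2f} h")
```

Under these assumed numbers the tailored model yields a longer mean time between failures, and therefore a longer checkpoint interval, than the general model. An application that stresses a failure-prone component more heavily would see the opposite effect.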

Index Terms

Modeling and tolerating heterogeneous failures in large parallel systems
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. General and reference
  1. Cross-computing tools and techniques
    1. Reliability

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

November 2011

866 pages

ISBN:9781450307710

DOI:10.1145/2063384

Conference Chair: Scott Lathrop (University of Chicago)
Program Chairs: Jim Costa (Sandia National Laboratories), William Kramer (National Center for Supercomputing Applications)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy

Conference

SC '11

SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 18, 2011

Seattle, Washington

Acceptance Rates

SC '11 Paper Acceptance Rate 74 of 352 submissions, 21%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
