The I/O Trace Initiative: Building a Collaborative I/O Archive to Advance HPC

Published: 12 November 2023

Abstract

HPC application developers and administrators need to understand the complex interplay between compute clusters and storage systems to make effective optimization decisions. Ad hoc investigations of this interplay based on isolated case studies can lead to conclusions that are incorrect or difficult to generalize. The I/O Trace Initiative aims to improve the scientific community’s understanding of I/O operations by building a searchable collaborative archive of I/O traces from a wide range of applications and machines, with a focus on high-performance computing and scalable AI/ML. This initiative advances the accessibility of I/O trace data by enabling users to locate and compare traces based on user-specified criteria. It also provides a visual analytics platform for in-depth analysis, paving the way for the development of advanced performance optimization techniques. By acting as a hub for trace data, the initiative fosters collaborative research by encouraging data sharing and collective learning.
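The trace-search capability described above can be sketched as a simple metadata filter. This is a hypothetical illustration only: the record fields (`app`, `machine`, `api`, `ranks`) and the `search` helper are assumed names, not the initiative's actual schema or API.

```python
# Hypothetical sketch: locating traces in an archive by user-specified
# criteria. All field names and sample records are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class TraceRecord:
    app: str          # application that produced the trace
    machine: str      # system the run executed on
    api: str          # I/O interface used, e.g. "POSIX" or "MPI-IO"
    ranks: int        # number of MPI ranks in the run

def search(records, **criteria):
    """Return records whose fields match every given criterion exactly."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]

# A toy in-memory "archive" standing in for the searchable trace index.
archive = [
    TraceRecord("hacc", "theta", "POSIX", 1024),
    TraceRecord("hacc", "summit", "MPI-IO", 4096),
    TraceRecord("e3sm", "cori", "MPI-IO", 2048),
]

# Locate all MPI-IO traces for side-by-side comparison.
mpiio_runs = search(archive, api="MPI-IO")
```

A real deployment would back this kind of query with a search engine index rather than a Python list, but the user-facing idea is the same: filter trace metadata by criteria, then compare the matching runs.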

Supplemental Material

MP4 File
Recording of "The I/O Trace Initiative: Building a Collaborative I/O Archive to Advance HPC" presentation at PDSW 2023.


Cited By

  • (2024) ION: Navigating the HPC I/O Optimization Journey using Large Language Models. In Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, 86–92. https://doi.org/10.1145/3655038.3665950. Online publication date: 8 July 2024.
  • (2024) IO-SEA: Storage I/O and Data Management for Exascale Architectures. In Proceedings of the 21st ACM International Conference on Computing Frontiers: Workshops and Special Sessions, 94–100. https://doi.org/10.1145/3637543.3654620. Online publication date: 7 May 2024.
  • (2024) Comparability and Reproducibility in HPC Applications' Energy Consumption Characterization. In Proceedings of the 15th ACM International Conference on Future and Sustainable Energy Systems, 560–568. https://doi.org/10.1145/3632775.3662162. Online publication date: 4 June 2024.

Published In

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
November 2023
2180 pages
ISBN:9798400707858
DOI:10.1145/3624062
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. High Performance Computing
  2. I/O profiling
  3. Storage systems

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC-W 2023


