Scalable Delivery of Scalable Libraries and Tools: How ECP Delivered a Software Ecosystem for Exascale and Beyond

Abstract: The Exascale Computing Project (ECP) was one of the largest open-source scientific software development projects ever. It supported approximately 1,000 staff from US Department of Energy laboratories, and university and industry partners. About 250 staff contributed to 70 scientific libraries and tools to support applications on multiple exascale computing systems that were also under development. Funded as a construction project, ECP adopted an earned-value management system, based on milestones, and a key performance parameter system based, in part, on integrations. With accelerated delivery schedules and significant project risk, we also emphasized software quality using community policies, automated testing, and continuous integration. Software Development Kit teams provided cross-team collaboration. Products were delivered via E4S, a curated portfolio of libraries and tools. In this paper, we discuss the organizational and management elements that enabled the efficient and effective delivery of ECP libraries and tools, lessons learned, and next steps.

Submitted 12 November, 2023; originally announced November 2023.

Comments: 9 pages, 5 figures, submitted to IEEE Computing in Science and Engineering

arXiv:2311.02010 [pdf, other]

A cast of thousands: How the IDEAS Productivity project has advanced software productivity and sustainability

Authors: Lois Curfman McInnes, Michael Heroux, David E. Bernholdt, Anshu Dubey, Elsa Gonsiorowski, Rinku Gupta, Osni Marques, J. David Moulton, Hai Ah Nam, Boyana Norris, Elaine M. Raybourn, Jim Willenbring, Ann Almgren, Ross Bartlett, Kita Cranfill, Stephen Fickas, Don Frederick, William Godoy, Patricia Grubel, Rebecca Hartman-Baker, Axel Huebl, Rose Lynch, Addi Malviya Thakur, Reed Milewicz, Mark C. Miller, et al. (9 additional authors not shown)

Abstract: Computational and data-enabled science and engineering are revolutionizing advances throughout science and society, at all scales of computing. For example, teams in the U.S. DOE Exascale Computing Project have been tackling new frontiers in modeling, simulation, and analysis by exploiting unprecedented exascale computing capabilities, building an advanced software ecosystem that supports next-generation applications and addresses disruptive changes in computer architectures. However, concerns are growing about the productivity of the developers of scientific software, its sustainability, and the trustworthiness of the results that it produces. Members of the IDEAS project serve as catalysts to address these challenges through fostering software communities, incubating and curating methodologies and resources, and disseminating knowledge to advance developer productivity and software sustainability. This paper discusses how these synergistic activities are advancing scientific discovery, mitigating technical risks by building a firmer foundation for reproducible, sustainable science at all scales of computing, from laptops to clusters to exascale and beyond.

Submitted 16 February, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: 12 pages, 1 figure

arXiv:2211.09034 [pdf, other]

Research Software Science: Expanding the Impact of Research Software Engineering

Authors: Michael A. Heroux

Abstract: Software plays a central role in scientific discovery. Improving how we develop and use software for research can have both broad and deep impacts on a spectrum of challenges and opportunities society faces today. The emergence of Research Software Engineer (RSE) as a role correlates with the growing complexity of scientific challenges and diversity of software team skills. In this paper, we describe research software science (RSS), an idea related to RSE, and particularly suited to research software teams. RSS promotes the use of scientific methodologies to explore and establish broadly applicable knowledge. Using RSS, we can pursue sustainable, repeatable, and reproducible software improvements that positively impact research software toward improved scientific discovery.

Submitted 16 November, 2022; originally announced November 2022.

Comments: Submitted to IEEE Computing in Science and Engineering

arXiv:1811.08473 [pdf, other]

doi 10.1109/MCSE.2018.2883051

Community Organizations: Changing the Culture in Which Research Software Is Developed and Sustained

Authors: Daniel S. Katz, Lois Curfman McInnes, David E. Bernholdt, Abigail Cabunoc Mayes, Neil P. Chue Hong, Jonah Duckles, Sandra Gesing, Michael A. Heroux, Simon Hettrick, Rafael C. Jimenez, Marlon Pierce, Belinda Weaver, Nancy Wilkins-Diehr

Abstract: Software is the key crosscutting technology that enables advances in mathematics, computer science, and domain-specific science and engineering to achieve robust simulations and analysis for science, engineering, and other research fields. However, software itself has not traditionally received focused attention from research communities; rather, software has evolved organically and inconsistently, with its development largely as by-products of other initiatives. Moreover, challenges in scientific software are expanding due to disruptive changes in computer hardware, increasing scale and complexity of data, and demands for more complex simulations involving multiphysics, multiscale modeling and outer-loop analysis. In recent years, community members have established a range of grass-roots organizations and projects to address these growing technical and social challenges in software productivity, quality, reproducibility, and sustainability. This article provides an overview of such groups and discusses opportunities to leverage their synergistic activities while nurturing work toward emerging software ecosystems.

Submitted 7 December, 2018; v1 submitted 20 November, 2018; originally announced November 2018.

arXiv:1704.09004 [pdf, other]

Kanban + X: Leveraging Kanban for Focused Improvements

Authors: Adam J. Hey, Michael A. Heroux

Abstract: Agile Development is used for many problems, often with different priorities and challenges. However, generalized engineering methodologies often overlook the particularities of a project. To solve this problem, we have looked at ways engineers have modified development methodologies for a particular focus, and created a generalized framework for leveraging Kanban towards focused improvements. The result is a parallel iterative board that tracks and visualizes progress towards a focus, which we have applied to security, sustainability, and high performance as examples. Through use of this system, software projects can be more focused and directed towards their goals.

Submitted 28 April, 2017; originally announced April 2017.

Comments: 7 pages, 6 figures

ACM Class: D.2.2

arXiv:1702.08425 [pdf, other]

xSDK Foundations: Toward an Extreme-scale Scientific Software Development Kit

Authors: Roscoe Bartlett, Irina Demeshko, Todd Gamblin, Glenn Hammond, Michael Heroux, Jeffrey Johnson, Alicia Klinvex, Xiaoye Li, Lois Curfman McInnes, J. David Moulton, Daniel Osei-Kuffuor, Jason Sarich, Barry Smith, Jim Willenbring, Ulrike Meier Yang

Abstract: Extreme-scale computational science increasingly demands multiscale and multiphysics formulations. Combining software developed by independent groups is imperative: no single team has resources for all predictive science and decision support capabilities. Scientific libraries provide high-quality, reusable software components for constructing applications with improved robustness and portability. However, without coordination, many libraries cannot be easily composed. Namespace collisions, inconsistent arguments, lack of third-party software versioning, and additional difficulties make composition costly. The Extreme-scale Scientific Software Development Kit (xSDK) defines community policies to improve code quality and compatibility across independently developed packages (hypre, PETSc, SuperLU, Trilinos, and Alquimia) and provides a foundation for addressing broader issues in software interoperability, performance portability, and sustainability. The xSDK provides turnkey installation of member software and seamless combination of aggregate capabilities, and it marks first steps toward extreme-scale scientific software ecosystems from which future applications can be composed rapidly with assured quality and scalability.

Submitted 27 February, 2017; originally announced February 2017.

Comments: 14 pages

ACM Class: D.2.0; D.2.2; D.2.11

arXiv:1610.02608 [pdf, other]

Research and Education in Computational Science and Engineering

Authors: Ulrich Rüde, Karen Willcox, Lois Curfman McInnes, Hans De Sterck, George Biros, Hans Bungartz, James Corones, Evin Cramer, James Crowley, Omar Ghattas, Max Gunzburger, Michael Hanke, Robert Harrison, Michael Heroux, Jan Hesthaven, Peter Jimack, Chris Johnson, Kirk E. Jordan, David E. Keyes, Rolf Krause, Vipin Kumar, Stefan Mayer, Juan Meza, Knut Martin Mørken, J. Tinsley Oden, et al. (8 additional authors not shown)

Abstract: Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments, including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers, is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade.

Submitted 31 December, 2017; v1 submitted 8 October, 2016; originally announced October 2016.

Comments: Major revision, to appear in SIAM Review

Report number: Argonne National Laboratory Preprint ANL/MCS-P6054-0916

MSC Class: 00A72; 62-07; 68U20; 68W01; 68W10; 97A99; 97M10; 97N80; 97R20; 97R30

ACM Class: G.0; G.4; I.6; J.0; J.2; J.3; J.4; J.6; J.7; K.3.2

arXiv:1402.3809 [pdf, ps, other]

Toward Resilient Algorithms and Applications

Authors: Michael A. Heroux

Abstract: Over the past decade, the high performance computing community has become increasingly concerned that preserving the reliable, digital machine model will become too costly or infeasible. In this paper we discuss four approaches for developing new algorithms that are resilient to hard and soft failures.

Submitted 13 March, 2014; v1 submitted 16 February, 2014; originally announced February 2014.

ACM Class: C.4; D.1.3; D.4.5; G.1.0

arXiv:1307.6638 [pdf, ps, other]

Supporting 64-bit global indices in Epetra and other Trilinos packages -- Techniques used and lessons learned

Authors: Chetan Jhurani, Travis M. Austin, Michael A. Heroux, James M. Willenbring

Abstract: The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries within an object-oriented framework. It is intended for large-scale, complex multiphysics engineering and scientific applications. Epetra is one of its basic packages. It provides serial and parallel linear algebra capabilities. Before Trilinos version 11.0, released in 2012, Epetra used the C++ int data-type for storing global and local indices for degrees of freedom (DOFs). Since int is typically 32-bit, this limited the largest problem size to be smaller than approximately two billion DOFs. This was true even if a distributed memory machine could handle larger problems. We have added optional support for C++ long long data-type, which is at least 64-bit wide, for global indices. To save memory, maintain the speed of memory-bound operations, and reduce further changes to the code, the local indices are still 32-bit. We document the changes required to achieve this feature and how the new functionality can be used. We also report on the lessons learned in modifying a mature and popular package from various perspectives -- design goals, backward compatibility, engineering decisions, C++ language features, effects on existing users and other packages, and build integration.
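
Note: the index arithmetic described in this abstract can be illustrated with a minimal Python sketch. It is not Epetra's API. The function names and the contiguous block distribution are assumptions made for illustration only; the abstract does not say how indices are distributed. The sketch shows why a global index can overflow a 32-bit int while each process's local index still fits.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # largest value of a typical 32-bit C++ int


def fits_in_int32(value):
    """Return True if value is representable as a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX


def block_range(num_global, num_ranks, rank):
    """Half-open range [first, last) of global indices owned by `rank`.

    Assumes a contiguous block distribution (hypothetical, for illustration):
    the first `num_global % num_ranks` ranks own one extra index.
    """
    base, extra = divmod(num_global, num_ranks)
    first = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return first, first + count


def global_to_local(gid, num_global, num_ranks, rank):
    """Map a global index to a local index on the rank that owns it."""
    first, last = block_range(num_global, num_ranks, rank)
    if not first <= gid < last:
        raise ValueError("global index is not owned by this rank")
    return gid - first


if __name__ == "__main__":
    num_global = 3_000_000_000  # more DOFs than a 32-bit int can index
    num_ranks = 1024
    gid = num_global - 1        # largest global index, owned by the last rank
    lid = global_to_local(gid, num_global, num_ranks, num_ranks - 1)
    print("global index fits in int32:", fits_in_int32(gid))
    print("local index fits in int32:", fits_in_int32(lid))
```

Under these assumptions, only the global index needs a 64-bit type. The per-process index stays small as long as no single process owns more than about two billion entries.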

Submitted 25 July, 2013; originally announced July 2013.

arXiv:1206.1390 [pdf, other]

Fault-tolerant linear solvers via selective reliability

Authors: Patrick G. Bridges, Kurt B. Ferreira, Michael A. Heroux, Mark Hoemmen

Abstract: Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest parallel computers being built and planned today. As processor counts continue to grow, the cost of ensuring reliability consistently throughout an application will become unbearable. However, many algorithms only need reliability for certain data and phases of computation. This suggests an algorithm and system codesign approach. We show that if the system lets applications apply reliability selectively, we can develop algorithms that compute the right answer despite faults. These "fault-tolerant" iterative methods either converge eventually, at a rate that degrades gracefully with increased fault rate, or return a clear failure indication in the rare case that they cannot converge. Furthermore, they store most of their data unreliably, and spend most of their time in unreliable mode. We demonstrate this for the specific case of detected but uncorrectable memory faults, which we argue are representative of all kinds of faults. We developed a cross-layer application / operating system framework that intercepts and reports uncorrectable memory faults to the application, rather than killing the application, as current operating systems do. The application in turn can mark memory allocations as subject to such faults. Using this framework, we wrote a fault-tolerant iterative linear solver using components from the Trilinos solvers library. Our solver exploits hybrid parallelism (MPI and threads). It performs just as well as other solvers if no faults occur, and converges where other solvers do not in the presence of faults. We show convergence results for representative test problems. Near-term future work will include performance tests.

Submitted 6 June, 2012; originally announced June 2012.

Showing 1–10 of 10 results for author: Heroux, M