Research Article | Open Access

Algorithms for Efficient Reproducible Floating Point Summation

Published: 21 July 2020

Abstract

We define “reproducibility” as getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should not affect the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociative floating point addition, makes reproducibility challenging even for summation, or for operations like those in the BLAS. We describe a “reproducible accumulator” data structure (the “binned number”) and associated algorithms to reproducibly sum binary floating point numbers, independent of summation order. We use a subset of the IEEE Floating Point Standard 754-2008 and bitwise operations on the standard representations in memory. Our approach requires only one read-only pass over the data, and one reduction in parallel, using a 6-word reproducible accumulator (more words can be used for higher accuracy), enabling standard tiling optimization techniques. Summing n words with a 6-word reproducible accumulator requires approximately 9n floating point operations (arithmetic, comparison, and absolute value) and approximately 3n bitwise operations. The final error bound with a 6-word reproducible accumulator and our default settings can be up to 2^29 times smaller than the error bound for conventional (recursive) summation on ill-conditioned double-precision inputs.
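To make the core idea concrete, here is a minimal, one-level sketch in C of the extraction (pre-rounding) step that makes a sum order-independent: each summand's high-order bits relative to a common power-of-two split point are accumulated exactly, so the accumulated value cannot depend on the order of the additions. This is an illustration only, not the paper's 6-word binned accumulator or its exact algorithm; the function name repro_sum_high and the particular choice of the split point sigma are assumptions made for this example.

/*
 * Illustrative one-level sketch of reproducible (order-independent)
 * summation via extraction of high-order parts.  Assumes IEEE-754
 * double precision, round-to-nearest, no overflow, and strict
 * evaluation (compile without fast-math optimizations).
 */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Reproducible leading part of sum(x[0..n-1]).  The name and the exact
 * choice of sigma are illustrative, not the paper's. */
static double repro_sum_high(const double *x, size_t n)
{
    /* Maximum magnitude: a reduction whose result is order-independent. */
    double amax = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = fabs(x[i]);
        if (a > amax) amax = a;
    }
    if (amax == 0.0) return 0.0;

    /* Power-of-two split point sigma >= 2 * n * amax.  High parts are then
     * multiples of ulp(sigma)/2 and their partial sums stay below sigma,
     * so every addition into sum_hi is exact and order does not matter. */
    int e;
    frexp(amax, &e);                              /* amax < 2^e */
    volatile double sigma = ldexp(1.0, e + (int)ceil(log2((double)n)) + 1);

    double sum_hi = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* q = fl((x[i] + sigma) - sigma): the bits of x[i] at or above
         * roughly ulp(sigma).  volatile keeps the compiler from
         * simplifying the expression algebraically. */
        volatile double t = x[i] + sigma;
        double q = t - sigma;
        sum_hi += q;          /* exact addition, hence reproducible */
        /* The remainder x[i] - q would feed the next, finer level in a
         * multi-level (binned) scheme; it is dropped in this sketch. */
    }
    return sum_hi;
}

int main(void)
{
    double x[] = { 1e16, 3.14, -2.71, 1e-8, -1e16, 42.0 };
    size_t n = sizeof x / sizeof x[0];
    double y[6];
    for (size_t i = 0; i < n; i++) y[i] = x[n - 1 - i];

    /* Same data in two orders: the extracted sums agree bitwise, while a
     * plain left-to-right sum generally would not. */
    printf("forward  = %.17g\n", repro_sum_high(x, n));
    printf("reversed = %.17g\n", repro_sum_high(y, n));
    return 0;
}

Each addition into sum_hi is exact because the high parts share a common grid (multiples of half an ulp of sigma) and the partial sums never exceed sigma, which is precisely why the result cannot depend on summation order; the extra words of the paper's reproducible accumulator recover the low-order bits that this single level discards.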

References

[1]
IEEE. 2008. IEEE standard for floating-point arithmetic. IEEE Std 754-2008 (Aug. 2008), 1--70.
[2]
Intel. 2018. Developer Reference for Intel® Math Kernel Library 2018 - C | Intel® Software. Retrieved from https://software.intel.com/en-us/download/developer-reference-for-intel-math-kernel-library-2018-c.
[3]
NVIDIA. 2018. NVIDIA® cuBLAS. Retrieved from http://docs.nvidia.com/cuda/cublas/index.html.
[4]
Intel. 2019. bfloat16 - Hardware Numerics Definition. Retrieved from https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf.
[5]
IEEE. 2019. IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (July 2019), 1--84.
[6]
R. Alverson. 1991. Integer division using reciprocals. In Proceedings of the Symposium on Computer Arithmetic (ARITH’91). 186--190.
[7]
A. Arteaga, O. Fuhrer, and T. Hoefler. 2014. Designing bit-reproducible portable high-performance applications. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’14). 1235--1244.
[8]
C. Chohra, P. Langlois, and D. Parello. 2015. Efficiency of reproducible level 1 BLAS. In Proceedings of the Conference on Scientific Computing, Computer Arithmetic, and Validated Numerics (SCAN’15). Springer, Cham, 99--108.
[9]
C. Chohra, P. Langlois, and D. Parello. 2016. Reproducible, accurately rounded and efficient BLAS. In Proceedings of the Euro-Par Parallel Processing Workshops. Springer, Cham, 609--620.
[10]
S. Collange, D. Defour, S. Graillat, and R. Iakymchuk. 2015. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Comput. 49 (Nov. 2015), 83--97.
[11]
T. J. Dekker. 1971. A floating-point technique for extending the available precision. Numer. Math. 18, 3 (June 1971), 224--242.
[12]
J. Demmel, G. Gopalakrishnan, M. Heroux, W. Keyrouz, and K. Sato. 2015. Reproducibility of high performance codes and simulations: Tools, techniques, debugging. In Proceedings of the SC 2015 Birds of a Feather Sessions. Retrieved from https://gcl.cis.udel.edu/sc15bof.php.
[13]
J. Demmel and Y. Hida. 2004. Accurate and efficient floating point summation. SIAM J. Sci. Comput. 25, 4 (Jan. 2004), 1214--1248.
[14]
J. Demmel and H. D. Nguyen. 2013. Fast reproducible floating-point summation. In Proceedings of the Symposium on Computer Arithmetic (ARITH’13). 163--172.
[15]
J. Demmel and H. D. Nguyen. 2015. Parallel reproducible summation. IEEE Trans. Comput. 64, 7 (July 2015), 2060--2070.
[16]
T. Granlund and P. L. Montgomery. 1994. Division by invariant integers using multiplication. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI’94). 61--72.
[17]
Y. Hida, X. S. Li, and D. H. Bailey. 2001. Algorithms for quad-double precision floating point arithmetic. In Proceedings of the Symposium on Computer Arithmetic (ARITH’01). 155--162.
[18]
N. Higham. 1993. The accuracy of floating point summation. SIAM J. Sci. Comput. 14, 4 (July 1993), 783--799.
[19]
N. Higham. 2002. Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics.
[20]
D. G. Hough. 2019. The IEEE standard 754: One for the history books. Computer 52, 12 (Dec. 2019), 109--112.
[21]
R. Iakymchuk, S. Collange, D. Defour, and S. Graillat. 2015. ExBLAS: Reproducible and accurate BLAS library. In Proceedings of the SC 2015 Numerical Reproducibility at Exascale Workshops (NRE’15). Retrieved from https://hal.archives-ouvertes.fr/hal-01202396.
[22]
R. Iakymchuk, D. Defour, S. Collange, and S. Graillat. 2015. Reproducible and accurate matrix multiplication. In Proceedings of the Conference on Scientific Computing, Computer Arithmetic, and Validated Numerics (SCAN’15). Springer, Cham, 126--137.
[23]
R. Iakymchuk, D. Defour, S. Collange, and S. Graillat. 2015. Reproducible triangular solvers for high-performance computing. In Proceedings of the International Conference on Information Technology - New Generations (ITNG’15). 353--358.
[24]
W. Kahan. 1965. Pracniques: Further remarks on reducing truncation errors. Commun. ACM 8, 1 (Jan. 1965), 40.
[25]
D. E. Knuth. 1969. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA.
[26]
U. Kulisch. 2012. Computer Arithmetic and Validity: Theory, Implementation, and Applications (2nd ed.). Walter de Gruyter.
[27]
D. R. Lutz and C. N. Hinds. 2017. High-precision anchored accumulators for reproducible floating-point summation. In Proceedings of the Symposium on Computer Arithmetic (ARITH’17). 98--105.
[28]
J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. 2018. Handbook of Floating-Point Arithmetic (2nd ed.). Birkhäuser Basel. Retrieved from http://www.springer.com/us/book/9783319765259.
[29]
J. Riedy and J. Demmel. 2018. Augmented arithmetic operations proposed for IEEE-754 2018. In Proceedings of the Symposium on Computer Arithmetic (ARITH’18). 45--52.
[30]
S. M. Rump. 2009. Ultimately fast accurate summation. SIAM J. Sci. Comput. 31, 5 (Jan. 2009), 3466--3502.
[31]
S. M. Rump, T. Ogita, and S. Oishi. 2010. Fast high precision summation. Nonlin. Theor. Applic. IEICE 1 (2010), 2--24.


Information

Published In

ACM Transactions on Mathematical Software  Volume 46, Issue 3
September 2020
267 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/3410509

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 July 2020
Online AM: 07 May 2020
Accepted: 01 March 2020
Revised: 01 February 2020
Received: 01 June 2016
Published in TOMS Volume 46, Issue 3


Author Tags

  1. Reproducible summation
  2. binned number
  3. binned summation
  4. computer arithmetic
  5. floating point number
  6. floating point summation
  7. parallel
  8. reproducibility
  9. summation

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Darpa XDATA
  • HP
  • Nokia
  • DOE Computational Science Graduate Fellowship
  • Mathworks
  • DARPA
  • ASPIRE Lab
  • LGE
  • Samsung
  • Cray
  • NSF
  • Intel
  • DOE
  • Intel ITSC
  • Google
  • Huawei
  • NVIDIA
  • Oracle
  • Aramco

Article Metrics

  • Downloads (last 12 months): 467
  • Downloads (last 6 weeks): 44
Reflects downloads up to 24 Oct 2024

Cited By

  • (2024) Integration of Posit Arithmetic in RISC-V Targeting Low-Power Computations. 2024 IEEE 24th International Conference on Nanotechnology (NANO), 352-357. DOI: 10.1109/NANO61778.2024.10628939. Online publication date: 8-Jul-2024.
  • (2024) Useful applications of correctly-rounded operators of the form ab + cd + e. 2024 IEEE 31st Symposium on Computer Arithmetic (ARITH), 32-39. DOI: 10.1109/ARITH61463.2024.00015. Online publication date: 10-Jun-2024.
  • (2023) Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 1748-1756. DOI: 10.1145/3624062.3624256. Online publication date: 12-Nov-2023.
  • (2023) Comparison of Reproducible Parallel Preconditioned BiCGSTAB Algorithm Based on ExBLAS and ReproBLAS. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 46-54. DOI: 10.1145/3578178.3578234. Online publication date: 27-Feb-2023.
  • (2023) mlf-core: a framework for deterministic machine learning. Bioinformatics 39, 4. DOI: 10.1093/bioinformatics/btad164. Online publication date: 2-Apr-2023.
  • (2023) Multi-level parallel multi-layer block reproducible summation algorithm. Parallel Computing 115, 102996. DOI: 10.1016/j.parco.2023.102996. Online publication date: Feb-2023.
  • (2023) ddRingAllreduce: a high-precision RingAllreduce algorithm. CCF Transactions on High Performance Computing 5, 3, 245-257. DOI: 10.1007/s42514-023-00150-2. Online publication date: 5-Jul-2023.
  • (2023) Improving accuracy of summation using parallel vectorized Kahan's and Gill-Møller algorithms. Concurrency and Computation: Practice and Experience 35, 23. DOI: 10.1002/cpe.7763. Online publication date: 10-May-2023.
  • (2022) Proposed Consistent Exception Handling for the BLAS and LAPACK. 2022 IEEE/ACM Sixth International Workshop on Software Correctness for HPC Applications (Correctness), 1-9. DOI: 10.1109/Correctness56720.2022.00006. Online publication date: Nov-2022.
  • (2022) Parallel Vectorized Implementations of Compensated Summation Algorithms. Parallel Processing and Applied Mathematics, 63-74. DOI: 10.1007/978-3-031-30445-3_6. Online publication date: 11-Sep-2022.
