Research Article | Open Access

Algorithms for Efficient Reproducible Floating Point Summation

Published: 21 July 2020

Abstract

We define “reproducibility” as getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should not affect the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociative floating point addition, makes reproducibility challenging even for summation, or for operations like those in the BLAS. We describe a “reproducible accumulator” data structure (the “binned number”) and associated algorithms to reproducibly sum binary floating point numbers, independent of summation order. We use a subset of the IEEE Floating Point Standard 754-2008 and bitwise operations on the standard representations in memory. Our approach requires only one read-only pass over the data, and one reduction in parallel, using a 6-word reproducible accumulator (more words can be used for higher accuracy), enabling standard tiling optimization techniques. Summing n words with a 6-word reproducible accumulator requires approximately 9n floating point operations (arithmetic, comparison, and absolute value) and approximately 3n bitwise operations. The final error bound with a 6-word reproducible accumulator and our default settings can be up to 2^29 times smaller than the error bound for conventional (recursive) summation on ill-conditioned double-precision inputs.
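To make the core idea concrete, here is a minimal, one-level sketch in C of the extraction (pre-rounding) step that makes a sum order-independent: each summand's high-order bits relative to a common power-of-two split point are accumulated exactly, so the accumulated value cannot depend on the order of the additions. This is an illustration only, not the paper's 6-word binned accumulator or its exact algorithm; the function name repro_sum_high and the particular choice of the split point sigma are assumptions made for this example.

/*
 * Illustrative one-level sketch of reproducible (order-independent)
 * summation via extraction of high-order parts.  Assumes IEEE-754
 * double precision, round-to-nearest, no overflow, and strict
 * evaluation (compile without fast-math optimizations).
 */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Reproducible leading part of sum(x[0..n-1]).  The name and the exact
 * choice of sigma are illustrative, not the paper's. */
static double repro_sum_high(const double *x, size_t n)
{
    /* Maximum magnitude: a reduction whose result is order-independent. */
    double amax = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = fabs(x[i]);
        if (a > amax) amax = a;
    }
    if (amax == 0.0) return 0.0;

    /* Power-of-two split point sigma >= 2 * n * amax.  High parts are then
     * multiples of ulp(sigma)/2 and their partial sums stay below sigma,
     * so every addition into sum_hi is exact and order does not matter. */
    int e;
    frexp(amax, &e);                              /* amax < 2^e */
    volatile double sigma = ldexp(1.0, e + (int)ceil(log2((double)n)) + 1);

    double sum_hi = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* q = fl((x[i] + sigma) - sigma): the bits of x[i] at or above
         * roughly ulp(sigma).  volatile keeps the compiler from
         * simplifying the expression algebraically. */
        volatile double t = x[i] + sigma;
        double q = t - sigma;
        sum_hi += q;          /* exact addition, hence reproducible */
        /* The remainder x[i] - q would feed the next, finer level in a
         * multi-level (binned) scheme; it is dropped in this sketch. */
    }
    return sum_hi;
}

int main(void)
{
    double x[] = { 1e16, 3.14, -2.71, 1e-8, -1e16, 42.0 };
    size_t n = sizeof x / sizeof x[0];
    double y[6];
    for (size_t i = 0; i < n; i++) y[i] = x[n - 1 - i];

    /* Same data in two orders: the extracted sums agree bitwise, while a
     * plain left-to-right sum generally would not. */
    printf("forward  = %.17g\n", repro_sum_high(x, n));
    printf("reversed = %.17g\n", repro_sum_high(y, n));
    return 0;
}

Each addition into sum_hi is exact because the high parts share a common grid (multiples of half an ulp of sigma) and the partial sums never exceed sigma, which is precisely why the result cannot depend on summation order; the extra words of the paper's reproducible accumulator recover the low-order bits that this single level discards.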

References

[1]
IEEE. 2008. IEEE standard for floating-point arithmetic. IEEE Std 754-2008 (Aug. 2008), 1--70.
[2]
Intel. 2018. Developer Reference for Intel® Math Kernel Library 2018 - C | Intel® Software. Retrieved from https://software.intel.com/en-us/download/developer-reference-for-intel-math-kernel-library-2018-c.
[3]
NVIDIA. 2018. NVIDIA® cuBLAS. Retrieved from http://docs.nvidia.com/cuda/cublas/index.html.
[4]
Intel. 2019. bfloat16 - Hardware Numerics Definition. Retrieved from https://software.intel.com/sites/default/files/managed/40/8b/bf16-hardware-numerics-definition-white-paper.pdf.
[5]
IEEE. 2019. IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (July 2019), 1--84.
[6]
R. Alverson. 1991. Integer division using reciprocals. In Proceedings of the Symposium on Computer Arithmetic (ARITH’91). 186--190.
[7]
A. Arteaga, O. Fuhrer, and T. Hoefler. 2014. Designing bit-reproducible portable high-performance applications. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’14). 1235--1244.
[8]
C. Chohra, P. Langlois, and D. Parello. 2015. Efficiency of reproducible level 1 BLAS. In Proceedings of the Conference on Scientific Computing, Computer Arithmetic, and Validated Numerics (SCAN’15). Springer, Cham, 99--108.
[9]
C. Chohra, P. Langlois, and D. Parello. 2016. Reproducible, accurately rounded and efficient BLAS. In Proceedings of the Euro-Par Parallel Processing Workshops. Springer, Cham, 609--620.
[10]
S. Collange, D. Defour, S. Graillat, and R. Iakymchuk. 2015. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Comput. 49 (Nov. 2015), 83--97.
[11]
T. J. Dekker. 1971. A floating-point technique for extending the available precision. Numer. Math. 18, 3 (June 1971), 224--242.
[12]
J. Demmel, G. Gopalakrishnan, M. Heroux, W. Keyrouz, and K. Sato. 2015. Reproducibility of high performance codes and simulations: Tools, techniques, debugging. In Proceedings of the SC 2015 Birds of a Feather Sessions. Retrieved from https://gcl.cis.udel.edu/sc15bof.php.
[13]
J. Demmel and Y. Hida. 2004. Accurate and efficient floating point summation. SIAM J. Sci. Comput. 25, 4 (Jan. 2004), 1214--1248.
[14]
J. Demmel and H. D. Nguyen. 2013. Fast reproducible floating-point summation. In Proceedings of the Symposium on Computer Arithmetic (ARITH’13). 163--172.
[15]
J. Demmel and H. D. Nguyen. 2015. Parallel reproducible summation. IEEE Trans. Comput. 64, 7 (July 2015), 2060--2070.
[16]
T. Granlund and P. L. Montgomery. 1994. Division by invariant integers using multiplication. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI’94). 61--72.
[17]
Y. Hida, X. S. Li, and D. H. Bailey. 2001. Algorithms for quad-double precision floating point arithmetic. In Proceedings of the Symposium on Computer Arithmetic (ARITH’01). 155--162.
[18]
N. Higham. 1993. The accuracy of floating point summation. SIAM J. Sci. Comput. 14, 4 (July 1993), 783--799.
[19]
N. Higham. 2002. Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics.
[20]
D. G. Hough. 2019. The IEEE standard 754: One for the history books. Computer 52, 12 (Dec. 2019), 109--112.
[21]
R. Iakymchuk, S. Collange, D. Defour, and S. Graillat. 2015. ExBLAS: Reproducible and accurate BLAS library. In Proceedings of the SC 2015 Numerical Reproducibility at Exascale Workshops (NRE’15). Retrieved from https://hal.archives-ouvertes.fr/hal-01202396.
[22]
R. Iakymchuk, D. Defour, S. Collange, and S. Graillat. 2015. Reproducible and accurate matrix multiplication. In Proceedings of the Conference on Scientific Computing, Computer Arithmetic, and Validated Numerics (SCAN’15). Springer, Cham, 126--137.
[23]
R. Iakymchuk, D. Defour, S. Collange, and S. Graillat. 2015. Reproducible triangular solvers for high-performance computing. In Proceedings of the International Conference on Information Technology - New Generations (ITNG’15). 353--358.
[24]
W. Kahan. 1965. Pracniques: Further remarks on reducing truncation errors. Commun. ACM 8, 1 (Jan. 1965), 40.
[25]
D. E. Knuth. 1969. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA.
[26]
U. Kulisch. 2012. Computer Arithmetic and Validity: Theory, Implementation, and Applications (2nd ed.). Walter de Gruyter.
[27]
D. R. Lutz and C. N. Hinds. 2017. High-precision anchored accumulators for reproducible floating-point summation. In Proceedings of the Symposium on Computer Arithmetic (ARITH’17). 98--105.
[28]
J.-M. Muller, N. Brunie, F. de Dinechin, C.-P. Jeannerod, M. Joldes, V. Lefèvre, G. Melquiond, N. Revol, and S. Torres. 2018. Handbook of Floating-Point Arithmetic (2nd ed.). Birkhäuser Basel. Retrieved from http://www.springer.com/us/book/9783319765259.
[29]
J. Riedy and J. Demmel. 2018. Augmented arithmetic operations proposed for IEEE-754 2018. In Proceedings of the Symposium on Computer Arithmetic (ARITH’18). 45--52.
[30]
S. M. Rump. 2009. Ultimately fast accurate summation. SIAM J. Sci. Comput. 31, 5 (Jan. 2009), 3466--3502.
[31]
S. M. Rump, T. Ogita, and S. Oishi. 2010. Fast high precision summation. Nonlin. Theor. Applic. IEICE 1 (2010), 2--24.


Information

Published In

ACM Transactions on Mathematical Software  Volume 46, Issue 3
September 2020
267 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/3410509

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 July 2020
Online AM: 07 May 2020
Accepted: 01 March 2020
Revised: 01 February 2020
Received: 01 June 2016
Published in TOMS Volume 46, Issue 3


Author Tags

  1. Reproducible summation
  2. binned number
  3. binned summation
  4. computer arithmetic
  5. floating point number
  6. floating point summation
  7. parallel
  8. reproducibility
  9. summation

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Darpa XDATA
  • HP
  • Nokia
  • DOE Computational Science Graduate Fellowship
  • Mathworks
  • DARPA
  • ASPIRE Lab
  • LGE
  • Samsung
  • Cray
  • NSF
  • Intel
  • DOE
  • Intel ITSC
  • Google
  • Huawei
  • NVIDIA
  • Oracle
  • Aramco

Article Metrics

  • Downloads (last 12 months): 467
  • Downloads (last 6 weeks): 44
Reflects downloads up to 24 Oct 2024

Cited By

  • (2024) Integration of Posit Arithmetic in RISC-V Targeting Low-Power Computations. 2024 IEEE 24th International Conference on Nanotechnology (NANO), 352-357. DOI: 10.1109/NANO61778.2024.10628939. Online publication date: 8-Jul-2024.
  • (2024) Useful applications of correctly-rounded operators of the form ab + cd + e. 2024 IEEE 31st Symposium on Computer Arithmetic (ARITH), 32-39. DOI: 10.1109/ARITH61463.2024.00015. Online publication date: 10-Jun-2024.
  • (2023) Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 1748-1756. DOI: 10.1145/3624062.3624256. Online publication date: 12-Nov-2023.
  • (2023) Comparison of Reproducible Parallel Preconditioned BiCGSTAB Algorithm Based on ExBLAS and ReproBLAS. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 46-54. DOI: 10.1145/3578178.3578234. Online publication date: 27-Feb-2023.
  • (2023) mlf-core: a framework for deterministic machine learning. Bioinformatics 39, 4. DOI: 10.1093/bioinformatics/btad164. Online publication date: 2-Apr-2023.
  • (2023) Multi-level parallel multi-layer block reproducible summation algorithm. Parallel Computing 115, 102996. DOI: 10.1016/j.parco.2023.102996. Online publication date: Feb-2023.
  • (2023) ddRingAllreduce: a high-precision RingAllreduce algorithm. CCF Transactions on High Performance Computing 5, 3, 245-257. DOI: 10.1007/s42514-023-00150-2. Online publication date: 5-Jul-2023.
  • (2023) Improving accuracy of summation using parallel vectorized Kahan's and Gill-Møller algorithms. Concurrency and Computation: Practice and Experience 35, 23. DOI: 10.1002/cpe.7763. Online publication date: 10-May-2023.
  • (2022) Proposed Consistent Exception Handling for the BLAS and LAPACK. 2022 IEEE/ACM Sixth International Workshop on Software Correctness for HPC Applications (Correctness), 1-9. DOI: 10.1109/Correctness56720.2022.00006. Online publication date: Nov-2022.
  • (2022) Parallel Vectorized Implementations of Compensated Summation Algorithms. Parallel Processing and Applied Mathematics, 63-74. DOI: 10.1007/978-3-031-30445-3_6. Online publication date: 11-Sep-2022.
