Reproducible Linear Algebra from
Application to Architecture
Jason Riedy (GT) James Demmel (UCB) Peter Ahrens (MIT)
SIAM PP, 15 February 2020
Outline
Motivation: Why worry?
Higher level methods & libraries
Implementations and architectures
Closing
Reproducible Linear Algebra, Alg. to Arch. — SIAM PP20 1/21
Motivation: Why worry?
Informal UCB survey
Sca/LAPACK prospectus (2006) identified usefulness of
reproducibility. A request appeared on NA-Digest in 2010.
So Dr. Demmel asked around 100 relevant UCB faculty if
“reproducibility” matters. [1]
• Typical response: Debugging.
• Atypical responses:
  • Error analysis! (One guess...)
  • Investigating rare instances in simulated data, so
    must be reproducible.
  • UN uses my code to detect nuclear tests.
[1] Demmel, Riedy, Ahrens. “Reproducible BLAS: Make Addition Associative Again!” SIAM News, Oct. 2018
(Some) kinds of reproducibility
• Debugging
  • Just developer’s platform.
  • Then users occur.
• Investigate rare instance
  • Small job, similar to debugging.
  • Larger? # proc changes.
• Schrödinger’s nuke, climate negotiations, ...
  • Likely little control over the runtime environment.
• Accounting, some finance
  • Legal: identical across history.
Growing “demand,” or at least interest
• Reproducibility is a tag on Supercomputing sessions.
  • Often means reproducible (“replicable”) experiments.
  • But many sessions and talks on numerical
    reproducibility.
• Requirement for journals like ACM TOMS
• Vendors want to reduce support calls.
  • New MATLAB™ version, new optimized BLAS, new
    cache hierarchy ⇒ “My results changed! Fix it!”
  • Compiler A evaluates constant f.p. expressions at run
    time, compiler B at compile time ⇒ “My results
    changed! Fix it!”
• Vendors (HW & SW) want to sell new versions.
What Stands in the Way?
1. Performance.
2. Shared definitions.
3. Performance.
4. “Ease of use.”
5. Performance.
(Photo: Summit, from OLCF at ORNL.)
Debugging run much longer than an email check ⇒ No.
Production run uses up an allocation ⇒ No.
Some help: processing cycles are in excess relative to
memory, and memory in turn relative to the interconnect.
Shared definitions: Exceptional behavior?
0 × NaN ⇒ NaN except when ⇒ 0?
0 × ∞ ⇒ NaN except when ⇒ 0?
GER: BLAS2 routine for A = A + α x yᵀ. In the reference
implementation:
• If α = 0, x yᵀ is not evaluated, no propagation.
• If y_k = 0, skip column, no propagation.
A symmetric update of a general matrix may not be a
symmetric update!
Some vendor libraries behave differently.
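The two shortcuts above can be sketched in a few lines. This is a plain-Python illustration written for this transcript, not the reference Fortran; the function names `ger_reference_style` and `ger_no_shortcuts` are hypothetical. With x = y = (NaN, 0), the shortcut version leaves a non-symmetric NaN pattern, while the version that evaluates every product does not.

```python
import math

def ger_reference_style(alpha, x, y, A):
    """In-place A += alpha * x * y^T with the two shortcuts described above."""
    if alpha == 0.0:              # shortcut 1: x y^T is never evaluated
        return A
    for j, yj in enumerate(y):
        if yj == 0.0:             # shortcut 2: skip the whole column
            continue
        t = alpha * yj
        for i, xi in enumerate(x):
            A[i][j] += xi * t
    return A

def ger_no_shortcuts(alpha, x, y, A):
    """Same update, but every product is evaluated, so NaN always propagates."""
    for j, yj in enumerate(y):
        for i, xi in enumerate(x):
            A[i][j] += alpha * xi * yj
    return A

v = [math.nan, 0.0]               # symmetric update: x = y = v
A_ref = ger_reference_style(1.0, v, v, [[0.0, 0.0], [0.0, 0.0]])
A_all = ger_no_shortcuts(1.0, v, v, [[0.0, 0.0], [0.0, 0.0]])
```

In `A_ref` the entry below the diagonal is NaN while the entry above it is still 0, because the second column was skipped; in `A_all` both off-diagonal entries are NaN.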
Other exceptional issues
ISAMAX, ICAMAX: Find entry of max. |magnitude|.
• Return NaN’s location only if it’s first.
• ICAMAX uses |real| + |imag|, which may overflow!
• Actual |complex| may differ (e.g. C++ v. C).
GEMM: BLAS3, C = α op(A) op(B) + βC
• If α = 0, then A, B do not propagate.
• If β = 0, then C does not...
• BUT traditionally β = 0 overwrites C.
More examples in Demmel, Gates, Henry, Li, Riedy, Tang, “A Proposal
for a Next-Generation BLAS,” http://goo.gl/D1UKnw.
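A small sketch of the index-of-max behavior described above, assuming a reference-style loop that keeps the first maximum and compares with a strict ">". The helper names (`iamax`, `cabs1`, `cabs`) and the 0-based indexing are choices made for this illustration, not the BLAS interface.

```python
import math

def iamax(x, magnitude):
    """0-based index of the first entry with the largest magnitude(x[i]),
    using the strict '>' comparison of a reference-style loop."""
    best, best_mag = 0, magnitude(x[0])
    for i in range(1, len(x)):
        m = magnitude(x[i])
        if m > best_mag:          # any comparison with NaN is False
            best, best_mag = i, m
    return best

def cabs1(z):
    """|real| + |imag|, the cheap magnitude; it can overflow to inf."""
    return abs(z.real) + abs(z.imag)

def cabs(z):
    """Euclidean magnitude, computed without intermediate overflow."""
    return math.hypot(z.real, z.imag)

nan_later = iamax([1.0, math.nan, 3.0], abs)   # NaN is passed over
nan_first = iamax([math.nan, 1.0, 3.0], abs)   # NaN sticks because it is first

z = [complex(1e308, 1e308), complex(1.2e308, 1.2e308)]
by_cabs1 = iamax(z, cabs1)        # both magnitudes overflow to inf, first wins
by_cabs = iamax(z, cabs)          # finite magnitudes, second entry is larger
```

The two magnitude definitions pick different entries for the same data, which is the kind of disagreement the slide warns about.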
Complex complexity
It gets worse.
• C99 & C11: Complex is infinite if either
component is, regardless of NaNs.
• Hence multiplication yields complex ∞, but
• “by hand” would yield NaN + iNaN.
• C standard defines 30+ line complex
multiplication.
• Vectorization tends to ignore this definition.
• Fortran, C++ do not define semantics!
• People expect their particular compiler’s
result.
• Division is at least as bad.
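To make the "by hand" case concrete, here is a minimal sketch that multiplies (∞ + i∞) by (1 + i0) with the textbook formula on plain floats. The function name `mul_by_hand` is hypothetical, and the C99 recovery logic itself is not reproduced; the last line only shows the "infinite if either component is" classification, using Python's `cmath.isinf`, which follows the same rule.

```python
import cmath
import math

def mul_by_hand(a, b, c, d):
    """(a + ib)(c + id) from the textbook formula, with no special-casing."""
    return (a * c - b * d, a * d + b * c)

# (inf + i inf) * (1 + i 0): inf * 0 produces NaN in both components.
re, im = mul_by_hand(math.inf, math.inf, 1.0, 0.0)
both_nan = math.isnan(re) and math.isnan(im)

# A value with one infinite part counts as infinite even if the other is NaN.
counts_as_infinite = cmath.isinf(complex(math.inf, math.nan))
```

A library that applies the C rules and one that vectorizes the textbook formula would therefore report an infinity and a NaN, respectively, for the same inputs.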
Higher level methods & libraries
Assuming “agreement” on exceptions...
(Some) current approaches:
• Specific platform reproducibility for debugging.
  • Intel CNR, NVIDIA
• Arbitrary precision / exact comp.
  • Not saying more on this.
• Correctly rounded results
  • ExBLAS
  • Not “faithful” rounding. One of two choices, but
    another implementation may choose the other.
• Reproducible accumulators
  • Very wide accumulators (Kulisch, ARM HPA)
  • Binned accumulators (ReproBLAS)
Vendor debugging support
Intel Conditional Numerical Reproducibility in MKL [2]:
• “[...] calls to Intel® MKL occur in a single executable”
• “the number of computational threads [...] remains constant
  [...]” with SSE/AVX selection routines
• Strict CNR Mode: Any # threads for GEMM, TRSM, and SYMM.
NVIDIA cuBLAS [3]:
• “By design, all CUBLAS API routines from a given toolkit version,
  generate the same bit-wise results at every run when executed
  on GPUs with the same architecture and the same number of
  SMs.”
• No promise across versions, can disable for performance
[2] https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
[3] https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
Correctly rounded results
Example: the ExBLAS https://exblas.lip6.fr/
• Deliver correctly rounded results for AXPY, SUM, DOT,
TRSV, GEMV, GEMM, ...
• Reproducible no matter parallelism, vectors,
distribution
• Used in fluid simulations with discontinuous
Galerkin (arXiv 1807.01971), parallel PCG
• Memory / communication limited.
Built on accumulators using exact transformations:
• twoSum: a + b ⇒ s + e, and
• twoProd: a × b ⇒ p + e.
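As an illustration of the twoProd transformation, the sketch below uses the classical Veltkamp/Dekker splitting instead of a fused multiply-add. That choice is an assumption made so the example runs on plain binary64 arithmetic; it is not necessarily how ExBLAS implements it, and it ignores overflow and underflow. The names `split` and `two_prod` are hypothetical.

```python
from fractions import Fraction

def split(a):
    """Veltkamp split of a binary64 value into hi + lo (two short halves)."""
    c = 134217729.0 * a           # 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo

def two_prod(a, b):
    """Return (p, e) with p = fl(a * b) and p + e == a * b exactly
    (assuming no overflow or underflow)."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = a_lo * b_lo - (((p - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return p, e

p, e = two_prod(0.1, 0.3)
# Exact check in rational arithmetic: the pair carries the full product.
is_exact = Fraction(p) + Fraction(e) == Fraction(0.1) * Fraction(0.3)
```

Because the pair (p, e) is exact, a later accumulation step can add both parts without losing the rounding error of the product.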
Reproducible accumulators: Fixed-point
Expose the accumulators. Can compose across uses.
• With a sufficiently wide accumulator, floating point
  becomes fixed point.
• Fixed point is reproducible.
• Kulisch superaccumulator: binary64 ⇒ >4000 bits
  • Various implementations and optimizations: e.g.
    Koenig et al. (ARITH17)
• ARM High-Precision Anchored (HPA) [4] number in SVE
  • Narrowed by data range, likely ≤ 200 bits
  • Kinda wide, kinda binned
• Int / fixed-point summation is much faster than f.p.
[4] Burgess et al., IEEE ToC 7/2019, doi: 10.1109/TC.2018.2855729
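The "floating point becomes fixed point" idea can be sketched with Python's unbounded integers standing in for a very wide accumulator. This is only a model of the principle, not a Kulisch or HPA implementation; `SCALE_BITS`, `to_fixed`, `fixed_point_sum`, and `naive_sum` are names invented for the example, and only finite inputs are handled.

```python
import itertools

SCALE_BITS = 1074                 # 2**-1074 is the smallest positive binary64 value

def to_fixed(x):
    """Exact integer n such that x == n * 2**-SCALE_BITS (finite x only)."""
    num, den = x.as_integer_ratio()      # den is a power of two, at most 2**1074
    return num * ((1 << SCALE_BITS) // den)

def fixed_point_sum(values):
    acc = 0                               # unbounded int plays the wide accumulator
    for x in values:
        acc += to_fixed(x)                # integer addition is associative
    return acc / (1 << SCALE_BITS)        # one correctly rounded conversion at the end

def naive_sum(values):
    s = 0.0
    for x in values:
        s += x                            # rounds after every addition
    return s

data = [1e16, 1.0, -1e16, 1.0]
naive_results = {naive_sum(p) for p in itertools.permutations(data)}
fixed_results = {fixed_point_sum(p) for p in itertools.permutations(data)}
```

Over all orderings of `data`, the left-to-right floating-point sum produces several different values, while the fixed-point accumulation always returns 2.0.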
Reproducible accumulators: Binned
Expose the accumulators. Example: ReproBLAS [5]
Goals for summation:
• Build on standard IEEE 754 binary ops
• Tunable precision, ≥ conventional sum
• Handle exceptions reproducibly
• One read-only pass, one reduction
• Minimal memory for tiling
• “Usable:” Can build higher-level and parallel
  routines (for some α, β, ...)
[5] Demmel, Ahrens, Nguyen https://bebop.cs.berkeley.edu/reproblas/
ReproBLAS binning
• p: Precision
• K: -fold (in)
• W: Width (in)

            bin32   bin64   bin128
W           13      40      100
K           3
Sum len.    2^33    2^64    2^124
7n flops for n sums. But twoSum again!
20% overhead for adding 1M #s on 1K XC30 processors
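The following is a deliberately simplified, one-bin illustration of the pre-rounding idea behind binned summation. It is not the ReproBLAS algorithm: there is a single bin with a fixed, hand-picked exponent `e`, no K-fold refinement, and no exception handling, so accuracy is limited to the bin's grid. The function name `prerounded_sum` and the parameter `e` are assumptions made for the example.

```python
import itertools

def prerounded_sum(values, e=20):
    """One-bin sketch: round each x to a multiple of 2**(e-52), then add.

    Assumes every |x| < 2**(e-1) and that the running sum stays below 2**e
    in magnitude, so each addition of the rounded pieces is exact.
    """
    M = 1.5 * 2.0 ** e            # anchor; M + x stays inside [2**e, 2**(e+1))
    acc = 0.0
    for x in values:
        q = (M + x) - M           # x rounded to the bin's grid; depends on x only
        acc += q                  # exact under the assumptions above
    return acc

data = [0.1, 0.2, 0.3, 1000.0, -1000.0, 0.7]
results = {prerounded_sum(p) for p in itertools.permutations(data)}
```

Since each rounded piece depends only on its own input and the pieces add without further rounding, every ordering of `data` yields the same bits, at the cost of dropping the low-order part of each summand.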
Implementations and architectures
Defining twoSum
Long history...
quickTwoSum(a, b) ⇒ (s, e), |a| ≥ |b|
1. s = a + b
2. e = b − (s − a)
twoSum(a, b) ⇒ (s, e)
1. s = a + b
2. t = s − a
3. e = (a − (s − t)) + (b − t)
Many sign re-arrangements possible.
Each generates different ±0, NaNs, ...
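The two definitions above translate directly into code. This sketch follows the operation order exactly as written on the slide and checks the defining property s + e = a + b in exact rational arithmetic; it assumes binary64 round-to-nearest and finite inputs without overflow.

```python
from fractions import Fraction

def quick_two_sum(a, b):
    """Requires |a| >= |b|. Returns (s, e) with s = fl(a + b), s + e == a + b."""
    s = a + b
    e = b - (s - a)
    return s, e

def two_sum(a, b):
    """No ordering requirement. Returns (s, e) with s = fl(a + b), s + e == a + b."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def is_exact(a, b, s, e):
    """True if the pair (s, e) represents a + b with no rounding error."""
    return Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b)

s, e = two_sum(1e-16, 1.0)        # the small addend is recovered in e
```

Here `s` is 1.0 and `e` holds the 1e-16 that the rounded sum dropped.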
Specifying twoSum ⇒ augmentedAddition
Finally standardized in IEEE 754-2019.
• augmentedAddition(a, b) ⇒ (s, e)
• augmentedSubtraction(a, b) ⇒ (d, e)
• augmentedMultiplication(a, b) ⇒ (p, e)
Fully specified exceptional cases including overflow,
underflow, invalid, and signed zeros.
“Funny” aspect: Round ties towards zero.
• Mode exists for decimal but not binary.
• Reproducibility: Rounding must not depend on
value!
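To see what ties-toward-zero changes, here is a small reference model built on exact rational arithmetic. It is a sketch for finite, non-overflowing inputs only: signed zeros, NaN, infinities, and the exception flags that the standard specifies are not modeled. The function name `augmented_addition_model` is hypothetical, and `math.nextafter` requires Python 3.9 or later.

```python
import math
from fractions import Fraction

def augmented_addition_model(a, b):
    """Model of (s, e): s is a + b rounded to nearest with ties toward zero,
    e is the exact remainder. Finite, non-overflowing inputs only."""
    exact = Fraction(a) + Fraction(b)
    s = exact.numerator / exact.denominator      # correctly rounded, ties to even
    toward_zero = math.nextafter(s, 0.0)
    if abs(exact - Fraction(toward_zero)) == abs(exact - Fraction(s)):
        s = toward_zero                          # exact tie: take the smaller magnitude
    e = float(exact - Fraction(s))
    return s, e

s, e = augmented_addition_model(0.1, 0.2)
```

The exact sum of the binary64 values 0.1 and 0.2 happens to be a tie: ordinary ties-to-even addition gives 0.30000000000000004, whereas this model returns s equal to the binary64 value 0.3 with e = 2**-55.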
augmented* operations (i)
• Rationale and test generation: Riedy, Demmel.
  “Augmented Arithmetic Operations Proposed for
  IEEE-754 2018.” ARITH 18, doi:
  10.1109/ARITH.2018.8464813
• Software:
  • FP implementation: Boldo, Lauter, Muller. “Emulating
    [ties→0] ‘augmented’ [...] operations using
    [ties→even] arithmetic.” HAL 2137968.
  • Int impl: Convincing myself it’s right...
  • No one should use these... Performance.
augmented* operations (ii)
• Hardware: There is “interest” from processor
designers...
• In a two-operation form, one for each piece.
• There need to be users / uses of $ importance.
Possible HW performance: single CPU core
Emulating augmentedAddition as two instructions
improves double-double:
Operation              Skylake   Haswell
Addition latency       −55%      −45%
Addition throughput    +36%      +18%

DDGEMM MFLOP/s from reduced insn dependencies:

Operation                     Intel Skylake        Intel Haswell
“Typical” implementation      1732 (≈ 1/37 DP)     1199 (≈ 1/45 DP)
Two-insn augmentedAddition    3344 (≈ 1/19 DP)     2283 (≈ 1/24 DP)

Dukhan, Riedy, Vuduc. “Wanted: Floating-point add round-off error instruction,” PMAA 2016, arXiv 1603.00491.
ReproBLAS dot product: 33% rate improvement,
only 2× slower than non-reproducible,
single node.
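For context on why the error-free addition matters for double-double, here is a minimal sketch of a double-double add in its common "sloppy" form. It is an illustration under stated assumptions, not the implementation measured above: values are (hi, lo) pairs of binary64 numbers, and the names `two_sum`, `quick_two_sum`, and `dd_add` are chosen for the example.

```python
def two_sum(a, b):
    """Error-free addition: returns (s, e) with s + e == a + b."""
    s = a + b
    t = s - a
    return s, (a - (s - t)) + (b - t)

def quick_two_sum(a, b):
    """Same as two_sum, but requires |a| >= |b|."""
    s = a + b
    return s, b - (s - a)

def dd_add(x, y):
    """Add two double-double values given as (hi, lo) pairs ("sloppy" variant)."""
    s, e = two_sum(x[0], y[0])    # the six-operation chain a hardware pair would shorten
    e += x[1] + y[1]              # fold in the low-order words
    return quick_two_sum(s, e)    # renormalize so |lo| is small relative to hi

hi, lo = dd_add((1.0, 0.0), (1e-20, 0.0))
```

The result keeps 1e-20 in the low word even though `1.0 + 1e-20` rounds to 1.0 in plain binary64. The dependent chain inside `two_sum` is presumably what the two-instruction form in the tables above shortens.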
Closing
• Even before the “interesting” parts, challenges in
reproducible linear algebra:
• Exceptional conditions and propagation
• Different complex arithmetics
• (Also: Signaling exceptions and more)
• Each approach has benefits and limitations.
• ExBLAS: Cannot easily compose, but easy to
understand.
• ReproBLAS: Composable (multipliers ±1, 0).
• Aside: Composing is useful for streaming data...
• Performance can be ok, so long as networks are
involved.
• Really need hardware for day-to-day use.
Upcoming issues
• Current work assumes a processor-centric view.
• Novel memory-centric processors may minimize
state, limiting accumulators.
• FPGAs near / in memory change engineering choices.
• Coming flood of 8-bit multipliers.
• Energy and space scale as bits².
• Machine learning “wants” them (bfloat16).
• Reproducibility and neuromorphic? Quantum?
(Pictured: Emu Chick; FPGAs & smart NICs; FPAA; coming soon...
Georgia Tech CRNCH Rogues Gallery, http://crnch.gatech.edu/rg/)
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 

Kürzlich hochgeladen (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

Reproducible Linear Algebra from Application to Architecture

  • 1. Reproducible Linear Algebra from Application to Architecture Jason Riedy (GT) James Demmel (UCB) Peter Ahrens (MIT) SIAM PP, 15 February 2020
  • 2. Outline: Motivation: Why worry? • Higher level methods & libraries • Implementations and architectures • Closing
  • 4. Informal UCB survey: Sca/LAPACK prospectus (2006) identified usefulness of reproducibility. A request appeared on NA-Digest in 2010. So Dr. Demmel asked around 100 relevant UCB faculty if “reproducibility” matters.¹ • Typical response: Debugging. • Atypical responses: • Error analysis! (One guess...) • Investigating rare instances in simulated data, so must be reproducible. • UN uses my code to detect nuclear tests. ¹ Demmel, Riedy, Ahrens. “Reproducible BLAS: Make Addition Associative Again!” SIAM News, Oct. 2018.
  • 5. (Some) kinds of reproducibility: • Debugging • Just developer’s platform. • Then users occur. • Investigate rare instance • Small job, similar to debugging. • Larger? # proc changes. • Schrödinger’s nuke, climate negotiations, ... • Likely little control over the runtime environment. • Accounting, some finance • Legal: identical across history.
  • 6. Growing “demand,” or at least interest: • Reproducibility is a tag on Supercomputing sessions. • Often means reproducible (“replicable”) experiments. • But many sessions and talks on numerical reproducibility. • Requirement for journals like ACM TOMS. • Vendors want to reduce support calls. • New MATLAB™ version, new optimized BLAS, new cache hierarchy ⇒ “My results changed! Fix it!” • Compiler A evaluates constant f.p. expressions at run time, compiler B at compile time ⇒ “My results changed! Fix it!” • Vendors (HW & SW) want to sell new versions.
  • 7. What Stands in the Way? 1. Performance. 2. Shared definitions. 3. Performance. 4. “Ease of use.” 5. Performance. (Summit photo from OLCF at ORNL.) Debugging run much longer than an email check ⇒ No. Production run uses up an allocation ⇒ No. Some help from the excess of processing cycles compared to memory, and of memory compared to interconnect.
  • 8. Shared definitions: Exceptional behavior? 0 × NaN ⇒ NaN except when ⇒ 0? 0 × ∞ ⇒ NaN except when ⇒ 0? GER: BLAS2 routine for A = A + α x yᵀ. In the reference implementation: • If α = 0, x yᵀ is not evaluated, no propagation. • If y_k = 0, skip column, no propagation. A symmetric update of a general matrix may not be a symmetric update! Some vendor libraries behave differently.
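The GER shortcuts on this slide can be sketched in a few lines. This is only an illustration in plain Python, not the reference BLAS source: the function names `ger_reference` and `ger_plain` are made up here, and the shortcut logic follows the two bullets above (quick return for α = 0, skip a column when y_k = 0).

```python
import math

def ger_reference(alpha, x, y, A):
    """A += alpha * x * y^T with the shortcuts described on the slide."""
    if alpha == 0.0:
        return A                      # quick return: x and y are never read
    for j, yj in enumerate(y):
        if yj == 0.0:
            continue                  # skip column j: x[i] * 0 is never formed
        t = alpha * yj
        for i, xi in enumerate(x):
            A[i][j] += xi * t
    return A

def ger_plain(alpha, x, y, A):
    """Same update with every product evaluated, so NaN always propagates."""
    for j, yj in enumerate(y):
        for i, xi in enumerate(x):
            A[i][j] += alpha * xi * yj
    return A

# A "symmetric" update: x and y are the same vector.
v = [0.0, math.nan]
skipping = ger_reference(1.0, v, v, [[0.0, 0.0], [0.0, 0.0]])
plain = ger_plain(1.0, v, v, [[0.0, 0.0], [0.0, 0.0]])

print(skipping)  # entry (0,1) is NaN but entry (1,0) stays 0.0
print(plain)     # both off-diagonal entries are NaN
```

With the shortcuts, column 0 is skipped because y_0 = 0, so the NaN in x never reaches entry (1,0), while entry (0,1) still receives 0 × NaN. The result is no longer symmetric even though x = y.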
  • 9. Other exceptional issues: ISAMAX, ICAMAX: Find entry of max. |magnitude|. • Return NaN’s location only if it’s first. • ICAMAX uses |real| + |imag|, which may overflow! • Actual |complex| may differ (e.g. C++ v. C). GEMM: BLAS3, C = α op(A) op(B) + βC • If α = 0, then A, B do not propagate. • If β = 0, then C does not... • BUT traditionally β = 0 overwrites C. More examples in Demmel, Gates, Henry, Li, Riedy, Tang, “A Proposal for a Next-Generation BLAS,” http://goo.gl/D1UKnw.
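A small sketch of the index-of-maximum issues above, assuming a reference-style loop that seeds the running maximum with the first entry and updates only on a strict `>` comparison. The helper `iamax` and the 0-based indexing are choices made for this illustration, not the BLAS interface.

```python
import math

def iamax(values, magnitude):
    """0-based index of the first entry with the largest magnitude."""
    best, best_val = 0, magnitude(values[0])
    for i in range(1, len(values)):
        v = magnitude(values[i])
        if v > best_val:              # any comparison with NaN is False
            best, best_val = i, v
    return best

def cabs1(z):
    """|real| + |imag|, the magnitude used by ICAMAX."""
    return abs(z.real) + abs(z.imag)

# NaN is reported only when it comes first.
nan_first = iamax([math.nan, 1.0, 2.0], abs)
nan_later = iamax([1.0, math.nan, 2.0], abs)

# |real| + |imag| and the true modulus can pick different entries.
z = [complex(5.0, 0.0), complex(3.0, 3.0)]
by_cabs1 = iamax(z, cabs1)
by_modulus = iamax(z, abs)

# |real| + |imag| can overflow although the modulus is finite.
big = complex(1e308, 1e308)
overflowed = cabs1(big)
modulus = abs(big)

print(nan_first, nan_later, by_cabs1, by_modulus, overflowed, modulus)
```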
  • 10. Complex complexity: It gets worse. • C99 & C11: Complex is infinite if either component is, regardless of NaNs. • Hence multiplication yields complex ∞, but • “by hand” would yield NaN + iNaN. • C standard defines 30+ line complex multiplication. • Vectorization tends to ignore this definition. • Fortran, C++ do not define semantics! • People expect their particular compiler’s result. • Division is at least as bad.
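The “by hand” case can be reproduced with the textbook four-multiply formula. The sketch below keeps complex numbers as `(real, imag)` tuples of floats so that no language-level complex arithmetic is involved; `mul_by_hand` and `is_c99_infinite` are names invented for this example, and the latter only encodes the C99 classification quoted on the slide.

```python
import math

def mul_by_hand(a, b):
    """(ar + i*ai) * (br + i*bi) using the textbook formula."""
    ar, ai = a
    br, bi = b
    return (ar * br - ai * bi, ar * bi + ai * br)

def is_c99_infinite(z):
    """C99 rule: a complex value is infinite if either part is infinite."""
    return math.isinf(z[0]) or math.isinf(z[1])

z = (math.inf, math.inf)   # an infinite complex value
w = (1.0, 0.0)             # multiplying by one
p = mul_by_hand(z, w)

print(p)                   # (nan, nan): inf * 0 poisons both parts
print(is_c99_infinite(z), is_c99_infinite(p))
```

An implementation following the C standard's longer multiplication procedure is expected to recover an infinite result here, which is the mismatch the slide points at.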
  • 11. Higher level methods & libraries
  • 12. Assuming “agreement” on exceptions... (Some) current approaches: • Specific platform reproducibility for debugging. • Intel CNR, NVIDIA • Arbitrary precision / exact comp. • Not saying more on this. • Correctly rounded results • ExBLAS • Not “faithful” rounding. One of two choices, but another implementation may choose the other. • Reproducible accumulators • Very wide accumulators (Kulisch, ARM HPA) • Binned accumulators (ReproBLAS)
  • 13. Vendor debugging support: Intel Conditional Numerical Reproducibility in MKL²: • “[...] calls to Intel® MKL occur in a single executable” • “the number of computational threads [...] remains constant [...]” with SSE/AVX selection routines • Strict CNR Mode: Any # threads for GEMM, TRSM, and SYMM. NVIDIA cuBLAS³: • “By design, all CUBLAS API routines from a given toolkit version, generate the same bit-wise results at every run when executed on GPUs with the same architecture and the same number of SMs.” • No promise across versions, can disable for performance. ² https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr ³ https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
  • 14. Correctly rounded results: Example: the ExBLAS https://exblas.lip6.fr/ • Deliver correctly rounded results for AXPY, SUM, DOT, TRSV, GEMV, GEMM, ... • Reproducible no matter parallelism, vectors, distribution • Used in fluid simulations with discontinuous Galerkin (arXiv 1807.01971), parallel PCG • Memory / communication limited. Built on accumulators using exact transformations: • twoSum: a + b ⇒ s + e, and • twoProd: a × b ⇒ p + e.
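As an illustration of the twoProd transformation named above, here is a sketch using Dekker's splitting in binary64. This is one well-known way to obtain the error term without a fused multiply-add; the slides do not say which variant ExBLAS uses, so treat the choice as an assumption. The identity holds only when no overflow or underflow occurs.

```python
from fractions import Fraction

SPLITTER = 134217729.0  # 2**27 + 1, the Veltkamp constant for binary64

def split(a):
    """Split a into hi + lo, each with at most 26 significant bits."""
    c = SPLITTER * a
    hi = c - (c - a)
    lo = a - hi
    return hi, lo

def two_prod(a, b):
    """Return (p, e) with p = fl(a * b) and p + e == a * b exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

a = 1.0 + 2.0 ** -30
p, e = two_prod(a, a)
print(p, e)  # the rounded product and the part that rounding discarded

# Check exactness with rational arithmetic.
exact = Fraction(a) * Fraction(a)
print(Fraction(p) + Fraction(e) == exact)
```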
  • 15. Reproducible accumulators: Fixed-point. Expose the accumulators. Can compose across uses. • With a sufficiently wide accumulator, floating point becomes fixed point. • Fixed point is reproducible. • Kulisch superaccumulator: binary64 ⇒ >4000 bits • Various implementations and optimizations: e.g. Koenig et al. (ARITH17) • ARM High-Precision Anchored (HPA)⁴ number in SVE • Narrowed by data range, likely ≤ 200 bits • Kinda wide, kinda binned • Int / fixed-point summation is much faster than f.p. ⁴ Burgess et al., IEEE ToC 7/2019, doi: 10.1109/TC.2018.2855729
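The idea that a wide enough accumulator turns floating point into fixed point can be mimicked with Python's unbounded integers. This is a toy model of a superaccumulator, not any of the cited implementations: `to_fixed` and `fixed_sum` are hypothetical helpers, inputs are assumed finite, and the scale 2^-1074 is chosen because it is the smallest positive binary64 value.

```python
import itertools

SCALE_BITS = 1074  # every finite binary64 value is a multiple of 2**-1074

def to_fixed(x):
    """Exact integer count of 2**-1074 units in the finite float x."""
    n, d = x.as_integer_ratio()        # x == n / d exactly, d a power of two
    return n * ((1 << SCALE_BITS) // d)

def fixed_sum(values):
    """Sum exactly in integer arithmetic, then round once at the end."""
    acc = 0
    for v in values:
        acc += to_fixed(v)             # integer addition is associative
    return acc / (1 << SCALE_BITS)     # one correctly rounded conversion

data = [1e100, 1.0, -1e100]
orders = list(itertools.permutations(data))

naive = {sum(order) for order in orders}
fixed = {fixed_sum(order) for order in orders}

print(naive)  # more than one answer, depending on summation order
print(fixed)  # a single answer for every order
```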
  • 16. Reproducible accumulators: Binned. Expose the accumulators. Example: ReproBLAS⁵. Goals for summation: • Build on standard IEEE 754 binary ops • Tunable precision, ≥ conventional sum • Handle exceptions reproducibly • One read-only pass, one reduction • Minimal memory for tiling • “Usable:” Can build higher-level and parallel routines (for some α, β, ...) ⁵ Demmel, Ahrens, Nguyen https://bebop.cs.berkeley.edu/reproblas/
  • 17. ReproBLAS binning: • p: Precision • K : -fold (in) • W: Width (in)
                 bin32   bin64   bin128
      W          13      40      100
      K          3 (all)
      Sum len.   2^33    2^64    2^124
    7n flops for n sums. But twoSum again! 20% overhead for adding 1M #s on 1K XC30 processors
  • 19. Defining twoSum: Long history... quickTwoSum(a, b) ⇒ (s, e), requiring |a| ≥ |b|: 1. s = a + b; 2. e = b − (s − a). twoSum(a, b) ⇒ (s, e): 1. s = a + b; 2. t = s − a; 3. e = (a − (s − t)) + (b − t). Many sign re-arrangements possible. Each generates different ±0, NaNs, ...
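The two definitions on this slide translate directly into binary64 code. The sketch below follows the slide's steps line for line and checks the defining property s + e = a + b with exact rational arithmetic; it assumes finite inputs and no overflow, and the Python function names are just transliterations.

```python
from fractions import Fraction

def quick_two_sum(a, b):
    """Requires |a| >= |b|. Returns (s, e) with s = fl(a + b), s + e == a + b."""
    s = a + b
    e = b - (s - a)
    return s, e

def two_sum(a, b):
    """No ordering requirement. Returns (s, e) with s + e == a + b exactly."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

s1, e1 = quick_two_sum(1.0, 1e-16)   # ordered: |a| >= |b|
s2, e2 = two_sum(1e-16, 1.0)         # unordered input is fine here

print(s1, e1)  # the small addend is lost from s but recovered in e
print(s2, e2)

exact = Fraction(1.0) + Fraction(1e-16)
print(Fraction(s1) + Fraction(e1) == exact, Fraction(s2) + Fraction(e2) == exact)
```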
  • 20. Specifying twoSum ⇒ augmentedAddition: Finally standardized in IEEE 754-2019. • augmentedAddition(a, b) ⇒ (s, e) • augmentedSubtraction(a, b) ⇒ (d, e) • augmentedMultiplication(a, b) ⇒ (p, e) Fully specified exceptional cases including overflow, underflow, invalid, and signed zeros. “Funny” aspect: Round ties towards zero. • Mode exists for decimal but not binary. • Reproducibility: Rounding must not depend on value!
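To see what rounding ties toward zero changes, the following sketch emulates the rounding behavior of augmentedAddition with exact rational arithmetic. It is a slow model for finite inputs whose sum does not overflow, not an implementation of the standard: the exceptional cases listed above are deliberately left out, and `augmented_addition` is a name chosen for this example. It needs Python 3.9 or later for `math.nextafter`.

```python
import math
from fractions import Fraction

def augmented_addition(a, b):
    """Model of (s, e): s is a + b rounded to nearest, ties toward zero."""
    exact = Fraction(a) + Fraction(b)
    s = float(exact)                      # round to nearest, ties to even
    err = exact - Fraction(s)
    if err != 0:
        # The neighbouring float on the other side of the exact sum.
        other = math.nextafter(s, math.inf if err > 0 else -math.inf)
        is_tie = abs(exact - Fraction(other)) == abs(err)
        if is_tie and abs(other) < abs(s):
            s = other                     # break the tie toward zero
    e = float(exact - Fraction(s))        # remainder, exact in this example
    return s, e

a = 1.0 + 2.0 ** -52
b = 2.0 ** -53                            # exactly half an ulp of a
s, e = augmented_addition(a, b)

print(a + b == 1.0 + 2.0 ** -51)          # ties-to-even rounds away from zero here
print(s == a, e == b)                     # ties-to-zero keeps a and returns b as the error
```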
  • 21. augmented* operations (i): • Rationale and test generation: Riedy, Demmel. “Augmented Arithmetic Operations Proposed for IEEE-754 2018.” ARITH 18, doi: 10.1109/ARITH.2018.8464813 • Software: • FP implementation: Boldo, Lauter, Muller. “Emulating [ties→0] ‘augmented’ [...] operations using [ties→even] arithmetic.” HAL 2137968. • Int impl: Convincing myself it’s right... • No one should use these... Performance.
  • 22. augmented* operations (ii): • Hardware: There is “interest” from processor designers... • In a two-operation form, one for each piece. • There need to be users / uses of $ importance.
  • 23. Possible HW performance: single CPU core. Emulating augmentedAddition as two instructions improves double-double:
      Operation             Skylake   Haswell
      Addition latency      −55%      −45%
      Addition throughput   +36%      +18%
    DDGEMM MFLOP/s from reduced insn dependencies:
      Operation                    Intel Skylake       Intel Haswell
      “Typical” implementation     1732 (≈ 1/37 DP)    1199 (≈ 1/45 DP)
      Two-insn augmentedAddition   3344 (≈ 1/19 DP)    2283 (≈ 1/24 DP)
    Dukhan, Riedy, Vuduc. “Wanted: Floating-point add round-off error instruction,” PMAA 2016, arXiv 1603.00491. ReproBLAS dot product: 33% rate improvement, only 2× slower than non-reproducible, single node.
  • 25. Closing: • Even before the “interesting” parts, challenges in reproducible linear algebra: • Exceptional conditions and propagation • Different complex arithmetics • (Also: Signaling exceptions and more) • Each approach has benefits and limitations. • ExBLAS: Cannot easily compose, but easy to understand. • ReproBLAS: Composable (multipliers ±1, 0). • Aside: Composing is useful for streaming data... • Performance can be ok, so long as networks are involved. • Really need hardware for day-to-day use.
  • 26. Upcoming issues: • Current work assumes a processor-centric view. • Novel memory-centric processors may minimize state, limiting accumulators. • FPGAs near / in memory change engineering choices. • Coming flood of 8-bit multipliers. • Energy and space scale as bits². • Machine learning “wants” them (bfloat16). • Reproducibility and neuromorphic? Quantum? (Pictured: Emu Chick • FPGAs & smart NICs • FPAA • Coming soon...) Georgia Tech CRNCH Rogues Gallery http://crnch.gatech.edu/rg/