Making fitting in RooFit faster
Automated Parallel Computation of Collaborative Statistical Models
Patrick Bos
Sarajevo, 10 Sep 2018
Physics: Wouter Verkerke (PI), Vince Croft, Carsten Burgard
eScience: Patrick Bos (yours truly), Inti Pelupessy, Jisk Attema
RooFit: Collaborative Statistical Modeling
Collaborative Statistical Modeling
• RooFit: build models together
• Teams of 10-100 physicists
• Collaborations of ~3000 → ~100 teams
• 1 goal
• Pretty impressive to an outsider
Collaborative Statistical Modeling with RooFit
Making RooFit faster (~30x; ~hours → ~minutes)
• More efficient collaboration
• Faster iteration/debugging
• Faster feedback between teams
• Next level physics modeling ambitions, retaining interactive workflow
1. Complex likelihood models, e.g.
a) Higgs fit to all channels, ~200 datasets, O(1000) parameters, now O(few) hours
b) EFT framework: again 10-100x more expensive
2. Unbinned ML fits with very large data samples
3. Unbinned ML fits with MC-style numeric integrals
Higgs @ ATLAS
20k+ nodes, 125k hours
Expression tree of C++ objects for mathematical components (variables,
operators, functions, integrals, datasets, etc.)
Couple with data, event “observables”
Goals and Design: Make fitting in RooFit faster
Making fitting in RooFit faster: how?
Serial:
benchmarks show no obvious bottlenecks
RooFit already highly optimized (pre-calculation/memoization, MPFE)
Parallel
Faster fitting: (how) can we do it?
Levels of parallelism
1. Gradient (parameter partial derivatives) in minimizer
2. Likelihood
3. Integrals (normalization) & other expensive shared components
[Figure: the “Vector” of parallelizable work: likelihood split over events, likelihood split over (unequal) components, integrals etc.]
Faster fitting: (how) can we do it?
Heterogeneous: sizes, types
• Multiple strategies
• How to split up?
• Small components → need low latency/overhead
• Large components as well…
• How to divide over cores?
• Load balancing → task-based approach: work stealing
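The load-balancing argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names `static_makespan` and `dynamic_makespan` are hypothetical, task costs are assumed to be known numbers, and no real processes are started. It compares a fixed up-front split of unequal tasks with a queue from which idle workers pull the next task.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Static split: task i is assigned to worker i % n_workers up front.
// Returns the time at which the slowest worker finishes (the makespan).
double static_makespan(const std::vector<double>& costs, std::size_t n_workers) {
    assert(n_workers > 0);
    std::vector<double> busy(n_workers, 0.0);
    for (std::size_t i = 0; i < costs.size(); ++i) {
        busy[i % n_workers] += costs[i];
    }
    return *std::max_element(busy.begin(), busy.end());
}

// Queue-based assignment: whichever worker becomes idle first takes the
// next task from the queue, as in a task-based worker pool.
double dynamic_makespan(const std::vector<double>& costs, std::size_t n_workers) {
    assert(n_workers > 0);
    // Min-heap holding the time at which each worker becomes idle.
    std::priority_queue<double, std::vector<double>, std::greater<double>> idle_at;
    for (std::size_t w = 0; w < n_workers; ++w) idle_at.push(0.0);
    double makespan = 0.0;
    for (double cost : costs) {
        const double start = idle_at.top();  // earliest idle worker
        idle_at.pop();
        const double finish = start + cost;
        idle_at.push(finish);
        makespan = std::max(makespan, finish);
    }
    return makespan;
}
```

For task costs {8, 1, 8, 1, 1, 1} on 2 workers, the static split leaves one worker with both expensive tasks (makespan 17), while the queue-based assignment finishes at 10, which equals the ideal total/2.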
Design: MultiProcess task-stealing framework
Task-stealing, worker pool, executes Job tasks
No threads, process-based: “bipe” (BidirMMapPipe) handles fork, mmap, pipes
[Diagram: Master, Queue, and Workers 1, 2, ..., connected by bipes]
Master: main RooFit process, submits Jobs to queue, waits for results (or does other things in between)
worker loop:
1. Worker requests Job task
2. Queue pops task
3. Worker executes task
4. Worker sends result to Queue
... repeat ...
Job done: Queue sends results to Master on request
queue loop: act on input from Master or Workers (mainly to avoid loop in Master / user code)
template <class T> class MP::Vector : public T, public MP::Job
[Diagram: Serial class (likelihood, gradient, ...) → MP::Vector (with MP::Job and MP::TaskManager) → Parallelized class]
MultiProcess usage for devs
template <class T> class MP::Vector : public T, public MP::Job
class Parallel : public MP::Vector<Serial>
#include <cstddef>
#include <utility>
#include <vector>

using std::move;
using std::size_t;
using std::vector;

// Serial class: squares every element of x.
class xSquaredSerial {
public:
    xSquaredSerial(vector<double> x_init)
        : x(move(x_init))
        , x_squared(x.size()) {}
    virtual ~xSquaredSerial() = default;

    virtual void evaluate() {
        for (size_t ix = 0; ix < x.size(); ++ix) {
            x_squared[ix] = x[ix] * x[ix];
        }
    }

    vector<double> get_result() {
        evaluate();
        return x_squared;
    }

protected:
    vector<double> x;
    vector<double> x_squared;
};

// Parallelized class: one task per element of x.
// id, results, JobTask, get_manager() and gather_worker_results() are not
// declared on the slide; presumably they come from the RooFit::MultiProcess
// base classes.
class xSquaredParallel
    : public RooFit::MultiProcess::Vector<xSquaredSerial> {
public:
    xSquaredParallel(size_t N_workers, vector<double> x_init)
        : RooFit::MultiProcess::Vector<xSquaredSerial>(N_workers, x_init) {}

private:
    // executed on a worker for a single task
    void evaluate_task(size_t task) override {
        x_squared[task] = x[task] * x[task];
    }

public:
    void evaluate() override {
        if (get_manager()->is_master()) {
            // do necessary synchronization before work_mode
            // enable work mode: workers will start stealing work from queue
            get_manager()->set_work_mode(true);
            // master fills queue with tasks
            for (size_t task_id = 0; task_id < x.size(); ++task_id) {
                get_manager()->to_queue(JobTask(id, task_id));
            }
            // wait for task results back from workers to master
            gather_worker_results();
            // end work mode
            get_manager()->set_work_mode(false);
            // put gathered results in desired container (same as used in serial class)
            for (size_t task_id = 0; task_id < x.size(); ++task_id) {
                x_squared[task_id] = results[task_id];
            }
        }
    }
};
MultiProcess for users
vector<double> x {1, 4, 5, 6.48074};
xSquaredSerial xsq_serial(x);
size_t N_workers = 4;
xSquaredParallel xsq_parallel(N_workers, x);
// get the same results, but now faster:
xsq_serial.get_result();
xsq_parallel.get_result();
// use parallelized version in your existing functions
void some_function(xSquaredSerial* xsq);
some_function(&xsq_parallel); // no problem!
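The inheritance pattern behind the snippets above can be tried out without RooFit. The following is a minimal sketch, not the RooFit::MultiProcess implementation: `Job`, `Vector`, `run_tasks`, `XSquaredSerial` and `XSquaredParallel` are hypothetical stand-ins, and the "pool" simply runs tasks one after another in the same process. It only shows why a class deriving from `Vector<Serial>` can be passed to code written for `Serial`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for MP::Job: runs one task identified by its index.
class Job {
public:
    virtual ~Job() = default;
    virtual void evaluate_task(std::size_t task) = 0;
};

// Hypothetical stand-in for MP::Vector<T>: inherits the serial class T
// (so it is usable wherever a T is expected) and the Job interface.
template <class T>
class Vector : public T, public Job {
public:
    template <class... Args>
    explicit Vector(std::size_t n_workers, Args&&... args)
        : T(std::forward<Args>(args)...), n_workers_(n_workers) {}

protected:
    // A real implementation would hand tasks to a pool of worker processes;
    // this sketch runs them sequentially.
    void run_tasks(std::size_t n_tasks) {
        for (std::size_t task = 0; task < n_tasks; ++task) evaluate_task(task);
    }
    std::size_t n_workers_;  // stored only for illustration
};

class XSquaredSerial {
public:
    explicit XSquaredSerial(std::vector<double> x_init)
        : x(std::move(x_init)), x_squared(x.size()) {}
    virtual ~XSquaredSerial() = default;
    virtual void evaluate() {
        for (std::size_t ix = 0; ix < x.size(); ++ix) x_squared[ix] = x[ix] * x[ix];
    }
    std::vector<double> get_result() {
        evaluate();  // virtual: dispatches to the parallel version if present
        return x_squared;
    }

protected:
    std::vector<double> x;
    std::vector<double> x_squared;
};

class XSquaredParallel : public Vector<XSquaredSerial> {
public:
    XSquaredParallel(std::size_t n_workers, std::vector<double> x_init)
        : Vector<XSquaredSerial>(n_workers, std::move(x_init)) {}
    void evaluate() override { run_tasks(x.size()); }

private:
    void evaluate_task(std::size_t task) override {
        x_squared[task] = x[task] * x[task];
    }
};

// Existing code written against the serial class keeps working.
std::vector<double> some_function(XSquaredSerial* xsq) { return xsq->get_result(); }
```

Because `get_result()` calls the virtual `evaluate()`, passing `&parallel` to `some_function` runs the task-based version while the calling code stays unchanged.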
Parallel performance (MPFE & MP)
Likelihood fits (unbinned, binned)
Numerical integrals
Gradients
Parallel likelihood fits: unbinned, MPFE
Before: max ~2x
Now (with CPU affinity fixed): max ~20x (more for larger fits)
[Plot: run-time vs N(cores); actual performance compared with expected performance (ideal parallelization)]
Parallel likelihood fits: binned
[Plot: run-time vs N(cores) in binned fits; actual performance compared with expected performance (ideal parallelization) and CPU time (single core)]
Room for improvement (WIP)
Gradient parallelization
0th step: get Minuit to use external derivative
1st step: replicate Minuit2 behavior
• NumericalDerivator (Lorenzo)
• Modified to exactly (floating point bit-wise) replicate Minuit2
• → RooGradMinimizer
2nd step: calculate partial derivative for each parameter in parallel
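Why the gradient parallelizes per parameter can be seen in a small sketch. Note the assumptions: this uses a plain central difference with a fixed step `h`, which is not the Minuit2 numerical derivative algorithm that the slides replicate, and the names `Function`, `partial_derivative` and `gradient` are hypothetical. The point is only that each partial derivative needs nothing from the others, so each call could be one task.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Function = std::function<double(const std::vector<double>&)>;

// One independent task: central-difference estimate of df/dx_i at x.
// x is taken by value so that each task works on its own copy.
double partial_derivative(const Function& f, std::vector<double> x,
                          std::size_t i, double h = 1e-6) {
    assert(i < x.size());
    const double x_i = x[i];
    x[i] = x_i + h;
    const double f_plus = f(x);
    x[i] = x_i - h;
    const double f_minus = f(x);
    return (f_plus - f_minus) / (2.0 * h);
}

// Serial driver: a parallel version would submit one task per parameter
// to a worker pool instead of looping here.
std::vector<double> gradient(const Function& f, const std::vector<double>& x) {
    std::vector<double> grad(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        grad[i] = partial_derivative(f, x, i);
    }
    return grad;
}
```

Each task costs two function (likelihood) evaluations, so with O(1000) parameters there is plenty of independent work to distribute.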
Gradient parallelization
First benchmarks (yesterday):
ggF workspace (Carsten), migrad fit
scaling not perfect and erratic (+/- 5s)
similar to what we saw for likelihoods without CPU pinning
probably due to too much synchronization
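Minimizer                    Workers   Run-time
RooMinimizer                 -         28s
MultiProcess GradMinimizer   1         33s
MultiProcess GradMinimizer   2         20s
MultiProcess GradMinimizer   3         15s
MultiProcess GradMinimizer   4         14s
MultiProcess GradMinimizer   6         17s (…)
MultiProcess GradMinimizer   8         11s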
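To put such timings in perspective, speedup and parallel efficiency can be computed relative to the single-worker run. This is a small sketch with hypothetical helper names (`speedup`, `efficiency`); it assumes the timings above are read as 33s for 1 worker and 11s for 8 workers.

```cpp
#include <cassert>

// Speedup of an N-worker run relative to the 1-worker run.
double speedup(double t_one_worker, double t_n_workers) {
    assert(t_n_workers > 0.0);
    return t_one_worker / t_n_workers;
}

// Parallel efficiency: 1.0 would be ideal parallelization.
double efficiency(double t_one_worker, double t_n_workers, unsigned n_workers) {
    assert(n_workers > 0);
    return speedup(t_one_worker, t_n_workers) / n_workers;
}
```

With those numbers, `speedup(33, 11)` is 3 and `efficiency(33, 11, 8)` is 0.375, well below the ideal value of 1, in line with the remark that scaling is not yet perfect.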
Let’s stay in touch
+31 (0)6 10 79 58 74
p.bos@esciencecenter.nl
www.esciencecenter.nl
egpbos
linkedin.com/in/egpbos
blog.esciencecenter.nl
Encore
Future work
Load balancing
PDF timings change dynamically due to RooFit precalculation strategies
… not a problem for numerical integrals
Analytical derivatives (automated? CLAD)
Numerical integrals
“Analytical” integrals
Forced numerical (Monte Carlo) integrals
(Higgs fits didn’t have them)
Numerical integrals
[Plot: individual NI timings (variation in runs and iterations), with maxima and minima; sum of slowest integrals/cores per iteration over the entire run (single core total runtime: 3.2s)]
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
Serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit)
Interface: subclass + MP
Define “vector elements”
Group elements into tasks (to be executed in parallel)
RooFit::MultiProcess::SharedArg<T>
RooFit::MultiProcess::TaskManager
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
RooFit::MultiProcess::SharedArg<T>
Normalization integrals or other shared expensive objects
Parallel task definition specific to type of object
… design in progress
RooFit::MultiProcess::TaskManager
Faster fitting: MultiProcess design
RooFit::MultiProcess::Vector<YourSerialClass>
RooFit::MultiProcess::SharedArg<T>
RooFit::MultiProcess::TaskManager
Queue gathers tasks and communicates with worker pool
Workers steal tasks from queue
Worker pool: forked processes (BidirMMapPipe)
• performant and already used in RooFit
• no thread-safety concerns
• instead: communication concerns
• … flexible design, implementation can be replaced (e.g. TBB)
Single core profiling and improvements
Faster fitting: single core profiling with Callgrind, Cachegrind, Instruments
Higgs ggF & 9 channel fits (workspaces by Lydia Brenner)
Most time spent on:
1. Memory access → RooVectorDataStore::get() (4% / 32%), 0.3% LL cache misses (expensive!)
• Row-wise access pattern on column-wise data store (and std::vector<std::vector>)
2. Logarithms: 12%
3. Interpolation → RooStats::HistFactory::FlexibleInterpVar (10%)
Faster fitting: single core improvements
RooLinkedList::findArg: ~ 5% of memory access instructions
RooLinkedList::At took considerable time in Gaussian test fit (Vince)
std::vector lookup → 1.6x speedup! WIP
Faster fitting: future work
Reorder tree evaluation → CPU cache use, vectorization
Smarter fitting (stochastic minimizer, analytical gradient, CLAD)
Front-end / back-end separation (e.g. TensorFlow back-end)
Faster fitting: single core profiling meta-conclusions
profiling functions & classes
valgrind
gprof
Instruments
… etc.
profiling objects (e.g. call-trees, e.g. RooFit…)
… DIY?
More Multi-Core
Parallel likelihood fits: existing RooFit implementation details
RooRealMPFE / BidirMMapPipe
Custom multi-process message passing protocol
• POSIX fork, pipe, mmap
Communication “overhead” (delay between sending and receiving messages): ~ 1e-4 seconds
• serverLoop waits for message & runs server-side code
• messages used sparingly
• data transfer over memory-mapped pipes
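The POSIX building blocks named above can be tried in isolation. The sketch below is not BidirMMapPipe: it uses only `fork` and a one-way `pipe` (no mmap, no message protocol), and the function name `sum_in_child` is hypothetical. It shows the basic process-based pattern of a forked worker computing a value and sending it back to the parent.

```cpp
#include <cassert>
#include <numeric>
#include <stdexcept>
#include <vector>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a child process that sums the values and sends the result back
// to the parent through a pipe.
double sum_in_child(const std::vector<double>& values) {
    int fds[2];  // fds[0]: read end, fds[1]: write end
    if (pipe(fds) != 0) throw std::runtime_error("pipe failed");

    const pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        throw std::runtime_error("fork failed");
    }

    if (pid == 0) {
        // Child ("worker"): has a copy of the parent's memory, including values.
        close(fds[0]);
        const double sum = std::accumulate(values.begin(), values.end(), 0.0);
        const ssize_t written = write(fds[1], &sum, sizeof sum);
        close(fds[1]);
        // _exit avoids running the parent's atexit handlers in the child.
        _exit(written == static_cast<ssize_t>(sizeof sum) ? 0 : 1);
    }

    // Parent ("master"): wait for the result message, then reap the child.
    close(fds[1]);
    double result = 0.0;
    const ssize_t n_read = read(fds[0], &result, sizeof result);
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    if (n_read != static_cast<ssize_t>(sizeof result)) {
        throw std::runtime_error("short read from child");
    }
    return result;
}
```

Because the child starts from a copy of the parent's address space, only the small result has to cross the pipe, which fits the "messages used sparingly" point above.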
TensorFlow experiments
Fits on identical model & data (single i7 machine)
TensorFlow: No pre-calculation / caching!
Major advantage of RooFit for binned fits (e.g. morphing histograms)
(feature request for memoization https://github.com/tensorflow/tensorflow/issues/5323)
N.B.: measured before CPU affinity fixing
RooFit now even faster (but limited to running on one machine)
               RooFit (MINUIT)   TensorFlow (BFGS)
Unbinned fit   0.1s              0.01 - 0.1s (dep. on precision)
Binned fit     0.7ms             2.3ms
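The pre-calculation / caching advantage mentioned above can be illustrated with a generic dirty-flag cache. This is a sketch of the general technique only, not RooFit's actual caching mechanism; the class `CachedLog` and its members are hypothetical. The expensive value is recomputed only when its input has changed since the last evaluation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical cached node: stores log(x) and recomputes it only when x changes.
class CachedLog {
public:
    explicit CachedLog(double x) : x_(x) {}

    void set_x(double x) {
        if (x != x_) {  // unchanged input keeps the cache valid
            x_ = x;
            dirty_ = true;
        }
    }

    double value() {
        if (dirty_) {
            cache_ = std::log(x_);  // the "expensive" evaluation
            ++n_evaluations_;
            dirty_ = false;
        }
        return cache_;
    }

    int n_evaluations() const { return n_evaluations_; }

private:
    double x_;
    double cache_ = 0.0;
    bool dirty_ = true;
    int n_evaluations_ = 0;
};
```

During a fit, many components keep the same inputs between successive likelihood evaluations, so a scheme like this skips most of the repeated work.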

  • 2. Automated Parallel Computation of Collaborative Statistical Models Physics: Wouter Verkerke (PI), Vince Croft, Carsten Burgard eScience: Patrick Bos (yours truly), Inti Pelupessy, Jisk Attema
  • 4. Collaborative Statistical Modeling • RooFit: build models together • Teams 10-100 physicists • Collaborations ~3000 → ~100 teams • 1 goal • Pretty impressive to an outsider
  • 5. Collaborative Statistical Modeling with RooFit. Making RooFit faster (~30x; ~h → ~m) • More efficient collaboration • Faster iteration/debugging • Faster feedback between teams • Next level physics modeling ambitions, retaining interactive workflow: 1. Complex likelihood models, e.g. a) Higgs fit to all channels, ~200 datasets, O(1000) parameters, now O(few) hours; b) EFT framework: again 10-100x more expensive. 2. Unbinned ML fits with very large data samples. 3. Unbinned ML fits with MC-style numeric integrals. Higgs @ ATLAS: 20k+ nodes, 125k hours. Expression tree of C++ objects for mathematical components (variables, operators, functions, integrals, datasets, etc.). Couple with data, event “observables”.
  • 6. Goals and Design: Make fitting in RooFit faster
  • 7. Making fitting in RooFit faster: how? Serial: benchmarks show no obvious bottlenecks RooFit already highly optimized (pre-calculation/memoization, MPFE) Parallel
  • 8. Faster fitting: (how) can we do it? Levels of parallelism 1. Gradient (parameter partial derivatives) in minimizer 2. Likelihood 3. Integrals (normalization) & other expensive shared components likelihood: events likelihood: (unequal) components integrals etc. “Vector”
  • 9. Faster fitting: (how) can we do it? Heterogeneous: sizes, types • Multiple strategies • How to split up? • Small components → need low latency/overhead • Large components as well… • How to divide over cores? • Load balancing → task-based approach: work stealing. likelihood: events; likelihood: (unequal) components; integrals etc.
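Why unequal components favour a task-based approach can be illustrated with a small timing model. This is only a sketch under stated assumptions: `static_makespan` and `queue_makespan` are hypothetical helpers (not RooFit code), task costs are known numbers, and communication overhead is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: total run time when the tasks are split up front
// into contiguous, equally sized blocks, one block per worker.
double static_makespan(const std::vector<double>& costs, std::size_t n_workers) {
    assert(n_workers > 0);
    const std::size_t block = (costs.size() + n_workers - 1) / n_workers;
    double slowest = 0.0;
    for (std::size_t w = 0; w < n_workers; ++w) {
        const std::size_t begin = w * block;
        const std::size_t end = std::min(costs.size(), (w + 1) * block);
        double busy = 0.0;
        for (std::size_t i = begin; i < end; ++i) {
            busy += costs[i];
        }
        slowest = std::max(slowest, busy);
    }
    return slowest;
}

// Hypothetical helper: total run time when every idle worker takes the
// next task from a shared queue, in queue order.
double queue_makespan(const std::vector<double>& costs, std::size_t n_workers) {
    assert(n_workers > 0);
    std::vector<double> busy_until(n_workers, 0.0);
    for (double cost : costs) {
        // the worker that becomes idle first picks up the next task
        auto next_idle = std::min_element(busy_until.begin(), busy_until.end());
        *next_idle += cost;
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

For task costs {8, 6, 1, 1, 1, 1, 1, 1} on two workers, the up-front split finishes after 16 time units, while the queue-based assignment finishes after 10.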
  • 10. Design: MultiProcess task-stealing framework. Task-stealing worker pool executes Job tasks. No threads, process-based: “bipe” (BidirMMapPipe) handles fork, mmap, pipes. Diagram: Master, Queue, Worker 1, Worker 2, ..., connected by bipes. Master: main RooFit process, submits Jobs to queue, waits for results (or does other things in between). Worker loop: Worker requests Job task; Queue pops task; Worker executes task; Worker sends result; Queue ... repeat ...; Job done: Queue sends to Master on request. Queue loop: act on input from Master or Workers (mainly to avoid loop in Master / user code). template <class T> class MP::Vector : public T, public MP::Job. Class diagram: Parallelized class, MP::Vector, MP::Job, MP::TaskManager, Serial class (likelihood, gradient..)
  • 11. MultiProcess usage for devs. template <class T> class MP::Vector : public T, public MP::Job; class Parallel : public MP::Vector<Serial>. Class diagram: Parallelized class, MP::Vector, MP::Job, MP::TaskManager, Serial class
  • 12. MultiProcess usage for devs
      class xSquaredSerial {
      public:
        xSquaredSerial(vector<double> x_init)
          : x(move(x_init))
          , x_squared(x.size()) {}
        virtual void evaluate() {
          for (size_t ix = 0; ix < x.size(); ++ix) {
            x_squared[ix] = x[ix] * x[ix];
          }
        }
        vector<double> get_result() {
          evaluate();
          return x_squared;
        }
      protected:
        vector<double> x;
        vector<double> x_squared;
      };

      class xSquaredParallel : public RooFit::MultiProcess::Vector<xSquaredSerial> {
      public:
        xSquaredParallel(size_t N_workers, vector<double> x_init)
          : RooFit::MultiProcess::Vector<xSquaredSerial>(N_workers, x_init) {}
      private:
        void evaluate_task(size_t task) override {
          result[task] = x[task] * x[task];
        }
      public:
        void evaluate() override {
          if (get_manager()->is_master()) {
            // do necessary synchronization before work_mode
            // enable work mode: workers will start stealing work from queue
            get_manager()->set_work_mode(true);
            // master fills queue with tasks
            for (size_t task_id = 0; task_id < x.size(); ++task_id) {
              get_manager()->to_queue(JobTask(id, task_id));
            }
            // wait for task results back from workers to master
            gather_worker_results();
            // end work mode
            get_manager()->set_work_mode(false);
            // put gathered results in desired container (same as used in serial class)
            for (size_t task_id = 0; task_id < x.size(); ++task_id) {
              x_squared[task_id] = results[task_id];
            }
          }
        }
      };

      template <class T> class MP::Vector : public T, public MP::Job
  • 13. MultiProcess for users
      vector<double> x {1, 4, 5, 6.48074};
      xSquaredSerial xsq_serial(x);
      size_t N_workers = 4;
      xSquaredParallel xsq_parallel(N_workers, x);

      // get the same results, but now faster:
      xsq_serial.get_result();
      xsq_parallel.get_result();

      // use parallelized version in your existing functions
      void some_function(xSquaredSerial* xsq);
      some_function(&xsq_parallel); // no problem!
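The inheritance pattern from slides 11 to 13 can be tried out without RooFit. The following is a minimal sketch under loud assumptions: everything in namespace `sketch` is a hypothetical stand-in, not the RooFit::MultiProcess API, and `run_tasks` simply executes the tasks one after another in the current process instead of sending them to a queue and worker pool.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical stand-in, not RooFit::MultiProcess

// Serial class, following the slides.
class XSquaredSerial {
public:
    explicit XSquaredSerial(std::vector<double> x_init)
        : x(std::move(x_init)), x_squared(x.size()) {}
    virtual ~XSquaredSerial() = default;

    virtual void evaluate() {
        for (std::size_t ix = 0; ix < x.size(); ++ix) x_squared[ix] = x[ix] * x[ix];
    }
    std::vector<double> get_result() { evaluate(); return x_squared; }

protected:
    std::vector<double> x;
    std::vector<double> x_squared;
};

// Minimal job interface: one unit of work per task index.
class Job {
public:
    virtual ~Job() = default;
    virtual void evaluate_task(std::size_t task) = 0;
};

// Mixin that inherits from both the serial class and Job.
template <class T>
class Vector : public T, public Job {
public:
    template <class... Args>
    explicit Vector(std::size_t n_workers, Args&&... args)
        : T(std::forward<Args>(args)...), n_workers_(n_workers) {
        assert(n_workers_ > 0);
    }

protected:
    // Stand-in for queue and worker pool: visits the tasks in the order in
    // which n_workers_ round-robin workers would take them, but serially.
    void run_tasks(std::size_t n_tasks) {
        for (std::size_t w = 0; w < n_workers_; ++w)
            for (std::size_t task = w; task < n_tasks; task += n_workers_)
                evaluate_task(task);
    }
    std::size_t n_workers_;
};

class XSquaredParallel : public Vector<XSquaredSerial> {
public:
    XSquaredParallel(std::size_t n_workers, std::vector<double> x_init)
        : Vector<XSquaredSerial>(n_workers, std::move(x_init)) {}

    void evaluate() override { run_tasks(x.size()); }

private:
    void evaluate_task(std::size_t task) override { x_squared[task] = x[task] * x[task]; }
};

// Existing code written against the serial interface keeps working.
inline double sum_of_squares(XSquaredSerial* xsq) {
    double sum = 0.0;
    for (double v : xsq->get_result()) sum += v;
    return sum;
}

}  // namespace sketch
```

Because `XSquaredParallel` derives from `XSquaredSerial` through the mixin, `sketch::sum_of_squares(&parallel)` compiles unchanged, which is the point made on slide 13.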
  • 14. Parallel performance (MPFE & MP) Likelihood fits (unbinned, binned) Numerical integrals Gradients
  • 15. Parallel likelihood fits: unbinned, MPFE. Before: max ~2x. Now (with CPU affinity fixed): max ~20x (more for larger fits). Plot: run-time vs N(cores); actual performance vs expected performance (ideal parallelization)
  • 16. Parallel likelihood fits: binned. Plot: run-time vs N(cores) in binned fits; actual performance, expected performance (ideal parallelization), CPU time (single core). Room for improvement (WIP)
  • 17. Gradient parallelization. 0th step: get Minuit to use external derivative. 1st step: replicate Minuit2 behavior • NumericalDerivator (Lorenzo) • Modified to exactly (floating point bit-wise) replicate Minuit2 • → RooGradMinimizer. 2nd step: calculate partial derivative for each parameter in parallel
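The "2nd step" works because each partial derivative is an independent task. The sketch below shows that decomposition with a plain central finite difference and a fixed step `h`. Both are assumptions made for illustration: this is not the NumericalDerivator or Minuit2 algorithm, and the function names are hypothetical. The loop in `gradient` runs serially here; in a parallel setting each call to `partial_derivative` would be handed to a worker.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Function = std::function<double(const std::vector<double>&)>;

// One task: partial derivative of f with respect to parameter i,
// by central finite difference with fixed step h (assumption).
double partial_derivative(const Function& f, std::vector<double> x, std::size_t i, double h) {
    assert(i < x.size() && h > 0.0);
    const double xi = x[i];
    x[i] = xi + h;
    const double f_plus = f(x);
    x[i] = xi - h;
    const double f_minus = f(x);
    return (f_plus - f_minus) / (2.0 * h);
}

// Driver: one independent task per parameter (executed serially in this sketch).
std::vector<double> gradient(const Function& f, const std::vector<double>& x, double h = 1e-6) {
    std::vector<double> grad(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        grad[i] = partial_derivative(f, x, i, h);
    }
    return grad;
}
```

Each task needs two function evaluations, so for a likelihood with O(1000) parameters the gradient dominates the cost of a minimizer step, which is what makes this level of parallelism attractive.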
  • 18. Gradient parallelization. First benchmarks (yesterday): ggF workspace (Carsten), migrad fit. Scaling is not perfect and is erratic (+/- 5s), similar to what we saw for likelihoods without CPU pinning; probably due to too much synchronization. Timings: RooMinimizer 28s; MultiProcess GradMinimizer with 1 worker 33s, 2 workers 20s, 3 workers 15s, 4 workers 14s, 6 workers 17s (…), 8 workers 11s
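To quantify "scaling not perfect", the listed timings can be turned into a speedup and a parallel efficiency. The helpers below are hypothetical, and the pairing of timings with worker counts (33s for 1 worker, 11s for 8 workers) is an assumption read off the order in which the slide lists them.

```cpp
#include <cassert>
#include <cstddef>

// Speedup relative to a reference run time.
double speedup(double t_reference, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_reference / t_parallel;
}

// Parallel efficiency: speedup divided by the number of workers
// (1.0 would be ideal parallelization).
double efficiency(double t_reference, double t_parallel, std::size_t n_workers) {
    assert(n_workers > 0);
    return speedup(t_reference, t_parallel) / static_cast<double>(n_workers);
}
```

Under that assumed pairing, going from 33s to 11s is a speedup of 3 on 8 workers, i.e. an efficiency of 0.375.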
  • 19. Let’s stay in touch +31 (0)6 10 79 58 74 p.bos@esciencecenter.nl www.esciencecenter.nl egpbos linkedin.com/in/egpbos blog.esciencecenter.nl
  • 21. Future work. Load balancing: PDF timings change dynamically due to RooFit precalculation strategies … not a problem for numerical integrals. Analytical derivatives (automated? CLAD)
  • 22. Numerical integrals “Analytical” integrals Forced numerical (Monte Carlo) integrals (Higgs fits didn’t have them)
  • 23. Numerical integrals Maxima Individual NI timings (variation in runs and iterations) Minima Sum of slowest integrals/cores per iteration over the entire run (single core total runtime: 3.2s)
  • 24. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> Serial class: likelihood (e.g. RooNLLVar) or gradient (Minuit) Interface: subclass + MP Define ”vector elements” Group elements into tasks (to be executed in parallel) RooFit::MultiProcess::SharedArg<T> RooFit::MultiProcess::TaskManager
  • 25. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> RooFit::MultiProcess::SharedArg<T> Normalization integrals or other shared expensive objects Parallel task definition specific to type of object … design in progress RooFit::MultiProcess::TaskManager
  • 26. Faster fitting: MultiProcess design RooFit::MultiProcess::Vector<YourSerialClass> RooFit::MultiProcess::SharedArg<T> RooFit::MultiProcess::TaskManager Queue gathers tasks and communicates with worker pool Workers steal tasks from queue Worker pool: forked processes (BidirMMapPipe) • performant and already used in RooFit • no thread-safety concerns • instead: communication concerns • … flexible design, implementation can be replaced (e.g. TBB)
  • 27. Single core profiling and improvements
  • 28. Faster fitting: single core profiling with Callgrind, Cachegrind, Instruments. Higgs ggF & 9 channel fits (workspaces by Lydia Brenner). Most time spent on: 1. Memory access → RooVectorDataStore::get() (4% / 32%), 0.3% LL cache misses (expensive!) • Row-wise access pattern on column-wise data store (and std::vector<std::vector>) 2. Logarithms: 12% 3. Interpolation → RooStats::HistFactory::FlexibleInterpVar (10%)
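The access-pattern mismatch in item 1 can be made concrete with a toy column-wise store. This is a sketch, not RooVectorDataStore: `ColumnStore` and both functions are hypothetical, and the store is modelled as a `std::vector<std::vector<double>>` with one inner vector per observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-wise store: columns[c][r] holds observable c of event r.
using ColumnStore = std::vector<std::vector<double>>;

// Row-wise traversal: for every event, touch one element in each column.
// Consecutive reads land in different heap allocations.
double sum_row_wise(const ColumnStore& columns) {
    if (columns.empty()) return 0.0;
    const std::size_t n_rows = columns.front().size();
    double sum = 0.0;
    for (std::size_t r = 0; r < n_rows; ++r) {
        for (std::size_t c = 0; c < columns.size(); ++c) {
            assert(columns[c].size() == n_rows);
            sum += columns[c][r];
        }
    }
    return sum;
}

// Column-wise traversal: walk each column front to back, so consecutive
// reads are contiguous in memory.
double sum_column_wise(const ColumnStore& columns) {
    double sum = 0.0;
    for (const auto& column : columns) {
        for (double value : column) sum += value;
    }
    return sum;
}
```

Both functions visit the same elements; only the order differs. With many columns and many events, the row-wise order is the one that tends to produce the cache misses reported on the slide.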
  • 29. Faster fitting: single core improvements. RooLinkedList::findArg: ~ 5% of memory access instructions. RooLinkedList::At took considerable time in Gaussian test fit (Vince). std::vector lookup → 1.6x speedup! WIP
  • 30. Faster fitting: future work. Reorder tree evaluation → CPU cache use, vectorization. Smarter fitting (stochastic minimizer, analytical gradient, CLAD). Front-end / back-end separation (e.g. TensorFlow back-end)
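Why an indexed lookup is costly on a linked list and cheap on a vector can be seen from the standard containers. This sketch uses `std::list` as a stand-in for a linked list; it is not RooLinkedList, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

// Indexed access on a linked list: walks `index` nodes from the head,
// so the cost grows linearly with the index.
int at_linked(const std::list<int>& items, std::size_t index) {
    assert(index < items.size());
    return *std::next(items.begin(), static_cast<std::ptrdiff_t>(index));
}

// Indexed access on a vector: a single address computation.
int at_vector(const std::vector<int>& items, std::size_t index) {
    assert(index < items.size());
    return items[index];
}
```

Both return the same element; the difference is only in how many memory locations are visited on the way.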
  • 31. Faster fitting: single core profiling meta-conclusions profiling functions & classes valgrind gprof Instruments … etc. profiling objects (e.g. call-trees, e.g. RooFit…) … DIY?
  • 33. Parallel likelihood fits: existing RooFit implementation details RooRealMPFE / BidirMMapPipe Custom multi-process message passing protocol • POSIX fork, pipe, mmap Communication “overhead” (delay between sending and receiving messages): ~ 1e-4 seconds • serverLoop waits for message & runs server-side code • messages used sparingly • data transfer over memory-mapped pipes
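The fork-and-pipe message passing described above can be sketched with plain POSIX calls. This is an illustration only: `square_in_child` is a hypothetical helper, the "protocol" is a single `double` in each direction, and the memory-mapped part of BidirMMapPipe is left out entirely.

```cpp
#include <cassert>
#include <stdexcept>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: sends x to a forked child, which replies with x * x.
double square_in_child(double x) {
    int to_child[2];
    int to_parent[2];
    if (pipe(to_child) != 0 || pipe(to_parent) != 0) {
        throw std::runtime_error("pipe failed");
    }
    const pid_t pid = fork();
    if (pid < 0) throw std::runtime_error("fork failed");

    if (pid == 0) {
        // child ("server" side): wait for a message, run the calculation, reply
        close(to_child[1]);
        close(to_parent[0]);
        double input = 0.0;
        if (read(to_child[0], &input, sizeof input) == static_cast<ssize_t>(sizeof input)) {
            const double output = input * input;
            if (write(to_parent[1], &output, sizeof output) != static_cast<ssize_t>(sizeof output)) {
                _exit(1);
            }
        }
        _exit(0);
    }

    // parent ("master" side): send the input, then block until the reply arrives
    close(to_child[0]);
    close(to_parent[1]);
    double result = 0.0;
    const bool sent = write(to_child[1], &x, sizeof x) == static_cast<ssize_t>(sizeof x);
    const bool received =
        sent && read(to_parent[0], &result, sizeof result) == static_cast<ssize_t>(sizeof result);
    close(to_child[1]);
    close(to_parent[0]);
    int status = 0;
    const pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    (void)waited;  // silences the unused-variable warning when NDEBUG is set
    if (!received) throw std::runtime_error("pipe communication failed");
    return result;
}
```

Forking a fresh process per call, as done here for brevity, would be far too slow in practice; the slides describe long-lived workers with a server loop, so that only the message round trip is paid per evaluation.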
  • 34. TensorFlow experiments. Fits on identical model & data (single i7 machine). TensorFlow: No pre-calculation / caching! Major advantage of RooFit for binned fits (e.g. morphing histograms) (feature request for memoization https://github.com/tensorflow/tensorflow/issues/5323). N.B.: measured before CPU affinity fixing; RooFit now even faster (but limited to running on one machine). Timings, RooFit (MINUIT) vs TensorFlow (BFGS): unbinned fit 0.1s vs 0.01 - 0.1s (dep. on precision); binned fit 0.7ms vs 2.3ms
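The benefit of pre-calculation / caching mentioned above can be sketched with a value that is recomputed only when its input changes. `CachedValue` is a hypothetical class written for this illustration; it does not reproduce how RooFit tracks or propagates cache state.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical node that memoizes an expensive function of one parameter.
class CachedValue {
public:
    explicit CachedValue(std::function<double(double)> expensive)
        : expensive_(std::move(expensive)) {
        assert(static_cast<bool>(expensive_));
    }

    // Mark the cache as stale only if the parameter actually changed.
    void set_parameter(double p) {
        if (p != parameter_) {
            parameter_ = p;
            dirty_ = true;
        }
    }

    // Recompute on demand, otherwise return the stored value.
    double value() {
        if (dirty_) {
            cache_ = expensive_(parameter_);
            dirty_ = false;
            ++n_evaluations_;
        }
        return cache_;
    }

    int n_evaluations() const { return n_evaluations_; }

private:
    std::function<double(double)> expensive_;
    double parameter_ = 0.0;
    double cache_ = 0.0;
    bool dirty_ = true;
    int n_evaluations_ = 0;
};
```

In a fit, most parameters stay fixed between consecutive likelihood evaluations, so components that depend only on unchanged parameters can return their stored value instead of being recomputed.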