Introduction to machine learning terminology.
Applications within High Energy Physics and outside HEP.
* Basic problems: classification and regression.
* Nearest neighbours approach and spatial indices
* Overfitting (intro)
* Curse of dimensionality
* ROC curve, ROC AUC
* Bayes optimal classifier
* Density estimation: KDE and histograms
* Parametric density estimation
* Mixtures for density estimation and EM algorithm
* Generative approach vs discriminative approach
* Linear decision rule, intro to logistic regression
* Linear regression
1. Machine Learning in High Energy Physics
Lectures 1 & 2
Alex Rogozhnikov
Lund, MLHEP 2016
2. Intro notes
two tracks:
introductory course (this one)
advanced track: Mon, Tue, Wed, then two tracks are merged
Introductory track:
two lectures and two practice seminars on each day
Kaggle challenges
'Triggers' — only for advanced track, lasts for 3 days
'Higgs' — for both tracks, lasts for 7 days
Know the material? Spend more time on challenges!
3. Intro notes — 2
chat rooms
gitter
if you want to share something between teams — please do it publicly (via chat)
repository
glossary is in the repository
4. What is Machine Learning about?
a method of teaching computers to make and improve predictions or
behaviors based on some data?
a field of computer science, probability theory, and optimization theory
which allows complex tasks to be solved for which a logical/procedural
approach would not be possible or feasible?
a type of AI that provides computers with the ability to learn without being
explicitly programmed?
something in between statistics, AI, optimization theory, signal processing
and pattern matching?
5. What is Machine Learning about
Inference of statistical dependencies which gives us the ability to predict
6. What is Machine Learning about
Inference of statistical dependencies which gives us the ability to predict
Data is cheap, knowledge is precious
7. Machine Learning is used in
search engines
spam detection
security: virus detection, DDoS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn
prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
8. Machine Learning is used in
search engines
spam detection
security: virus detection, DDoS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn
prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
and hundreds more
9. Machine Learning in High Energy Physics
Triggers (LHCb, CMS to join soon)
Particle identification
Calibration
Tagging
Stripping line
Analysis
10. Machine Learning in High Energy Physics
Triggers (LHCb, CMS to join soon)
Particle identification
Calibration
Tagging
Stripping line
Analysis
At each stage different data is used and different information is inferred, but
the underlying ideas are quite similar.
11. General notion
In supervised learning the training data is represented as a set of pairs $(x_i, y_i)$
$i$ is an index of event
$x_i$ is a vector of features available for event $i$
$y_i$ is a target — the value we need to predict
features = observables = variables
12. Classification problem
$y_i \in Y$, where $Y$ is a finite set of labels.
Examples
particle identification based on information about track:
$x_i = (p, \eta, E, \text{charge}, \chi^2_{PV}, \text{FlightTime})$
$Y = \{\text{electron}, \text{muon}, \text{pion}, \ldots\}$
binary classification: $Y = \{0, 1\}$, 1 is signal, 0 is background
14. Regression problem
$y \in \mathbb{R}$
Examples:
predicting the price of a house by its position
predicting the number of customers / money income
reconstructing the real momentum of a particle
Why do we need automatic classification/regression?
in applications, up to thousands of features
higher quality
much faster adaptation to new problems
15. Classification based on nearest neighbours
Given a training set of objects and their labels $\{x_i, y_i\}$, we predict the label for a new observation $x$:
$$\hat{y} = y_j, \qquad j = \arg\min_i \rho(x, x_i)$$
Here and after, $\rho(x, \tilde{x})$ is the distance in the space of features.
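A minimal sketch of this rule using scikit-learn's nearest-neighbour classifier (the toy arrays `X_train`, `y_train` and the query point are placeholders, not the lecture's data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: two gaussian blobs standing in for real features
rng = np.random.RandomState(0)
X_train = np.vstack([rng.normal(0, 1, size=(100, 2)),
                     rng.normal(2, 1, size=(100, 2))])
y_train = np.array([0] * 100 + [1] * 100)

# n_neighbors=1 reproduces the rule above: copy the label of the closest event
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
print(clf.predict([[1.0, 1.0]]))
```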
16. Visualization of decision rule
Consider a classification problem with 2 features:
$x_i = (x_i^1, x_i^2)$, $\quad y_i \in Y = \{0, 1\}$
17. $k$ Nearest Neighbours ($k$NN)
A better way is to use $k$ neighbours:
$$p_{\tilde{y}}(x) = \frac{\#\ \text{of } k\text{NN events of } x \text{ in class } \tilde{y}}{k}$$
21. Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
22. Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
quality is lower when $k > 1$
23. Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
quality is lower when $k > 1$
this doesn't mean $k = 1$ is the best,
it means we cannot use training events to estimate quality
when a classifier's decision rule is too complex and captures details from the
training data that are not relevant to the distribution, we call this overfitting
(more details tomorrow)
25. $k$NN with weights
Average the neighbours' outputs with weights:
$$\hat{y} = \frac{\sum_{j \in \text{knn}(x)} w_j y_j}{\sum_{j \in \text{knn}(x)} w_j}$$
the closer the neighbour, the higher the weight of its contribution, e.g.:
$$w_j = 1 / \rho(x, x_j)$$
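A sketch of the same idea with scikit-learn, where `weights='distance'` corresponds to $w_j = 1/\rho(x, x_j)$ (the toy data below is a placeholder):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# weights='distance' averages the neighbours' labels with weights 1 / distance
clf = KNeighborsClassifier(n_neighbors=10, weights='distance')
clf.fit(X_train, y_train)
print(clf.predict_proba([[0.5, 0.5]]))
```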
26. Computational complexity
Given that the dimensionality of the space is $d$ and there are $n$ training samples:
training time: ~ O(save a link to the data)
prediction time: $O(n \times d)$ for each sample
28. Ball tree
training time ~ $O(d \times n \log(n))$
prediction time ~ $O(\log(n) \times d)$ for each sample
Another option exists: KD-tree.
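A sketch of building and querying a ball tree with scikit-learn (toy data as a placeholder):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X_train = rng.normal(size=(1000, 3))

# building the tree costs roughly O(d * n * log(n))
tree = BallTree(X_train)

# each query costs roughly O(d * log(n)) instead of the brute-force O(d * n)
dist, ind = tree.query(rng.normal(size=(5, 3)), k=10)
print(ind[0])  # indices of the 10 nearest training events for the first query
```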
29. Overview of $k$NN
Awesomely simple classifier and regressor
Provides too optimistic quality on training data
Quite slow, though optimizations exist
Too sensitive to scale of features
Hard times with data of high dimensions
30. Sensitivity to scale of features
Euclidean distance:
$$\rho(x, \tilde{x})^2 = (x_1 - \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \cdots + (x_d - \tilde{x}_d)^2$$
33. Distance function matters
Minkowski distance
$$\rho(x, \tilde{x})^p = \sum_l (x_l - \tilde{x}_l)^p$$
Canberra
$$\rho(x, \tilde{x}) = \sum_l \frac{|x_l - \tilde{x}_l|}{|x_l| + |\tilde{x}_l|}$$
Cosine metric
$$\rho(x, \tilde{x}) = \frac{\langle x, \tilde{x} \rangle}{|x|\,|\tilde{x}|}$$
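These metrics are available in `scipy.spatial.distance`; a short sketch with toy vectors (note that scipy's `cosine` returns one minus the cosine similarity shown above):

```python
import numpy as np
from scipy.spatial.distance import minkowski, canberra, cosine

x = np.array([1.0, 2.0, 3.0])
x_tilde = np.array([2.0, 0.0, 3.0])

print(minkowski(x, x_tilde, p=3))   # Minkowski distance with p = 3
print(canberra(x, x_tilde))         # sum_l |x_l - x~_l| / (|x_l| + |x~_l|)
print(cosine(x, x_tilde))           # 1 - <x, x~> / (|x| |x~|)
```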
34. Problems with high dimensions
With higher dimensions the neighbouring points are further away.
Example: consider $n$ training data points distributed uniformly in the unit cube, $d \gg 1$:
the expected number of points in a ball of radius $r$ is proportional to $r^d$
to collect the same number of $k$ neighbours, we need to take $r = \text{const}^{1/d} \to 1$
$k$NN suffers from the curse of dimensionality.
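A quick numeric illustration of the $r = \text{const}^{1/d} \to 1$ effect (the 1% fraction is an arbitrary choice for illustration):

```python
# radius needed for a ball to contain a fixed fraction (here 1%) of points
# distributed uniformly in the unit cube: volume ~ r^d, so r = fraction ** (1/d)
for d in [2, 10, 100, 1000]:
    r = 0.01 ** (1.0 / d)
    print(d, round(r, 3))
# prints roughly: 2 0.1, 10 0.631, 100 0.955, 1000 0.995
```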
35. Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red)
Which classifier provides better discrimination?
36. Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red)
Which classifier provides better discrimination?
Discrimination is identical in all three cases
39. ROC curve
These distributions have the same ROC curve:
(the ROC curve is the dependence of passed signal vs passed background)
40. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
41. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
42. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
ROC curve = information about order of events:
b b s b s b ... s s b s s
43. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
ROC curve = information about order of events:
b b s b s b ... s s b s s
Comparison of algorithms should be based on the information from ROC
curve.
46. ROC AUC (area under the ROC curve)
$$\text{ROC AUC} = P(r_b < r_s)$$
where $r_b$, $r_s$ are the predictions of random background and signal events.
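A sketch of computing the ROC curve and ROC AUC with scikit-learn (labels and predictions are toy placeholders):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=1000)                     # 1 = signal, 0 = background
predictions = y_true + rng.normal(scale=1.5, size=1000)   # toy classifier output

fpr, tpr, thresholds = roc_curve(y_true, predictions)  # background vs signal efficiency at each threshold
print(roc_auc_score(y_true, predictions))              # estimates P(r_b < r_s)
```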
47. Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very little background)
48. Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very little background)
Applications frequently demand a different metric.
50. Recapitulation
1. Statistical ML: applications and problems
2. ML in HEP
3. $k$ nearest neighbours classifier and regressor
4. ROC curve, ROC AUC
51. Statistical Machine Learning
Machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probabilistic distribution $p(x, y)$
Does there really exist the distribution of people / pages / texts?
52. Statistical Machine Learning
Machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probabilistic distribution $p(x, y)$
Does there really exist the distribution of people / pages / texts?
In HEP these distributions do exist
53. Optimal classification. Bayes optimal classifier
Assuming that we know the real distributions $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule:
$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\, p(x \mid y)}{p(x)}$$
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$$
54. Optimal classification. Bayes optimal classifier
Assuming that we know the real distributions $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule:
$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\, p(x \mid y)}{p(x)}$$
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$$
Lemma (Neyman–Pearson):
The best classification quality is provided by $\dfrac{p(y = 1 \mid x)}{p(y = 0 \mid x)}$ (Bayes optimal classifier)
55. Optimal Binary Classification
The Bayes optimal classifier has the highest possible ROC curve.
Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives optimal classification quality too!
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
56. Optimal Binary Classification
The Bayes optimal classifier has the highest possible ROC curve.
Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives optimal classification quality too!
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
How can we estimate the terms in this expression?
57. Histograms density estimation
Counting number of samples in each bin and normalizing.
fast
choice of binning is crucial
number of bins grows exponentially → curse of dimensionality
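A sketch of histogram density estimation with numpy (the toy sample stands in for real data):

```python
import numpy as np

rng = np.random.RandomState(0)
sample = rng.normal(size=10000)

# counts per bin, normalized so that the histogram integrates to 1
density, bin_edges = np.histogram(sample, bins=50, density=True)
```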
58. Kernel density estimation
$$f(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - x_i}{h}\right)$$
$K(x)$ is the kernel, $h$ is the bandwidth
Typically, a gaussian kernel is used, but there are many others.
The approach is very close to weighted $k$NN.
60. Kernel density estimation
bandwidth selection
Silverman's rule of thumb:
$$h = \hat{\sigma} \left( \frac{4}{3n} \right)^{1/5}$$
may be irrelevant if the data is far from being gaussian
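A sketch of KDE with scikit-learn, plugging in the Silverman bandwidth from the slide (the 1-D toy sample is a placeholder):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
sample = rng.normal(size=(1000, 1))

# Silverman's rule of thumb: h = sigma_hat * (4 / (3 n)) ** (1/5)
n = len(sample)
h = sample.std() * (4.0 / (3.0 * n)) ** 0.2

kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(sample)
log_density = kde.score_samples(np.linspace(-3, 3, 100)[:, None])
```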
61. Parametric density estimation
Family of density functions: $f(x; \theta)$.
Problem: estimate the parameters of a Gaussian distribution
$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
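For the Gaussian family the maximum-likelihood estimates are simply the sample mean and sample covariance; a sketch with numpy on toy data:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[1.0, -1.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=5000)

mu_hat = X.mean(axis=0)                          # ML estimate of mu
sigma_hat = np.cov(X, rowvar=False, bias=True)   # ML estimate of Sigma (1/n normalization)
print(mu_hat, sigma_hat, sep="\n")
```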
64. QDA complexity
$n$ samples, $d$ dimensions
training consists of fitting $p(x \mid y = 0)$ and $p(x \mid y = 1)$ and takes $O(n d^2 + d^3)$:
computing the covariance matrix: $O(n d^2)$
inverting the covariance matrix: $O(d^3)$
prediction takes $O(d^2)$ for each sample, spent on computing the dot product
65. QDA overview
simple decision rule
fast prediction
many parameters to reconstruct in high dimensions
data almost never has gaussian distribution
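A sketch of QDA with scikit-learn (toy two-class data as a placeholder):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1.0, size=(500, 2)),
               rng.normal(2, 1.5, size=(500, 2))])
y = np.array([0] * 500 + [1] * 500)

# fits a gaussian p(x | y) per class, then applies the Bayes formula
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.predict_proba([[1.0, 1.0]]))
```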
66. Gaussian mixtures for density estimation
Mixture of distributions:
$$f(x) = \sum_{c\,\text{-components}} \pi_c f_c(x, \theta_c), \qquad \sum_{c\,\text{-components}} \pi_c = 1$$
Mixture of Gaussian distributions:
$$f(x) = \sum_{c\,\text{-components}} \pi_c f(x; \mu_c, \Sigma_c)$$
Parameters to be found: $\pi_1, \ldots, \pi_C$, $\mu_1, \ldots, \mu_C$, $\Sigma_1, \ldots, \Sigma_C$
68. Gaussian mixtures: finding parameters
Criterion is maximizing the likelihood (using MLE to find optimal parameters):
$$\sum_i \log f(x_i; \theta) \to \max_\theta$$
no analytic solution
we can use general-purpose optimization methods
69. Gaussian mixtures: finding parameters
Criterion is maximizing the likelihood (using MLE to find optimal parameters):
$$\sum_i \log f(x_i; \theta) \to \max_\theta$$
no analytic solution
we can use general-purpose optimization methods
In mixtures the parameters are split in two groups:
$\theta_1, \ldots, \theta_C$ — parameters of components
$\pi_1, \ldots, \pi_C$ — contributions of components
70. Expectation-Maximization algorithm [Dempster et al., 1977]
Idea: introduce a set of hidden variables $\pi_c(x)$
Expectation:
$$\pi_c(x) \leftarrow p(x \in c) = \frac{\pi_c f_c(x; \theta_c)}{\sum_{\tilde{c}} \pi_{\tilde{c}} f_{\tilde{c}}(x; \theta_{\tilde{c}})}$$
Maximization:
$$\pi_c \leftarrow \frac{1}{n} \sum_i \pi_c(x_i), \qquad \theta_c \leftarrow \arg\max_\theta \sum_i \pi_c(x_i) \log f_c(x_i, \theta_c)$$
The maximization step is trivial for Gaussian distributions.
The EM-algorithm is more stable and has good convergence properties.
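A sketch of the same procedure with scikit-learn's Gaussian mixture, which runs EM under the hood (toy two-component data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1.0, size=(300, 2)),
               rng.normal(3, 0.5, size=(300, 2))])

# EM fits the pi_c, mu_c, Sigma_c of the mixture
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
print(gmm.weights_)                 # pi_c
print(gmm.means_)                   # mu_c
log_density = gmm.score_samples(X)  # log f(x) of the fitted mixture
```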
73. A classification model based on mixture density estimation is called MDA (mixture discriminant analysis)
Generative approach
Generative approach: trying to reconstruct $p(x, y)$, then use the Bayes classification formula to predict.
QDA, MDA are generative classifiers.
74. A classification model based on mixture density estimation is called MDA (mixture discriminant analysis)
Generative approach
Generative approach: trying to reconstruct $p(x, y)$, then use the Bayes classification formula to predict.
QDA, MDA are generative classifiers.
Problems of the generative approach
Real-life distributions can hardly be reconstructed
Especially in high-dimensional spaces
So, we switch to the discriminative approach: guessing $p(y \mid x)$ directly
76. If we can avoid density estimation, we'd better do it.
77. Linear decision rule
Decision function is linear:
$$d(x) = \langle w, x \rangle + w_0$$
$$\begin{cases} d(x) > 0 \;\to\; \hat{y} = +1 \\ d(x) < 0 \;\to\; \hat{y} = -1 \end{cases}$$
This is a parametric model (finding parameters $w$, $w_0$).
QDA & MDA are parametric as well.
78. Finding Optimal Parameters
A good initial guess: find such $w$, $w_0$ that the error of classification is minimal:
$$\text{error} = \sum_{i \in \text{events}} \mathbb{1}_{y_i \neq \hat{y}_i}, \qquad \hat{y}_i = \text{sgn}(d(x_i))$$
Notation: $\mathbb{1}_{\text{true}} = 1$, $\mathbb{1}_{\text{false}} = 0$.
Discontinuous optimization (arrrrgh!)
79. Finding Optimal Parameters - 2
Discontinuous optimization
solution: let's make the decision rule smooth
$$p_{+1}(x) = f(d(x)), \qquad p_{-1}(x) = 1 - p_{+1}(x)$$
$$\begin{cases} f(0) = 0.5 & \\ f(x) > 0.5 & \text{if } x > 0 \\ f(x) < 0.5 & \text{if } x < 0 \end{cases}$$
81. Logistic regression
Define probabilities obtained with the logistic function
$$d(x) = \langle w, x \rangle + w_0, \qquad p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$
and optimize the log-likelihood:
$$\sum_{i \in \text{events}} -\ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$$
Important exercise: find an expression and build a plot for $L(x_i, y_i) = -\ln(p_{y_i}(x_i))$
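A sketch of the pieces above in numpy, plus the same fit via scikit-learn (toy data; scikit-learn expects labels in {0, 1} rather than {-1, +1}):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigma(z):
    # logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
d_true = X @ np.array([1.0, -2.0]) + 0.3
y = (d_true + rng.normal(scale=0.5, size=500) > 0).astype(int)

# LogisticRegression maximizes the same log-likelihood (with some regularization by default)
clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)   # w, w_0
print(clf.predict_proba(X[:3]))    # [sigma(-d(x)), sigma(d(x))] per event
```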
82. Linear model for regression
How to use the linear function
$$d(x) = \langle w, x \rangle + w_0$$
for regression?
Simplification of notation: $x_0 = 1$, $x = (1, x_1, \ldots, x_d)$, so
$$d(x) = \langle w, x \rangle$$
83. Linear regression (ordinary least squares)
We can use the linear function for regression: $d(x_i) = y_i$, with $d(x) = \langle w, x \rangle$.
This is a linear system with $d + 1$ variables and $n$ equations.
Minimize OLS aka MSE (mean squared error):
$$\sum_i (d(x_i) - y_i)^2 \to \min$$
Explicit solution:
$$\left( \sum_i x_i x_i^T \right) w = \sum_i y_i x_i$$
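A sketch of the explicit solution with numpy on toy data (the true weights below are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend x_0 = 1
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=n)

# explicit solution of (sum_i x_i x_i^T) w = sum_i y_i x_i
w = np.linalg.solve(X.T @ X, X.T @ y)

# numerically safer equivalent via least squares
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, w_lstsq, sep="\n")
```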
84. Linear regression
can use some other error
but no explicit solution in other cases
demonstrates properties of linear models:
reliable estimates when $n \gg d$
able to completely fit to the data if $n = d$
undefined when $n < d$
85. Data Scientist Pipeline
Experiments in appropriate high-level language or environment
After experiments are over — implement final algorithm in low-level
language (C++, CUDA, FPGA)
Second point is not always needed
87. Scientific Python
Scikit-learn
most popular library for machine learning
Scipy
libraries for science and engineering
root_numpy
convenient way to work with ROOT files