Introduction to machine learning terminology.
Applications within High Energy Physics and outside HEP.
* Basic problems: classification and regression.
* Nearest neighbours approach and spatial indices
* Overfitting (intro)
* Curse of dimensionality
* ROC curve, ROC AUC
* Bayes optimal classifier
* Density estimation: KDE and histograms
* Parametric density estimation
* Mixtures for density estimation and EM algorithm
* Generative approach vs discriminative approach
* Linear decision rule, intro to logistic regression
* Linear regression
1. Machine Learning in High Energy Physics
Lectures 1 & 2
Alex Rogozhnikov
Lund, MLHEP 2016
2. Intro notes
two tracks:
introductory course (this one)
advanced track: Mon, Tue, Wed, then two tracks are merged
Introductory track:
two lectures and two practice seminars on each day
Kaggle challenges
'Triggers' — only for advanced track, lasts for 3 days
'Higgs' — for both tracks, lasts for 7 days
Know the material? Spend more time on challenges!
3. Intro notes — 2
chat rooms
gitter
if you want to share something between teams — please do it publicly (via chat)
repository
glossary is in the repository
4. What is Machine Learning about?
a method of teaching computers to make and improve predictions or
behaviors based on some data?
a field of computer science, probability theory, and optimization theory
which allows complex tasks to be solved for which a logical/procedural
approach would not be possible or feasible?
a type of AI that provides computers with the ability to learn without being
explicitly programmed?
something in between statistics, AI, optimization theory, signal processing
and pattern matching?
5. What is Machine Learning about
Inference of statistical dependencies which gives us the ability to predict
6. What is Machine Learning about
Inference of statistical dependencies which gives us the ability to predict
Data is cheap, knowledge is precious
7. Machine Learning is used in
search engines
spam detection
security: virus detection, DDoS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn
prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
8. Machine Learning is used in
search engines
spam detection
security: virus detection, DDoS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn
prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
and hundreds more
9. Machine Learning in High Energy Physics
Triggers (LHCb, CMS to join soon)
Particle identification
Calibration
Tagging
Stripping line
Analysis
10. Machine Learning in High Energy Physics
Triggers (LHCb, CMS to join soon)
Particle identification
Calibration
Tagging
Stripping line
Analysis
At each stage different data is used and different information is inferred, but
the underlying ideas are quite similar.
11. General notion
In supervised learning the training data is represented as a set of pairs $(x_i, y_i)$
$i$ is an index of event
$x_i$ is a vector of features available for event $i$
$y_i$ is a target — the value we need to predict
features = observables = variables
12. Classification problem
$y_i \in Y$, where $Y$ is a finite set of labels.
Examples
particle identification based on information about track:
$x_i = (p, \eta, E, \text{charge}, \chi^2_{PV}, \text{FlightTime})$
$Y = \{\text{electron}, \text{muon}, \text{pion}, \ldots\}$
binary classification: $Y = \{0, 1\}$, 1 is signal, 0 is background
14. Regression problem
$y \in \mathbb{R}$
Examples:
predicting the price of a house by its position
predicting the number of customers / money income
reconstructing the real momentum of a particle
Why do we need automatic classification/regression?
in applications, up to thousands of features
higher quality
much faster adaptation to new problems
15. Classification based on nearest neighbours
Given a training set of objects and their labels $\{x_i, y_i\}$, we predict the label for a new observation $x$:
$$\hat{y} = y_j, \qquad j = \arg\min_i \rho(x, x_i)$$
Here and after, $\rho(x, \tilde{x})$ is the distance in the space of features.
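A minimal sketch of this rule using scikit-learn's nearest-neighbour classifier (the toy arrays `X_train`, `y_train` and the query point are placeholders, not the lecture's data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: two gaussian blobs standing in for real features
rng = np.random.RandomState(0)
X_train = np.vstack([rng.normal(0, 1, size=(100, 2)),
                     rng.normal(2, 1, size=(100, 2))])
y_train = np.array([0] * 100 + [1] * 100)

# n_neighbors=1 reproduces the rule above: copy the label of the closest event
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
print(clf.predict([[1.0, 1.0]]))
```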
16. Visualization of decision rule
Consider a classification problem with 2 features:
$x_i = (x_i^1, x_i^2)$, $\quad y_i \in Y = \{0, 1\}$
17. $k$ Nearest Neighbours ($k$NN)
A better way is to use $k$ neighbours:
$$p_{\tilde{y}}(x) = \frac{\#\ \text{of } k\text{NN events of } x \text{ in class } \tilde{y}}{k}$$
21. Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
22. Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
quality is lower when $k > 1$
23. Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
quality is lower when $k > 1$
this doesn't mean $k = 1$ is the best,
it means we cannot use training events to estimate quality
when a classifier's decision rule is too complex and captures details from the
training data that are not relevant to the distribution, we call this overfitting
(more details tomorrow)
25. $k$NN with weights
Average the neighbours' outputs with weights:
$$\hat{y} = \frac{\sum_{j \in \text{knn}(x)} w_j y_j}{\sum_{j \in \text{knn}(x)} w_j}$$
the closer the neighbour, the higher the weight of its contribution, e.g.:
$$w_j = 1 / \rho(x, x_j)$$
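A sketch of the same idea with scikit-learn, where `weights='distance'` corresponds to $w_j = 1/\rho(x, x_j)$ (the toy data below is a placeholder):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# weights='distance' averages the neighbours' labels with weights 1 / distance
clf = KNeighborsClassifier(n_neighbors=10, weights='distance')
clf.fit(X_train, y_train)
print(clf.predict_proba([[0.5, 0.5]]))
```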
26. Computational complexity
Given that the dimensionality of the space is $d$ and there are $n$ training samples:
training time: ~ O(save a link to the data)
prediction time: $O(n \times d)$ for each sample
28. Ball tree
training time ~ $O(d \times n \log(n))$
prediction time ~ $O(\log(n) \times d)$ for each sample
Another option exists: KD-tree.
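A sketch of building and querying a ball tree with scikit-learn (toy data as a placeholder):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X_train = rng.normal(size=(1000, 3))

# building the tree costs roughly O(d * n * log(n))
tree = BallTree(X_train)

# each query costs roughly O(d * log(n)) instead of the brute-force O(d * n)
dist, ind = tree.query(rng.normal(size=(5, 3)), k=10)
print(ind[0])  # indices of the 10 nearest training events for the first query
```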
29. Overview of $k$NN
Awesomely simple classifier and regressor
Provides too optimistic quality on training data
Quite slow, though optimizations exist
Too sensitive to scale of features
Hard times with data of high dimensions
30. Sensitivity to scale of features
Euclidean distance:
$$\rho(x, \tilde{x})^2 = (x_1 - \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \cdots + (x_d - \tilde{x}_d)^2$$
33. Distance function matters
Minkowski distance
$$\rho(x, \tilde{x})^p = \sum_l (x_l - \tilde{x}_l)^p$$
Canberra
$$\rho(x, \tilde{x}) = \sum_l \frac{|x_l - \tilde{x}_l|}{|x_l| + |\tilde{x}_l|}$$
Cosine metric
$$\rho(x, \tilde{x}) = \frac{\langle x, \tilde{x} \rangle}{|x|\,|\tilde{x}|}$$
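These metrics are available in `scipy.spatial.distance`; a short sketch with toy vectors (note that scipy's `cosine` returns one minus the cosine similarity shown above):

```python
import numpy as np
from scipy.spatial.distance import minkowski, canberra, cosine

x = np.array([1.0, 2.0, 3.0])
x_tilde = np.array([2.0, 0.0, 3.0])

print(minkowski(x, x_tilde, p=3))   # Minkowski distance with p = 3
print(canberra(x, x_tilde))         # sum_l |x_l - x~_l| / (|x_l| + |x~_l|)
print(cosine(x, x_tilde))           # 1 - <x, x~> / (|x| |x~|)
```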
34. Problems with high dimensions
With higher dimensions the neighbouring points are further away.
Example: consider $n$ training data points distributed uniformly in the unit cube, $d \gg 1$:
the expected number of points in a ball of radius $r$ is proportional to $r^d$
to collect the same number of $k$ neighbours, we need to take $r = \text{const}^{1/d} \to 1$
$k$NN suffers from the curse of dimensionality.
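A quick numeric illustration of the $r = \text{const}^{1/d} \to 1$ effect (the 1% fraction is an arbitrary choice for illustration):

```python
# radius needed for a ball to contain a fixed fraction (here 1%) of points
# distributed uniformly in the unit cube: volume ~ r^d, so r = fraction ** (1/d)
for d in [2, 10, 100, 1000]:
    r = 0.01 ** (1.0 / d)
    print(d, round(r, 3))
# prints roughly: 2 0.1, 10 0.631, 100 0.955, 1000 0.995
```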
35. Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red)
Which classifier provides better discrimination?
36. Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue and background is red)
Which classifier provides better discrimination?
Discrimination is identical in all three cases
39. ROC curve
These distributions have the same ROC curve:
(the ROC curve is the dependence of passed signal vs passed background)
40. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
41. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
42. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
ROC curve = information about order of events:
b b s b s b ... s s b s s
43. ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
ROC curve = information about order of events:
b b s b s b ... s s b s s
Comparison of algorithms should be based on the information from ROC
curve.
46. ROC AUC (area under the ROC curve)
$$\text{ROC AUC} = P(r_b < r_s)$$
where $r_b$, $r_s$ are the predictions of random background and signal events.
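A sketch of computing the ROC curve and ROC AUC with scikit-learn (labels and predictions are toy placeholders):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=1000)                     # 1 = signal, 0 = background
predictions = y_true + rng.normal(scale=1.5, size=1000)   # toy classifier output

fpr, tpr, thresholds = roc_curve(y_true, predictions)  # background vs signal efficiency at each threshold
print(roc_auc_score(y_true, predictions))              # estimates P(r_b < r_s)
```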
47. Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very little background)
48. Classifiers have the same ROC AUC, but which is better for triggers at the LHC? (we need to pass very little background)
Applications frequently demand a different metric.
50. Recapitulation
1. Statistical ML: applications and problems
2. ML in HEP
3. $k$ nearest neighbours classifier and regressor
4. ROC curve, ROC AUC
51. Statistical Machine Learning
Machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probabilistic distribution $p(x, y)$
Does there really exist the distribution of people / pages / texts?
52. Statistical Machine Learning
Machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probabilistic distribution $p(x, y)$
Does there really exist the distribution of people / pages / texts?
In HEP these distributions do exist
53. Optimal classification. Bayes optimal classifier
Assuming that we know the real distributions $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule:
$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\, p(x \mid y)}{p(x)}$$
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$$
54. Optimal classification. Bayes optimal classifier
Assuming that we know the real distributions $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule:
$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\, p(x \mid y)}{p(x)}$$
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$$
Lemma (Neyman–Pearson):
The best classification quality is provided by $\dfrac{p(y = 1 \mid x)}{p(y = 0 \mid x)}$ (Bayes optimal classifier)
55. Optimal Binary Classification
The Bayes optimal classifier has the highest possible ROC curve.
Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives optimal classification quality too!
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
56. Optimal Binary Classification
The Bayes optimal classifier has the highest possible ROC curve.
Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives optimal classification quality too!
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
How can we estimate the terms in this expression?
57. Histograms density estimation
Counting number of samples in each bin and normalizing.
fast
choice of binning is crucial
number of bins grows exponentially → curse of dimensionality
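A sketch of histogram density estimation with numpy (the toy sample stands in for real data):

```python
import numpy as np

rng = np.random.RandomState(0)
sample = rng.normal(size=10000)

# counts per bin, normalized so that the histogram integrates to 1
density, bin_edges = np.histogram(sample, bins=50, density=True)
```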
58. Kernel density estimation
$$f(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - x_i}{h}\right)$$
$K(x)$ is the kernel, $h$ is the bandwidth
Typically, a gaussian kernel is used, but there are many others.
The approach is very close to weighted $k$NN.
60. Kernel density estimation
bandwidth selection
Silverman's rule of thumb:
$$h = \hat{\sigma} \left( \frac{4}{3n} \right)^{1/5}$$
may be irrelevant if the data is far from being gaussian
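A sketch of KDE with scikit-learn, plugging in the Silverman bandwidth from the slide (the 1-D toy sample is a placeholder):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
sample = rng.normal(size=(1000, 1))

# Silverman's rule of thumb: h = sigma_hat * (4 / (3 n)) ** (1/5)
n = len(sample)
h = sample.std() * (4.0 / (3.0 * n)) ** 0.2

kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(sample)
log_density = kde.score_samples(np.linspace(-3, 3, 100)[:, None])
```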
61. Parametric density estimation
Family of density functions: $f(x; \theta)$.
Problem: estimate the parameters of a Gaussian distribution
$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
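For the Gaussian family the maximum-likelihood estimates are simply the sample mean and sample covariance; a sketch with numpy on toy data:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[1.0, -1.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=5000)

mu_hat = X.mean(axis=0)                          # ML estimate of mu
sigma_hat = np.cov(X, rowvar=False, bias=True)   # ML estimate of Sigma (1/n normalization)
print(mu_hat, sigma_hat, sep="\n")
```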
64. QDA complexity
$n$ samples, $d$ dimensions
training consists of fitting $p(x \mid y = 0)$ and $p(x \mid y = 1)$ and takes $O(n d^2 + d^3)$:
computing the covariance matrix: $O(n d^2)$
inverting the covariance matrix: $O(d^3)$
prediction takes $O(d^2)$ for each sample, spent on computing the dot product
65. QDA overview
simple decision rule
fast prediction
many parameters to reconstruct in high dimensions
data almost never has gaussian distribution
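A sketch of QDA with scikit-learn (toy two-class data as a placeholder):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1.0, size=(500, 2)),
               rng.normal(2, 1.5, size=(500, 2))])
y = np.array([0] * 500 + [1] * 500)

# fits a gaussian p(x | y) per class, then applies the Bayes formula
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.predict_proba([[1.0, 1.0]]))
```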
66. Gaussian mixtures for density estimation
Mixture of distributions:
$$f(x) = \sum_{c\,\text{-components}} \pi_c f_c(x, \theta_c), \qquad \sum_{c\,\text{-components}} \pi_c = 1$$
Mixture of Gaussian distributions:
$$f(x) = \sum_{c\,\text{-components}} \pi_c f(x; \mu_c, \Sigma_c)$$
Parameters to be found: $\pi_1, \ldots, \pi_C$, $\mu_1, \ldots, \mu_C$, $\Sigma_1, \ldots, \Sigma_C$
68. Gaussian mixtures: finding parameters
Criterion is maximizing the likelihood (using MLE to find optimal parameters):
$$\sum_i \log f(x_i; \theta) \to \max_\theta$$
no analytic solution
we can use general-purpose optimization methods
69. Gaussian mixtures: finding parameters
Criterion is maximizing the likelihood (using MLE to find optimal parameters):
$$\sum_i \log f(x_i; \theta) \to \max_\theta$$
no analytic solution
we can use general-purpose optimization methods
In mixtures the parameters are split in two groups:
$\theta_1, \ldots, \theta_C$ — parameters of components
$\pi_1, \ldots, \pi_C$ — contributions of components
70. Expectation-Maximization algorithm [Dempster et al., 1977]
Idea: introduce a set of hidden variables $\pi_c(x)$
Expectation:
$$\pi_c(x) \leftarrow p(x \in c) = \frac{\pi_c f_c(x; \theta_c)}{\sum_{\tilde{c}} \pi_{\tilde{c}} f_{\tilde{c}}(x; \theta_{\tilde{c}})}$$
Maximization:
$$\pi_c \leftarrow \frac{1}{n} \sum_i \pi_c(x_i), \qquad \theta_c \leftarrow \arg\max_\theta \sum_i \pi_c(x_i) \log f_c(x_i, \theta_c)$$
The maximization step is trivial for Gaussian distributions.
The EM-algorithm is more stable and has good convergence properties.
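A sketch of the same procedure with scikit-learn's Gaussian mixture, which runs EM under the hood (toy two-component data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1.0, size=(300, 2)),
               rng.normal(3, 0.5, size=(300, 2))])

# EM fits the pi_c, mu_c, Sigma_c of the mixture
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
print(gmm.weights_)                 # pi_c
print(gmm.means_)                   # mu_c
log_density = gmm.score_samples(X)  # log f(x) of the fitted mixture
```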
73. A classification model based on mixture density estimation is called MDA (mixture discriminant analysis)
Generative approach
Generative approach: trying to reconstruct $p(x, y)$, then use the Bayes classification formula to predict.
QDA, MDA are generative classifiers.
74. A classification model based on mixture density estimation is called MDA (mixture discriminant analysis)
Generative approach
Generative approach: trying to reconstruct $p(x, y)$, then use the Bayes classification formula to predict.
QDA, MDA are generative classifiers.
Problems of the generative approach
Real-life distributions can hardly be reconstructed
Especially in high-dimensional spaces
So, we switch to the discriminative approach: guessing $p(y \mid x)$ directly
76. If we can avoid density estimation, we'd better do it.
77. Linear decision rule
Decision function is linear:
$$d(x) = \langle w, x \rangle + w_0$$
$$\begin{cases} d(x) > 0 \;\to\; \hat{y} = +1 \\ d(x) < 0 \;\to\; \hat{y} = -1 \end{cases}$$
This is a parametric model (finding parameters $w$, $w_0$).
QDA & MDA are parametric as well.
78. Finding Optimal Parameters
A good initial guess: find such $w$, $w_0$ that the error of classification is minimal:
$$\text{error} = \sum_{i \in \text{events}} \mathbb{1}_{y_i \neq \hat{y}_i}, \qquad \hat{y}_i = \text{sgn}(d(x_i))$$
Notation: $\mathbb{1}_{\text{true}} = 1$, $\mathbb{1}_{\text{false}} = 0$.
Discontinuous optimization (arrrrgh!)
79. Finding Optimal Parameters - 2
Discontinuous optimization
solution: let's make the decision rule smooth
$$p_{+1}(x) = f(d(x)), \qquad p_{-1}(x) = 1 - p_{+1}(x)$$
$$\begin{cases} f(0) = 0.5 & \\ f(x) > 0.5 & \text{if } x > 0 \\ f(x) < 0.5 & \text{if } x < 0 \end{cases}$$
81. Logistic regression
Define probabilities obtained with the logistic function
$$d(x) = \langle w, x \rangle + w_0, \qquad p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$
and optimize the log-likelihood:
$$\sum_{i \in \text{events}} -\ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$$
Important exercise: find an expression and build a plot for $L(x_i, y_i) = -\ln(p_{y_i}(x_i))$
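A sketch of the pieces above in numpy, plus the same fit via scikit-learn (toy data; scikit-learn expects labels in {0, 1} rather than {-1, +1}):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigma(z):
    # logistic (sigmoid) function
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
d_true = X @ np.array([1.0, -2.0]) + 0.3
y = (d_true + rng.normal(scale=0.5, size=500) > 0).astype(int)

# LogisticRegression maximizes the same log-likelihood (with some regularization by default)
clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)   # w, w_0
print(clf.predict_proba(X[:3]))    # [sigma(-d(x)), sigma(d(x))] per event
```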
82. Linear model for regression
How to use the linear function
$$d(x) = \langle w, x \rangle + w_0$$
for regression?
Simplification of notation: $x_0 = 1$, $x = (1, x_1, \ldots, x_d)$, so
$$d(x) = \langle w, x \rangle$$
83. Linear regression (ordinary least squares)
We can use the linear function for regression: $d(x_i) = y_i$, with $d(x) = \langle w, x \rangle$.
This is a linear system with $d + 1$ variables and $n$ equations.
Minimize OLS aka MSE (mean squared error):
$$\sum_i (d(x_i) - y_i)^2 \to \min$$
Explicit solution:
$$\left( \sum_i x_i x_i^T \right) w = \sum_i y_i x_i$$
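A sketch of the explicit solution with numpy on toy data (the true weights below are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend x_0 = 1
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=n)

# explicit solution of (sum_i x_i x_i^T) w = sum_i y_i x_i
w = np.linalg.solve(X.T @ X, X.T @ y)

# numerically safer equivalent via least squares
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, w_lstsq, sep="\n")
```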
84. Linear regression
can use some other error
but no explicit solution in other cases
demonstrates properties of linear models:
reliable estimates when $n \gg d$
able to completely fit to the data if $n = d$
undefined when $n < d$
85. Data Scientist Pipeline
Experiments in appropriate high-level language or environment
After experiments are over — implement final algorithm in low-level
language (C++, CUDA, FPGA)
Second point is not always needed
87. Scientific Python
Scikit-learn
most popular library for machine learning
Scipy
libraries for science and engineering
root_numpy
convenient way to work with ROOT files