Machine Learning in High Energy Physics
Lectures 1 & 2
Alex Rogozhnikov
Lund, MLHEP 2016
1 / 87
Intro notes
two tracks:
introductory course (this one)
advanced track: Mon, Tue, Wed, then two tracks are merged
Introductory track:
two lectures and two practice seminars on each day
Kaggle challenges
'Triggers' — only for advanced track, lasts for 3 days
'Higgs' — for both tracks, lasts for 7 days
know material? Spend more time on challenges!
1 / 87
Intro notes — 2
chat rooms
gitter
if you want to share something between teams — please do it publicly
(via chat)
repository
glossary is in the repository
2 / 87
What is Machine Learning about?
a method of teaching computers to make and improve predictions or
behaviors based on some data?
a field of computer science, probability theory, and optimization theory
which allows complex tasks to be solved for which a logical/procedural
approach would not be possible or feasible?
a type of AI that provides computers with the ability to learn without being
explicitly programmed?
something in between statistics, AI, optimization theory, signal processing, and pattern matching?
3 / 87
What is Machine Learning about
Inference of statistical dependencies which gives us the ability to predict
4 / 87
What is Machine Learning about
Inference of statistical dependencies which gives us the ability to predict
Data is cheap, knowledge is precious
5 / 87
Machine Learning is used in
search engines
spam detection
security: virus detection, DDOS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn
prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
6 / 87
Machine Learning is used in
search engines
spam detection
security: virus detection, DDOS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn
prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
and hundreds more
7 / 87
Machine Learning in High Energy Physics
Triggers (LHCb, CMS to join soon)
Particle identification
Calibration
Tagging
Stripping line
Analysis
8 / 87
Machine Learning in High Energy Physics
Triggers (LHCb, CMS to join soon)
Particle identification
Calibration
Tagging
Stripping line
Analysis
At each stage different data is used and different information is inferred, but
the ideas behind them are quite similar.
9 / 87
General notion
In supervised learning the training data is represented as a set of pairs $(x_i, y_i)$:
$i$ is an index of an event
$x_i$ is a vector of features available for the event
$y_i$ is a target — the value we need to predict
features = observables = variables
10 / 87
Classification problem
$y_i \in Y$, where $Y$ is a finite set of labels.
Examples
particle identification based on information about the track:
$x_i = (p, \eta, E, \text{charge}, \chi^2_{PV}, \text{FlightTime})$,
$Y = \{\text{electron}, \text{muon}, \text{pion}, \dots\}$
binary classification: $Y = \{0, 1\}$, where $1$ is signal, $0$ is background
11 / 87
Regression problem
$y \in \mathbb{R}$
Examples:
predicting the price of a house by its position
predicting the number of customers / money income
reconstructing the real momentum of a particle
12 / 87
Regression problem
$y \in \mathbb{R}$
Examples:
predicting the price of a house by its position
predicting the number of customers / money income
reconstructing the real momentum of a particle
Why do we need automatic classification/regression?
in applications, up to thousands of features
higher quality
much faster adaptation to new problems
13 / 87
Classification based on nearest neighbours
Given a training set of objects $\{x_i, y_i\}$ and their labels, we predict the label for a
new observation $x$:
$$\hat{y} = y_j, \qquad j = \arg\min_i \rho(x, x_i)$$
Here and in the following, $\rho(x, \tilde{x})$ is the distance in the space of features.
14 / 87
Visualization of decision rule
Consider a classification problem with 2 features:
$x_i = (x_i^1, x_i^2)$, $y_i \in Y = \{0, 1\}$
15 / 87
Nearest Neighbours (kNN)
A better way is to use $k$ neighbours:
$$p_{\tilde{y}}(x) = \frac{\#\text{ of knn events of } x \text{ in class } \tilde{y}}{k}$$
16 / 87
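Below is a minimal sketch (not part of the original slides) of the kNN classifier just described, using scikit-learn's KNeighborsClassifier; the two-blob toy dataset and all variable names are made up for illustration.

```python
# Toy kNN classification: k = 5 neighbours vote on the class of each point.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
# two gaussian blobs: background (label 0) and signal (label 1)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(2, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
# predicted p(class | x) = fraction of the k nearest neighbours belonging to each class
print(knn.predict_proba(X[:3]))
```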
17 / 87
k = 1, 2, 5, 30
18 / 87
Overfitting
What is the quality of classification on the training dataset when $k = 1$?
19 / 87
Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
20 / 87
Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
quality is lower when $k > 1$
21 / 87
Overfitting
What is the quality of classification on the training dataset when $k = 1$?
answer: it is ideal (the closest neighbour is the event itself)
quality is lower when $k > 1$
this doesn't mean $k = 1$ is the best;
it means we cannot use training events to estimate quality
when a classifier's decision rule is too complex and captures details of the
training data that are not relevant to the distribution, we call this overfitting
(more details tomorrow)
22 / 87
Regression using kNN
Regression with $k$ nearest neighbours is done by averaging the outputs:
$$\hat{y} = \frac{1}{k} \sum_{j \in \text{knn}(x)} y_j$$
23 / 87
kNN with weights
Average the neighbours' outputs with
weights: the closer the neighbour, the
higher the weight of its contribution,
i.e.:
$$\hat{y} = \frac{\sum_{j \in \text{knn}(x)} w_j\, y_j}{\sum_{j \in \text{knn}(x)} w_j}, \qquad w_j = 1/\rho(x, x_j)$$
24 / 87
Computational complexity
Given that the dimensionality of the space is $d$ and there are $n$ training samples:
training time: ~ O(save a link to the data)
prediction time: $n \times d$ for each sample
25 / 87
Spatial index: ball tree
26 / 87
Ball tree
training time ~ $O(d \times n \log n)$
prediction time ~ $\log(n) \times d$ for each sample
Another option exists: the KD-tree.
27 / 87
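A small sketch (assumed, not from the slides) of building and querying such a spatial index directly with scikit-learn's BallTree; the uniform toy data is made up.

```python
# Build the ball tree once (~O(d * n log n)), then each query costs ~O(d * log n).
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.uniform(size=(10000, 3))            # training samples
tree = BallTree(X)                          # spatial index over the training data
dist, ind = tree.query(rng.uniform(size=(5, 3)), k=10)   # 10 nearest neighbours per query point
print(ind.shape)                            # (5, 10): neighbour indices for the 5 query points
```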
Overview of kNN
Awesomely simple classifier and regressor
Provides too optimistic quality on training data
Quite slow, though optimizations exist
Too sensitive to the scale of features
Hard times with data of high dimensions
28 / 87
Sensitivity to scale of features
Euclidean distance:
$$\rho(x, \tilde{x})^2 = (x_1 - \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2$$
29 / 87
Sensitivity to scale of features
Euclidean distance:
$$\rho(x, \tilde{x})^2 = (x_1 - \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2$$
Change the scale of the first feature:
$$\rho(x, \tilde{x})^2 = (10x_1 - 10\tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2 \sim 100\,(x_1 - \tilde{x}_1)^2$$
30 / 87
Sensitivity to scale of features
Euclidean distance:
$$\rho(x, \tilde{x})^2 = (x_1 - \tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2$$
Change the scale of the first feature:
$$\rho(x, \tilde{x})^2 = (10x_1 - 10\tilde{x}_1)^2 + (x_2 - \tilde{x}_2)^2 + \dots + (x_d - \tilde{x}_d)^2 \sim 100\,(x_1 - \tilde{x}_1)^2$$
Scaling of features frequently increases quality.
31 / 87
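As a sketch of the point above (not from the slides): standardizing features before kNN so that no single feature dominates the Euclidean distance; the toy data with one badly scaled feature is made up.

```python
# Scale features to zero mean / unit variance, then apply kNN.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
X[:, 0] *= 100.0                            # first feature on a much larger scale
y = (X[:, 1] > 0).astype(int)               # only the second feature carries the label

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
model.fit(X, y)
print(model.score(X, y))                    # training accuracy (optimistic, as discussed above)
```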
Distance function matters
Minkowski distance:
$$\rho(x, \tilde{x})^p = \sum_l (x_l - \tilde{x}_l)^p$$
Canberra:
$$\rho(x, \tilde{x}) = \sum_l \frac{|x_l - \tilde{x}_l|}{|x_l| + |\tilde{x}_l|}$$
Cosine metric:
$$\rho(x, \tilde{x}) = \frac{\langle x, \tilde{x} \rangle}{|x|\,|\tilde{x}|}$$
32 / 87
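The three distances above written out directly in NumPy (a sketch, not from the slides); absolute values and the p-th root are added to the Minkowski formula so that it returns the distance itself, and the cosine metric is expressed as a distance, 1 minus the similarity.

```python
import numpy as np

def minkowski(x, x2, p):
    return np.sum(np.abs(x - x2) ** p) ** (1.0 / p)

def canberra(x, x2):
    return np.sum(np.abs(x - x2) / (np.abs(x) + np.abs(x2)))

def cosine_distance(x, x2):
    return 1.0 - np.dot(x, x2) / (np.linalg.norm(x) * np.linalg.norm(x2))

x, x2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 0.0])
print(minkowski(x, x2, p=2), canberra(x, x2), cosine_distance(x, x2))
```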
Problems with high dimensions
In higher dimensions ($d \gg 1$) the neighbouring points are further away.
Example: consider $n$ training data points distributed uniformly in the unit
cube:
the expected number of points in a ball of radius $r$ is proportional to $r^d$
to collect the same number of neighbours, we need to take $r = \text{const}^{1/d} \to 1$
kNN suffers from the curse of dimensionality.
33 / 87
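A quick numeric check of the statement above (assuming we want to keep a fixed 1% of uniformly distributed points, so the required radius behaves like $0.01^{1/d}$):

```python
# The fraction of points inside a neighbourhood of size r scales like r**d,
# so keeping 1% of them requires r = 0.01**(1/d), which tends to 1 as d grows.
for d in [1, 2, 10, 100, 1000]:
    print(d, round(0.01 ** (1.0 / d), 3))
# 1 0.01   2 0.1   10 0.631   100 0.955   1000 0.995
```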
Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue
and background is red)
Which classifier provides better discrimination?
34 / 87
Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue
and background is red)
Which classifier provides better discrimination?
Discrimination is identical in all three cases
35 / 87
ROC curve demonstration
36 / 87
ROC curve
37 / 87
ROC curve
These distributions have the same ROC
curve:
(the ROC curve shows the fraction of signal passed vs the fraction of background
passed)
38 / 87
ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
39 / 87
ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
40 / 87
ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
ROC curve = information about order of events:
b b s b s b ... s s b s s
41 / 87
ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may
achieve by setting threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve
doesn't contain this information
ROC curve = information about order of events:
b b s b s b ... s s b s s
Comparison of algorithms should be based on the information from ROC
curve.
42 / 87
Terminology and Conventions
fpr = background efficiency = b
tpr = signal efficiency = s
43 / 87
Terminology and Conventions
fpr = background efficiency = b
tpr = signal efficiency = s
44 / 87
ROC AUC (area under the ROC curve)
$$\text{ROC AUC} = P(r_b < r_s)$$
where $r_b, r_s$ are predictions of a random background event and a random signal
event, respectively.
45 / 87
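A sketch (not from the slides) of computing the ROC curve and ROC AUC with scikit-learn for toy classifier outputs, and checking the probabilistic definition above; the gaussian toy scores are made up.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
r_b = rng.normal(0.0, 1.0, size=1000)        # classifier outputs for background (y = 0)
r_s = rng.normal(1.0, 1.0, size=1000)        # classifier outputs for signal (y = 1)
y = np.array([0] * 1000 + [1] * 1000)
scores = np.concatenate([r_b, r_s])

fpr, tpr, thresholds = roc_curve(y, scores)  # fpr = background efficiency, tpr = signal efficiency
auc = roc_auc_score(y, scores)
# ROC AUC agrees with P(r_b < r_s), estimated over all background/signal pairs
print(auc, np.mean(r_b[:, None] < r_s[None, :]))
```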
Classifiers have the same ROC
AUC, but which is better for
triggers at the LHC? (we need
to pass very little background)
46 / 87
Classifiers have the same ROC
AUC, but which is better for
triggers at the LHC? (we need
to pass very little background)
Applications frequently demand
a different metric.
47 / 87
$n$-minute break
48 / 87
Recapitulation
1. Statistical ML: applications and problems
2. ML in HEP
3. $k$ nearest neighbours classifier and regressor
4. ROC curve, ROC AUC
49 / 87
Statistical Machine Learning
Machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probabilistic distribution $p(x, y)$
Does there really exist the distribution of people / pages / texts?
50 / 87
Statistical Machine Learning
Machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probabilistic distribution $p(x, y)$
Does there really exist the distribution of people / pages / texts?
In HEP these distributions do exist
51 / 87
Optimal classification. Bayes optimal classifier
Assuming that we know the real distribution $p(x, y)$, we reconstruct $p(y|x)$ using Bayes'
rule:
$$p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x|y)}{p(x)}$$
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\; p(x \mid y = 1)}{p(y = 0)\; p(x \mid y = 0)}$$
52 / 87
Optimal classification. Bayes optimal classifier
Assuming that we know the real distribution $p(x, y)$, we reconstruct $p(y|x)$ using Bayes'
rule:
$$p(y|x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x|y)}{p(x)}$$
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\; p(x \mid y = 1)}{p(y = 0)\; p(x \mid y = 0)}$$
Lemma (Neyman–Pearson):
The best classification quality is provided by $\dfrac{p(y = 1 \mid x)}{p(y = 0 \mid x)}$ (Bayes optimal
classifier)
53 / 87
Optimal Binary Classification
Bayes optimal classifier has the highest possible ROC curve.
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives
optimal classification quality too!
54 / 87
Optimal Binary Classification
Bayes optimal classifier has the highest possible ROC curve.
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
Since the classification quality depends only on the order, $p(y = 1 \mid x)$ gives
optimal classification quality too!
How can we estimate terms from this expression?
55 / 87
Histograms density estimation
Counting number of samples in each bin and normalizing.
fast
choice of binning is crucial
number of bins grows exponentially → curse of dimensionality
56 / 87
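A one-dimensional sketch of histogram density estimation with NumPy (not from the slides); the toy gaussian sample is made up.

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.normal(size=1000)
density, edges = np.histogram(data, bins=30, density=True)   # counts normalized to a density
print(np.sum(density * np.diff(edges)))                       # ~1.0: the estimate integrates to one
```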
Kernel density estimation
$$f(x) = \frac{1}{nh} \sum_i K\!\left(\frac{x - x_i}{h}\right)$$
$K(x)$ is the kernel, $h$ is the
bandwidth.
Typically, a gaussian kernel is
used,
but there are many others.
The approach is very close to
weighted kNN.
57 / 87
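A one-dimensional sketch of kernel density estimation with a gaussian kernel (scipy's gaussian_kde, not from the slides); the bandwidth is chosen by Silverman's rule, discussed on the next slide.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.RandomState(0)
data = rng.normal(size=1000)
kde = gaussian_kde(data, bw_method='silverman')   # gaussian kernel, Silverman bandwidth
grid = np.linspace(-4, 4, 9)
print(kde(grid))                                  # estimated density f(x) on the grid
```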
Kernel density estimation
bandwidth selection
Silverman's rule of thumb:
$$h = \hat{\sigma} \left(\frac{4}{3n}\right)^{1/5}$$
58 / 87
Kernel density estimation
bandwidth selection
Silverman's rule of thumb:
$$h = \hat{\sigma} \left(\frac{4}{3n}\right)^{1/5}$$
may be irrelevant if the data is far from
being gaussian
59 / 87
Parametric density estimation
Family of density functions: $f(x; \theta)$.
Problem: estimate the parameters of a Gaussian
distribution:
$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
60 / 87
QDA (Quadratic discriminant analysis)
Reconstructing the probabilities $p(x \mid y = 1)$, $p(x \mid y = 0)$ from data, assuming those
are multidimensional normal distributions:
$$p(x \mid y = 0) \sim \mathcal{N}(\mu_0, \Sigma_0), \qquad p(x \mid y = 1) \sim \mathcal{N}(\mu_1, \Sigma_1)$$
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)}\,\frac{p(x \mid y = 1)}{p(x \mid y = 0)} = \text{const}\,\frac{n_1}{n_0}\, \frac{\exp\!\left(-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)\right)}{\exp\!\left(-\frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)\right)}$$
$$= \exp\!\left(-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \text{const}\right)$$
61 / 87
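A sketch of QDA with scikit-learn (not from the slides): one gaussian is fitted per class and the prediction follows Bayes' rule as above; the two toy gaussian classes are made up.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.RandomState(0)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=500)    # background
X1 = rng.multivariate_normal([2, 1], [[1.5, -0.4], [-0.4, 0.8]], size=500)  # signal
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)                      # estimates mu_0, Sigma_0, mu_1, Sigma_1 and the class priors
print(qda.predict_proba(X[:3]))    # p(y | x) from the fitted gaussians
```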
62 / 87
QDA complexity
$n$ samples, $d$ dimensions
training consists of fitting $p(x \mid y = 0)$ and $p(x \mid y = 1)$ and takes $O(n d^2 + d^3)$:
computing the covariance matrix: $O(n d^2)$
inverting the covariance matrix: $O(d^3)$
prediction takes $O(d^2)$ for each sample, spent on computing the dot product
63 / 87
QDA overview
simple decision rule
fast prediction
many parameters to reconstruct in high dimensions
data almost never has gaussian distribution
64 / 87
Gaussian mixtures for density estimation
Mixture of distributions:
$$f(x) = \sum_{c\,\in\,\text{components}} \pi_c f_c(x, \theta_c), \qquad \sum_{c\,\in\,\text{components}} \pi_c = 1$$
Mixture of Gaussian distributions:
$$f(x) = \sum_{c\,\in\,\text{components}} \pi_c f(x; \mu_c, \Sigma_c)$$
Parameters to be found: $\pi_1, \dots, \pi_C$, $\mu_1, \dots, \mu_C$, $\Sigma_1, \dots, \Sigma_C$
65 / 87
66 / 87
Gaussian mixtures: finding parameters
Criterion is maximizing likelihood (using MLE to find optimal parameters):
$$\sum_i \log f(x_i; \theta) \to \max_\theta$$
no analytic solution
we can use general-purpose optimization methods
67 / 87
Gaussian mixtures: finding parameters
Criterion is maximizing likelihood (using MLE to find optimal parameters):
$$\sum_i \log f(x_i; \theta) \to \max_\theta$$
no analytic solution
we can use general-purpose optimization methods
In mixtures, parameters are split in two groups:
$\theta_1, \dots, \theta_C$ — parameters of the components
$\pi_1, \dots, \pi_C$ — contributions of the components
68 / 87
Expectation-Maximization algorithm [Dempster et al., 1977]
Idea: introduce a set of hidden variables $\pi_c(x)$
Expectation:
$$\pi_c(x) \leftarrow p(x \in c) = \frac{\pi_c f_c(x; \theta_c)}{\sum_{\tilde{c}} \pi_{\tilde{c}} f_{\tilde{c}}(x; \theta_{\tilde{c}})}$$
Maximization:
$$\pi_c \leftarrow \sum_i \pi_c(x_i), \qquad \theta_c \leftarrow \arg\max_\theta \sum_i \pi_c(x) \log f_c(x, \theta_c)$$
Maximization step is trivial for Gaussian distributions.
EM-algorithm is more stable and has good convergence properties.
69 / 87
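A sketch of fitting a gaussian mixture with the EM algorithm via scikit-learn's GaussianMixture (not from the slides); the two-component 1D toy sample is made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
data = np.concatenate([rng.normal(-2.0, 1.0, size=300),
                       rng.normal(3.0, 0.5, size=700)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)   # fitted with EM internally
gmm.fit(data)
print(gmm.weights_)          # the contributions pi_c
print(gmm.means_.ravel())    # the means mu_c of the components
```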
EM algorithm
70 / 87
EM algorithm
71 / 87
The classification model based on mixture density estimation is called MDA
(mixture discriminant analysis)
Generative approach
Generative approach: trying to reconstruct $p(x, y)$, then use the Bayes classification
formula to predict.
QDA, MDA are generative classifiers.
72 / 87
The classification model based on mixture density estimation is called MDA
(mixture discriminant analysis)
Generative approach
Generative approach: trying to reconstruct $p(x, y)$, then use the Bayes classification
formula to predict.
QDA, MDA are generative classifiers.
Problems of the generative approach
Real-life distributions can hardly be reconstructed
Especially in high-dimensional spaces
So, we switch to the discriminative approach: guessing $p(y|x)$ directly
73 / 87
Classification: truck vs car
74 / 87
If we can avoid density estimation, we'd better do it.
75 / 87
Linear decision rule
Decision function is linear:
$$d(x) = \langle w, x \rangle + w_0$$
$$\begin{cases} d(x) > 0 \;\to\; \hat{y} = +1 \\ d(x) < 0 \;\to\; \hat{y} = -1 \end{cases}$$
This is a parametric model (finding the parameters $w, w_0$).
QDA & MDA are parametric as well.
76 / 87
Finding Optimal Parameters
A good initial guess: find such $w, w_0$ that the error of classification is minimal:
$$\mathcal{L} = \sum_{i \in \text{events}} \mathbb{1}_{y_i \neq \hat{y}_i}, \qquad \hat{y}_i = \mathrm{sgn}(d(x_i))$$
Notation: $\mathbb{1}_{\text{true}} = 1$, $\mathbb{1}_{\text{false}} = 0$.
Discontinuous optimization (arrrrgh!)
77 / 87
Finding Optimal Parameters - 2
Discontinuous optimization
solution: let's make decision rule smooth
$$p_{+1}(x) = f(d(x)), \qquad p_{-1}(x) = 1 - p_{+1}(x)$$
$$\begin{cases} f(0) = 0.5 & \\ f(x) > 0.5 & \text{if } x > 0 \\ f(x) < 0.5 & \text{if } x < 0 \end{cases}$$
78 / 87
Logistic function
$$\sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}$$
Properties
1. monotonic, $\sigma(x) \in (0, 1)$
2. $\sigma(x) + \sigma(-x) = 1$
3. $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
4. $2\sigma(x) = 1 + \tanh(x/2)$
79 / 87
Logistic regression
Define the probabilities obtained with the logistic function:
$$d(x) = \langle w, x \rangle + w_0, \qquad p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$
and optimize the log-likelihood:
$$\mathcal{L} = -\sum_{i \in \text{events}} \ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$$
Important exercise: find an expression and build a plot for $L(x_i, y_i) = -\ln(p_{y_i}(x_i))$
80 / 87
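A sketch of logistic regression with scikit-learn (not from the slides), which minimizes the log-loss above (note that scikit-learn adds L2 regularization by default); the toy data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=1000) > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # w and w0 of the linear decision function d(x)
print(clf.predict_proba(X[:3]))     # sigma(-d(x)) and sigma(d(x)) for the first events
```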
Linear model for regression
How to use the linear function
$$d(x) = \langle w, x \rangle + w_0$$
for regression?
Simplification of notation:
$x_0 = 1$, $x = (1, x_1, \dots, x_d)$, so $d(x) = \langle w, x \rangle$.
81 / 87
Linear regression (ordinary least squares)
We can use the linear function for regression: $d(x_i) = y_i$, with $d(x) = \langle w, x \rangle$.
This is a linear system with $d + 1$ variables and $n$ equations.
Minimize OLS aka MSE (mean squared error):
$$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$$
Explicit solution:
$$\left(\sum_i x_i x_i^T\right) w = \sum_i y_i x_i$$
82 / 87
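A sketch (not from the slides) of solving the normal equations above directly with NumPy; the toy linear data and the true weight vector are made up.

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend x0 = 1 for the bias w0
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X.dot(w_true) + 0.1 * rng.normal(size=n)

# (sum_i x_i x_i^T) w = sum_i y_i x_i  <=>  (X^T X) w = X^T y
w = np.linalg.solve(X.T.dot(X), X.T.dot(y))
print(w)                                                     # close to w_true
```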
Linear regression
can use some other error,
but there is no explicit solution in other cases
demonstrates the properties of linear models:
reliable estimates when $n \gg d$
able to completely fit the data if $n = d$
undefined when $n < d$
83 / 87
Data Scientist Pipeline
Experiments in appropriate high-level language or environment
After experiments are over — implement final algorithm in low-level
language (C++, CUDA, FPGA)
Second point is not always needed
84 / 87
Scientific Python
NumPy
vectorized computations in python
Matplotlib
for drawing
Pandas
for data manipulation and analysis (based on NumPy)
85 / 87
Scientific Python
Scikit-learn
most popular library for machine learning
Scipy
libraries for science and engineering
Root_numpy
convenient way to work with ROOT files
86 / 87
87 / 87