MACHINE LEARNING IN HIGH
ENERGY PHYSICS
LECTURE #2
Alex Rogozhnikov, 2015
RECAPITULATION
classification, regression
kNN classifier and regressor
ROC curve, ROC AUC
Given knowledge about the distributions, we can build the optimal classifier
OPTIMAL BAYESIAN CLASSIFIER
$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)\, p(x \mid y = 1)}{p(y = 0)\, p(x \mid y = 0)}$
But distributions are complex, contain many parameters.
QDA
QDA follows the generative approach.
LOGISTIC REGRESSION
Decision function: $d(x) = \langle w, x \rangle + w_0$
Sharp rule: $\hat{y} = \operatorname{sgn} d(x)$
LOGISTIC REGRESSION
Smooth rule:
$d(x) = \langle w, x \rangle + w_0$
$p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$
Optimizing weights $w, w_0$ to maximize the log-likelihood:
$\mathcal{L} = -\frac{1}{N}\sum_{i \in \text{events}} \ln(p_{y_i}(x_i)) = \frac{1}{N}\sum_i L(x_i, y_i) \to \min$
LOGISTIC LOSS
Loss penalty for a single observation:
$L(x_i, y_i) = -\ln(p_{y_i}(x_i)) = \begin{cases} \ln(1 + e^{-d(x_i)}), & y_i = +1 \\ \ln(1 + e^{d(x_i)}), & y_i = -1 \end{cases}$
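A minimal NumPy sketch of the sigmoid and this per-event loss (the arrays of decision values and labels are made up purely for illustration):

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

def logistic_loss(d, y):
    # y is +1 or -1; L(x_i, y_i) = ln(1 + exp(-y_i * d(x_i)))
    return np.log1p(np.exp(-y * d))

d = np.array([2.0, -0.5, 0.1])   # decision function values d(x_i)
y = np.array([+1, -1, +1])       # labels
print(sigmoid(d))                # p_{+1}(x_i)
print(logistic_loss(d, y))       # per-event penalties
```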
GRADIENT DESCENT & STOCHASTIC
OPTIMIZATION
Problem: find $w$ to minimize $\mathcal{L}$
$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$
$\eta$ is the step size (also `shrinkage`, `learning rate`)
STOCHASTIC GRADIENT DESCENT
$\mathcal{L} = \frac{1}{N}\sum_i L(x_i, y_i) \to \min$
On each iteration make a step with respect to only one event:
1. take $i$ — a random event from the training data
2. $w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$
Each iteration is done much faster, but the training process is less stable.
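A minimal NumPy sketch of SGD for the logistic loss above, assuming the feature matrix X already contains a constant column for the intercept (all names and parameter values are illustrative):

```python
import numpy as np

def sgd_logistic(X, y, eta=0.1, n_iter=10000, seed=0):
    """Stochastic gradient descent for the logistic loss; labels y are +1 / -1."""
    rng = np.random.default_rng(seed)
    n_events, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_events)                     # 1. take i - a random event
        margin = y[i] * X[i].dot(w)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))   # dL(x_i, y_i)/dw for the logistic loss
        w -= eta * grad                                # 2. gradient step
    return w

# tiny illustration: separable toy data, intercept handled by a constant column
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
print(sgd_logistic(X, y))
```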
POLYNOMIAL DECISION RULE
$d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$
This is again a linear model; introduce new features
$z = \{1\} \cup \{x_i\}_i \cup \{x_i x_j\}_{ij}$
and reuse logistic regression:
$d(x) = \sum_i w_i z_i$
We can add $x_0 = 1$ as one more variable to the dataset and forget about the intercept:
$d(x) = w_0 + \sum_{i=1}^{N} w_i x_i = \sum_{i=0}^{N} w_i x_i$
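A minimal scikit-learn sketch of this idea: generate degree-2 features and reuse ordinary logistic regression (the toy dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# z = {x_i} U {x_i x_j}: degree-2 features, then an ordinary linear model
# (include_bias=False because LogisticRegression fits the intercept w_0 itself)
clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                    LogisticRegression())
clf.fit(X, y)
print(clf.score(X, y))
```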
PROJECTING INTO A HIGHER-DIMENSIONAL SPACE
SVM with polynomial kernel visualization
After adding new features, classes may become separable.
KERNEL TRICK
$P$ is a projection operator (which adds new features):
$d(x) = \langle w, P(x) \rangle$
Assume $w = \sum_i \alpha_i P(x_i)$ and look for the optimal $\alpha_i$:
$d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle = \sum_i \alpha_i K(x_i, x)$
We need only the kernel: $K(x, y) = \langle P(x), P(y) \rangle$
KERNEL TRICK
A popular kernel is the Gaussian Radial Basis Function:
$K(x, y) = e^{-c\,||x - y||^2}$
It corresponds to a projection into a Hilbert space.
Exercise: find a corresponding projection.
SUPPORT VECTOR MACHINE
SVM selects the decision rule with the maximal possible margin.
HINGE LOSS FUNCTION
SVM uses a different loss function (only losses for the signal class are compared):
SVM + RBF KERNEL
SVM + RBF KERNEL
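A minimal scikit-learn sketch of an SVM with an RBF kernel on a toy dataset (dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# gamma plays the role of c in K(x, y) = exp(-c ||x - y||^2)
clf = SVC(kernel='rbf', gamma=1.0, C=1.0)
clf.fit(X, y)
print(clf.score(X, y), len(clf.support_))   # training accuracy, number of support vectors
```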
OVERFITTING
kNN with k=1 gives ideal classification of the training data.
OVERFITTING
There are two definitions of overfitting, which often
coincide.
DIFFERENCE-OVERFITTING
There is a significant difference in the quality of predictions between train and test.
COMPLEXITY-OVERFITTING
The formula has too high complexity (e.g. too many parameters); increasing the number of parameters drives quality down.
MEASURING QUALITY
To get an unbiased estimate, one should test the formula on independent samples (and be sure that no information about the test sample was given to the algorithm during training).
In most cases, simply splitting data into train and holdout is
enough.
More approaches in seminar.
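A minimal scikit-learn sketch of a train/holdout split with quality (ROC AUC) measured on the holdout only (dataset and model are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# quality is measured on the holdout, which the algorithm never saw during training
print(roc_auc_score(y_test, clf.decision_function(X_test)))
```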
Difference-overfitting is inessential, provided that we measure quality on the holdout (though it is easy to check).
Complexity-overfitting is a problem: we need to test different parameters for optimality (more examples throughout the course).
Don't use distribution comparison to detect overfitting
REGULARIZATION
When the number of weights is high, overfitting is very probable.
Add a regularization term to the loss function:
$\mathcal{L} = \frac{1}{N}\sum_i L(x_i, y_i) + \mathcal{L}_{\text{reg}} \to \min$
$L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2$
$L_1$ regularization: $\mathcal{L}_{\text{reg}} = \beta \sum_j |w_j|$
$L_1 + L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
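A minimal scikit-learn sketch comparing L2- and L1-regularized logistic regression (in scikit-learn the regularization strength is set via C, the inverse of α or β; the dataset is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

# C is the inverse regularization strength (small C = strong regularization)
l2 = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear').fit(X, y)

print("non-zero weights, L2:", np.sum(l2.coef_ != 0))
print("non-zero weights, L1:", np.sum(l1.coef_ != 0))   # L1 zeroes out many weights
```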
$L_2$, $L_1$ — REGULARIZATIONS
(figure: L2 regularization; L1 (solid), L1 + L2 (dashed))
REGULARIZATIONS
$L_1$ regularization encourages sparsity.
$L_p$ REGULARIZATIONS
$L_p = \sum_i w_i^p$
What is the expression for $L_0$?
$L_0 = \sum_i [w_i \neq 0]$
But nobody uses it, even $L_p$, $0 < p < 1$. Why?
Because it is not convex.
LOGISTIC REGRESSION
classifier based on linear decision rule
training is reduced to convex optimization
other decision rules are achieved by adding new features
stochastic optimization is used
can handle > 1000 features, requires regularization
no interaction between features
[ARTIFICIAL] NEURAL NETWORKS
Based on our understanding of natural neural networks
neurons are organized in networks
receptors activate some neurons, those neurons activate other neurons, etc.
connection is via synapses
STRUCTURE OF ARTIFICIAL FEED-FORWARD NETWORK
ACTIVATION OF NEURON
Neuron states:
$n = \begin{cases} 1, & \text{activated} \\ 0, & \text{not activated} \end{cases}$
Let $n_i$ be the state of the $i$-th neuron and $w_i$ the weight of the connection between the $i$-th neuron and the output neuron:
$n = \begin{cases} 1, & \sum_i w_i n_i > 0 \\ 0, & \text{otherwise} \end{cases}$
Problem: find a set of weights that minimizes the error on the training dataset (discrete optimization).
SMOOTH ACTIVATIONS: ONE HIDDEN LAYER
$h_i = \sigma\left(\sum_j w_{ij} x_j\right)$
$y_i = \sigma\left(\sum_j v_{ij} h_j\right)$
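A minimal NumPy sketch of the forward pass of such a one-hidden-layer network (layer sizes and random weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, V):
    """Forward pass of a one-hidden-layer network."""
    h = sigmoid(W.dot(x))      # h_i = sigma(sum_j w_ij x_j)
    y = sigmoid(V.dot(h))      # y_i = sigma(sum_j v_ij h_j)
    return y

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))    # 3 inputs -> 5 hidden neurons
V = rng.normal(size=(2, 5))    # 5 hidden neurons -> 2 outputs
print(forward(np.array([0.1, -0.2, 0.7]), W, V))
```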
VISUALIZATION OF NN
NEURAL NETWORKS
Powerful general purpose algorithm for classification and
regression
Non-interpretable formula
Optimization problem is non-convex with local optima and has many parameters
Stochastic optimization speeds up the process and helps avoid getting caught in a local minimum.
Overfitting due to the large number of parameters
$L_1$, $L_2$ regularizations (and other tricks)
x MINUTES BREAK
DEEP LEARNING
The gradient diminishes as the number of hidden layers grows.
Usually 1-2 hidden layers are used.
But modern ANNs for image recognition have 7-15 layers.
CONVOLUTIONAL NEURAL NETWORK
DECISION TREES
Example: predict outside play based on weather conditions.
DECISION TREES: IDEA
DECISION TREES
DECISION TREES
DECISION TREE
fast & intuitive prediction
building optimal decision tree is
building tree from root using greedy optimization
each time we split one leaf, finding optimal feature and
threshold
need criterion to select best splitting (feature, threshold)
NP complete
SPLITTING CRITERIA
$\text{TotalImpurity} = \sum_{\text{leaf}} \text{impurity}(\text{leaf}) \times \text{size}(\text{leaf})$
Misclassification $= \min(p, 1 - p)$
Gini $= p(1 - p)$
Entropy $= -p \log p - (1 - p) \log(1 - p)$
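A minimal NumPy sketch of the three impurity functions, evaluated on a grid of signal fractions (purely illustrative):

```python
import numpy as np

def misclassification(p):
    return np.minimum(p, 1 - p)

def gini(p):
    return p * (1 - p)

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0) at the endpoints
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = np.linspace(0, 1, 5)               # fraction of signal in a leaf
for name, f in [("misclass.", misclassification), ("gini", gini), ("entropy", entropy)]:
    print(name, np.round(f(p), 3))
```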
SPLITTING CRITERIA
Why use Gini or entropy rather than misclassification?
REGRESSION TREE
Greedy optimization (minimizing MSE):
$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$
Can be rewritten as:
$\text{GlobalMSE} \sim \sum_{\text{leaf}} \text{MSE}(\text{leaf}) \times \text{size}(\text{leaf})$
MSE(leaf) is like the 'impurity' of a leaf:
$\text{MSE}(\text{leaf}) = \frac{1}{\text{size}(\text{leaf})} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$
In most cases, regression trees optimize MSE:
$\text{GlobalMSE} \sim \sum_i (y_i - \hat{y}_i)^2$
But other options also exist, e.g. MAE:
$\text{GlobalMAE} \sim \sum_i |y_i - \hat{y}_i|$
For MAE the optimal value in a leaf is the median, not the mean.
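A minimal scikit-learn sketch of regression trees minimizing MSE vs MAE (criterion names follow recent scikit-learn versions; the dataset is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# 'squared_error' minimizes MSE (leaf value = mean of the leaf),
# 'absolute_error' minimizes MAE (leaf value = median of the leaf)
mse_tree = DecisionTreeRegressor(criterion='squared_error', max_depth=4).fit(X, y)
mae_tree = DecisionTreeRegressor(criterion='absolute_error', max_depth=4).fit(X, y)
print(mse_tree.score(X, y), mae_tree.score(X, y))
```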
DECISION TREES INSTABILITY
A little variation in the training dataset produces a different classification rule.
PRE-STOPPING OF DECISION TREE
Tree keeps splitting until each event is correctly classified.
PRE-STOPPING
We can stop the process of splitting by imposing different
restrictions.
limit the depth of tree
set minimal number of samples needed to split the leaf
limit the minimal number of samples in leaf
more advanced: maximal number of leaves in tree
Any combination of the rules above is possible (see the sketch below).
(figure panels: no pre-pruning; max_depth; min # of samples in leaf; maximal number of leaves)
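A minimal scikit-learn sketch of these pre-stopping restrictions, mapping each rule to a DecisionTreeClassifier parameter (parameter values and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

clf = DecisionTreeClassifier(max_depth=5,           # limit the depth of the tree
                             min_samples_split=20,  # min samples needed to split a leaf
                             min_samples_leaf=10,   # min samples in a leaf
                             max_leaf_nodes=30)     # max number of leaves in the tree
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```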
POST-PRUNING
When the tree is already built, we can try to optimize it to simplify the formula.
Generally, much slower than pre-stopping.
SUMMARY OF DECISION TREE
1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclassification
But
1. Training an optimal tree is NP-complete
2. Trained greedily by optimizing Gini index or entropy (fast!)
3. Unstable
4. Uses only trivial conditions
MISSING VALUES IN DECISION TREES
If the event being predicted lacks $x_1$, we use prior probabilities.
FEATURE IMPORTANCES
Different approaches exist to measure the importance of a feature in the final model
Importance of a feature ≠ quality provided by that feature alone
FEATURE IMPORTANCES
tree: counting number of splits made over this feature
tree: counting gain in purity (e.g. Gini)
fast and adequate
common recipe: train without one feature,
compare quality on test with/without one feature
requires many evaluations
common recipe: feature shuffling
take one column in the test dataset and shuffle it. Compare quality with/without shuffling (see the sketch below).
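A minimal sketch of the shuffling recipe with scikit-learn and NumPy (model, dataset and scoring choice are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=5).fit(X_tr, y_tr)

base = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
rng = np.random.default_rng(0)
for j in range(X_te.shape[1]):
    X_shuffled = X_te.copy()
    rng.shuffle(X_shuffled[:, j])   # shuffle one column of the test set
    drop = base - roc_auc_score(y_te, clf.predict_proba(X_shuffled)[:, 1])
    print(f"feature {j}: quality drop {drop:.3f}")
```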
THE END
Tomorrow: ensembles and boosting