ML Algorithms
Part 2 Draft
Do you really understand more...
Outline
Learning Theory: Model, Evaluation Metrics
Algorithms: NaĂŻve Bayesian, Clustering Methods (K-means)
Ensemble Learning, EM Algorithm
Restricted Boltzmann Machines
Neural Networks: BP, Word2Vec, GloVe, CNN, RNN
Other: Singular Value Decomposition, Matrix Factorization
Discriminative & Generative Model
Decision Function Y=f(X) or conditional probability P(Y|X)
Generative Model: learn P(X,Y), calculate posterior prob dist P(Y|X):
HMM, Bayesian Classifier
Discriminative Model: learn P(Y|X) directly
DT, NN, SVM, Boosting, CRF
Metrics - Classification
Accuracy
Precision & Recall (generally binary classification)
TP(Pos->Pos) FP(Neg->Pos)
FN(Pos->Neg) TN(Neg->Neg)
F1 score
When both P & R are high, F1 is high.
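A minimal sketch of these metrics computed from the confusion-matrix counts (the toy 0/1 labels below are made up):

import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # Pos -> Pos
fp = np.sum((y_true == 0) & (y_pred == 1))   # Neg -> Pos
fn = np.sum((y_true == 1) & (y_pred == 0))   # Pos -> Neg
tn = np.sum((y_true == 0) & (y_pred == 0))   # Neg -> Neg

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean: high only when P and R are both high
print(accuracy, precision, recall, f1)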
Metrics - Regression
Mean Absolute Error & Mean Squared Error
R-squared (coefficient of determination):
A higher value means a better model.
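A small sketch of MAE, MSE and R-squared on made-up numbers:

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))          # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)           # Mean Squared Error
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # closer to 1 = better fit
print(mae, mse, r2)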
Dataset ML Strategy
From Andrew Ng, NIPS, 2016
Bias-variance Trade-off =
Avoidable Bias + Variance
More Data; Regularization; New Model
Naive Bayesian Classifier
Two Assumptions
Bayes’ Theorem
Features are conditionally independent
How does it work?
Learn the joint probability distribution p(x,y):
learn the prior p(y) and the conditional p(x|y)
Calculate the posterior p(y|x)
NaĂŻve Bayesian: Notations
Dataset:
Prior probability: K classes
Conditional probability:
Then learn Joint: P(X,Y) = P(Y) P(X|Y) = P(X)P(Y|X)
Calculate Posterior: P(Y|X)
NaĂŻve Bayesian: Attribute Conditional Independence Assumption
Each input x is a vector with n elements.
Given the class y, the features are conditionally independent.
(Graph: class node Y with children x1, x2, ...)
NaĂŻve Bayesian: Calculation
https://en.wikipedia.org/wiki/Bayes'_theorem
NaĂŻve Bayesian: Calculation
So we can rewrite this part:
NaĂŻve Bayesian: Definition
Want the class that has the max of the probabilities:
For every class, the denominator is the same, so we can simplify:
NaĂŻve Bayesian: Parameter Estimation
Maximum Likelihood Estimation
Goal: estimate the prior P(Y=c) for each class
estimate P(X=x | Y=c) for each class
P(x|y) = Count(x,y) / Count(y)
NaĂŻve Bayesian: Cost
Choose 0-1 cost function:
Right answer = 0
Wrong answer = 1
-- L gives “how big” the mistake is... we want to minimize it!
Add up the cases we get wrong
= 1 − (the right cases)
Minimizing the errors ⇔ maximizing the right ones
NaĂŻve Bayesian: Calculation
Calculate a prob with an input x:
Decide the class of the input x:
A demo from the book.
NaĂŻve Bayesian: Bayesian Estimation
With MLE, some counts can be 0. Then, when you multiply the probabilities together... everything becomes 0.
Add one more term, lambda >= 0. If lambda = 0, it is MLE; if lambda = 1, it is
Laplace smoothing.
With Laplace smoothing, the estimate is still a valid probability distribution.
Bayesian estimation of the prior probability:
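A minimal sketch of these counting-based estimates with the smoothing term (the toy weather data, attribute values and lambda = 1 below are made up for illustration):

import numpy as np

# Toy dataset: each row is a vector of categorical attributes, plus a class label.
X = [("sunny", "hot"), ("sunny", "cool"), ("rainy", "cool"), ("rainy", "hot"), ("sunny", "hot")]
y = ["no", "yes", "yes", "no", "no"]
lam = 1.0  # lambda = 0 -> MLE, lambda = 1 -> Laplace smoothing

classes = sorted(set(y))

def prior(c):
    # Bayesian estimate of the prior: (count(c) + lambda) / (N + K * lambda)
    return (y.count(c) + lam) / (len(y) + len(classes) * lam)

def likelihood(j, value, c):
    # P(x_j = value | y = c), smoothed over the S_j possible values of attribute j
    values_j = sorted(set(row[j] for row in X))
    num = sum(1 for row, label in zip(X, y) if label == c and row[j] == value) + lam
    den = y.count(c) + len(values_j) * lam
    return num / den

def predict(x):
    # argmax over classes of P(y=c) * prod_j P(x_j | y=c)
    scores = {c: prior(c) * np.prod([likelihood(j, v, c) for j, v in enumerate(x)]) for c in classes}
    return max(scores, key=scores.get)

print(predict(("rainy", "cool")))   # -> "yes" on this toy data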
Bayesian: More Models
NaĂŻve Bayesian: attribute conditional independence assumption
Semi-Naive Bayesian: One-Dependent Estimator
(Graphs: Naive Bayesian: class node Y with children x1, x2, ..., xn;
Super-Parent One-Dependent Estimator (SPODE): Y plus a super-parent attribute as parents of x1, x2, x3, ..., xn)
Takeaway (1): Probabilistic Graphical Models
(Graph: Naive Bayesian: class node Y with children x1, x2, ..., xn; factors P(Y), P(x1|Y), ..., P(xn|Y))
P(y)P(x1|y)P(x2|y)...P(xn|y)
Joint Prob! What we want!
Find out more: https://www.coursera.org/learn/probabilistic-graphical-models Prof. Daphne Koller
Takeaway (2): Bayesian Network
Also named Belief Network
DAG: Directed Acyclic Graph
CPT: Conditional Probability Table
(Graph: nodes A and B with children C, D, E; CPTs P(A), P(B), P(C|A), P(D|A,B), P(E|B))
P(A)P(B)P(C|A)P(D|A,B)P(E|B)
Joint Prob! What we want!
Given A, C and D are conditionally independent: C⊄D|A
Clustering: Unsupervised Learning
Similarity Measures: Euclidean Distance, etc.
In-group: high similarity
Out-group: high distance
Methods
Prototype-based Clustering: K-means, LVQ, MoG
Density-based Clustering: DBSCAN
Hierarchical Clustering: AGNES, DIANA
Prototype-based Clustering: k-means
Dataset D, Clusters C
Error (for each point within each cluster): sum of squared distances to the cluster centroid
K-means: Algorithm
(Input: whole dataset points x1,...xm)
Initialization: Randomly place centroids c1..ck
Repeat until convergence (stop when no point changes cluster):
- for each data point xi:
find nearest centroid, set the point to the centroid cluster
- for each cluster:
calculate new centroid (means of all new points)
https://www.youtube.com/watch?v=_aWzGGNrcic
argmin D (xi,cj) for all cj
O(#iter * #clusters * #instances * #dimensions)
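A compact sketch of the algorithm above in numpy (the 2-D toy data, k = 2 and the random seeds are arbitrary choices; empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # randomly place centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid (argmin_j D(xi, cj))
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(0, 1, (50, 2)) + 5])  # two blobs
centroids, labels = kmeans(X, k=2)
print(centroids)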
K-means: A Quick Demo
Two clusters, squares are the centroids.
Calculate new centroids,
finish one iteration.
K-means
k-modes: categorical attributes, frequency-based
k-prototype: numerical + categorical attributes
From University College Dublin, Prof. Tahar.
Clustering: Other Methods(1)
Prototype-based Clustering
LVQ: find a set of prototype vectors to describe the clusters, using label information
MoG: describe the clusters with a probabilistic (Gaussian mixture) model.
Density-based Clustering
DBSCAN
Clustering: Other Methods(2)
Hierarchical Clustering: AGNES, DIANA
Ensemble Learning
Strong learning method: learns a model with good performance;
Weak learning method: learns a model only slightly better than random guessing.
Ensemble learning methods:
Iterative method: Boosting
Parallel method: Bagging
Boosting: Algorithm
Boosting: combine/ensemble weak learners to become a strong one.
Through majority voting on classification; average on regression.
Algo:
1) Feed N data points to train a weak learner/model: h1.
2) Feed N data points to train another one: h2; these N pieces of data include examples h1 got wrong and new
data never trained on before.
3) Repeat to train hn; the N points include previous errors and new data.
4) Get final model: h_final=Majority Vote(h1,h2,..,hn)
Bagging: Bootstrap AGGregatING
Bootstrap sampling: sample from the n examples with replacement each time, so the samples
overlap. Approximately 36.8% of the examples would not be selected. If we sample m times:
Training: done in parallel.
Ensemble: voting on classification; average on regression.
Random Forest: a variant of Bagging.
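A quick sketch checking the 36.8% figure: sampling n items with replacement leaves roughly (1 − 1/n)^n ≈ 1/e of them unselected (n, the number of trials and the seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 100
out_of_bag = []
for _ in range(trials):
    sample = rng.integers(0, n, size=n)            # bootstrap: n draws with replacement
    out_of_bag.append(1 - len(np.unique(sample)) / n)
print(np.mean(out_of_bag))                         # close to 1/e = 0.368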
Random Forest: Random Attribute Selection
Bagging: select a number of data to train each DT.
Random Attribute Selection
Each tree, each node, select k attributes, then find out the optimum.
Empirical k (typically k = log2 d),
where d is the total number of attributes.
Expectation Maximization: Motivation
Expectation Maximization:
- An iterative method to maximize the likelihood function.
- Often used to estimate parameters in statistical models with latent variables.
Example:
Three coins: A, B and C. At each step, first throw coin A. If it comes up heads,
throw B, otherwise throw C. Repeat this n times independently.
The heads probabilities of A, B and C are π, p and q.
If we can only observe the results, not the process, how can we estimate the
heads probabilities of those coins?
Expectation Maximization
Suppose we are at step i: we observe the result y (1 or 0), but we do not
know the hidden coin-A outcome z. Let Ξ = (π, p, q) be the parameters of the model. Then the
overall probability of the observation is:
according to Bayes' rule:
Expectation Maximization
Extend to all steps: denote the overall observed results as
and the non-observed values as
The likelihood of the observations over all steps is:
Now, use the EM algorithm to estimate the parameters.
Expectation Maximization: Expectation(1)
Expectation function using the initial parameters:
Denote the parameters at iteration i as:
The expectation function at iteration i is:
Expectation Maximization: Expectation(2)
Calculate
then,
the next goal is to maximize the function Q.
Expectation Maximization: Maximization
To find the maximum of the function Q, we take its partial derivative
with respect to each parameter.
Expectation Maximization
Convergence condition: Ï”1 and Ï”2 are small positive values
or
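A minimal sketch of the E and M steps for the three-coin model above (the observation sequence and the starting values of π, p, q are made up; mu_i denotes the responsibility that coin A came up heads at step i):

import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])   # observed heads(1)/tails(0) results
pi, p, q = 0.5, 0.6, 0.5                        # initial guesses for the parameters

for _ in range(50):
    # E-step: mu_i = P(coin A was heads | y_i, current parameters)
    num = pi * p**y * (1 - p)**(1 - y)
    den = num + (1 - pi) * q**y * (1 - q)**(1 - y)
    mu = num / den
    # M-step: re-estimate the parameters from the responsibilities
    pi = mu.mean()
    p = (mu * y).sum() / mu.sum()
    q = ((1 - mu) * y).sum() / (1 - mu).sum()
print(pi, p, q)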
Restricted Boltzmann Machine
Basic concepts:
h: status vector of hidden units (0/1)
v: status vector of visible units (0/1)
b: bias vector of hidden units
a: bias vector of visible units
W: weight matrix between the
hidden units and visible units
Fundamental theory of RBM:
energy between the visible units v and hidden units h:
probability distribution through the energy function:
Z is the partition function (from physics); here it is just a normalizing constant.
Restricted Boltzmann Machine
Suppose the activation probability of the hidden units
and of the visible units
Input(x) ⇒ v:
- calculate the activation probability of every hidden unit given v: p(h1 | v1)
- using Gibbs sampling, draw a sample to represent the whole hidden layer: h1 ~ p(h1 | v1)
- calculate the activation probability of every visible unit given h1: p(v2 | h1)
- using Gibbs sampling, draw a sample to represent the whole visible layer: v2 ~ p(v2 | h1)
- calculate the activation probability of every hidden unit given v2: p(h2 | v2)
- then update:
RBM: Contrastive Divergence (train)
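A rough numpy sketch of one CD-1 update following the steps above (layer sizes, learning rate, the input vector and the initialization are placeholders):

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(0, 0.1, size=(n_visible, n_hidden))   # weights
a = np.zeros(n_visible)                               # visible biases
b = np.zeros(n_hidden)                                # hidden biases
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

v1 = np.array([1, 0, 1, 1, 0, 0], dtype=float)        # input x => v
ph1 = sigmoid(v1 @ W + b)                              # p(h1 | v1)
h1 = (rng.random(n_hidden) < ph1).astype(float)        # Gibbs sample h1 ~ p(h1 | v1)
pv2 = sigmoid(h1 @ W.T + a)                            # p(v2 | h1)
v2 = (rng.random(n_visible) < pv2).astype(float)       # Gibbs sample v2 ~ p(v2 | h1)
ph2 = sigmoid(v2 @ W + b)                              # p(h2 | v2)

# CD-1 update: positive phase minus negative phase
W += lr * (np.outer(v1, ph1) - np.outer(v2, ph2))
a += lr * (v1 - v2)
b += lr * (ph1 - ph2)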
Deriving the activation probability (taking the hidden units as an example)
- For a hidden unit k:
- then, introduce the two equations below:
- It is easy to see:
- where the two terms are the energy parts with j equal to k and j not equal to k
Deriving the activation probability
In contrastive divergence, when calculating a hidden unit, the visible units are already known.
Meanwhile, the other hidden units are assumed to be known as well.
- First, use Bayes' rule
- Because the other hidden units' states are either 0 or 1:
- combine with the probability distribution function
Deriving the activation probability
(previous step):
- with the transformed energy function:
- By the same reasoning, we get the activation probability of the visible units:
Back Propagation: Main Purpose
The main purpose of back propagation is to find the partial derivatives ∂C/∂w, where
w means all the parameters in the network, including weights and biases.
Cost function:
where n is the sample number and a(x) is the output of the network
Note:
It is useful to first point out the naive forward propagation algorithm implied by the
chain rule. Then we can see the advantages of back propagation by simply
comparing the two algorithms.
Naive Forward Propagation
The naive forward propagation algorithm uses the chain rule to compute the partial
derivatives at every node of the network in a forward manner.
How it works:
1. Compute ∂ui/∂uj for every pair of nodes where ui
is at a higher level than uj.
2. In order to compute ∂ui/∂uj, we need to compute the derivative with
respect to every input of ui. Thus the total work
in the algorithm is O(VE).
(V = number of nodes, E = number of edges)
Back Propagation -Calculate Error
If we try to change the value of
(z is the input of the node), it will affect the
result of the next layer and finally affect the
output.
Assume it is close to 0; then changing the
value of z will not help us minimize the cost C. In this case we can say that the
node is close to optimal.
Naturally we can define error as:
Back Propagation -Calculate Error
Using the chain rule, we can deduce
Replacing the partial derivatives with vectors:
∇aC is a vector with elements equal to ∂C/∂aLj
⊙ is the Hadamard (element-wise) product
Back Propagation -Back Propagation
From layer l+1 to layer l, we try to use the error of layer l+1 to represent the error of layer l
(1)
where
(2)
Combining (1) and (2):
Back Propagation -From error to parameters
After calculating the error, we finally need one more step:
Use the error to calculate the derivatives with respect to the parameters (weights and biases)
Given the equation:
For bias b:
For weight w:
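A small sketch of these equations for a two-layer sigmoid network with quadratic cost (shapes, data and initialization are placeholders, not the slide's notation):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
x = rng.random((4, 1)); y = np.array([[1.0], [0.0]])   # one training sample
W1, b1 = rng.normal(size=(3, 4)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(2, 3)), np.zeros((2, 1))

# forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# backward pass (quadratic cost C = 1/2 * ||a2 - y||^2)
delta2 = (a2 - y) * sigmoid_prime(z2)            # output-layer error: grad_a C (*) sigma'(z)
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)     # propagate the error from layer l+1 to layer l

dC_db2, dC_dW2 = delta2, delta2 @ a1.T           # dC/db = delta, dC/dW = delta * a_prev^T
dC_db1, dC_dW1 = delta1, delta1 @ x.T
print(dC_dW1.shape, dC_dW2.shape)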
Convolutional Neural Network -Convolution Layer
Convolutional Neural Network -Pooling Layer
The main idea of the “Max Pooling Layer” is to capture the most important activation
(maximum over time):
e.g.
Q: This operation shrinks the number of features (from n-h+1 to 1); how can we get more features?
A: Applying multiple filters with different window sizes and different weights.
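A tiny sketch of max pooling over time with several filter widths (the feature maps here are random placeholders):

import numpy as np

rng = np.random.default_rng(0)
n = 10                                            # sentence length
feature_maps = {                                  # one 1-D feature map per filter, length n - h + 1
    "filter_h3": rng.normal(size=n - 3 + 1),
    "filter_h4": rng.normal(size=n - 4 + 1),
    "filter_h5": rng.normal(size=n - 5 + 1),
}
# max pooling over time keeps one value (the strongest activation) per feature map
pooled = np.array([c.max() for c in feature_maps.values()])
print(pooled.shape)   # (3,): one feature per filter, so more filters => more features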
Convolutional Neural Network -Multi-channel
Start with 2 copies of the pre-trained word vectors (word2vec
or GloVe). One copy is fine-tuned during training, which changes its vector values; keep the other one
static.
Apply the same filter to both channels, then sum their Ci before max pooling.
Convolutional Neural Network -Dropout
Create a masking vector r of random 0/1 variables to drop some of the
features and prevent co-adaptation (overfitting).
Kim (2014) reports 2 – 4% improved accuracy and ability to use very large networks without overfitting
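A quick sketch of the masking vector r (the keep probability of 0.5 and the test-time scaling convention are assumptions):

import numpy as np

rng = np.random.default_rng(0)
z = rng.random(8)                                     # penultimate feature vector
keep_prob = 0.5                                       # assumed keep probability
r = (rng.random(z.shape) < keep_prob).astype(float)   # random 0/1 masking vector
z_train = z * r                                       # training: some features are dropped
z_test = z * keep_prob                                # test: scale instead of dropping (a common convention)
print(r, z_train)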
Word Vectors- Taxonomy
Idea: use a big “graph” with a tree structure to define the
relationships between words.
Famous example: WordNet
Disadvantages:
Difficult to maintain (when new words come in).
Requires human labour.
Hard to compute word similarity.
Word Vectors-One Hot Representation
Word Vectors-Window Based Co-occurrence Matrix
Problems:
Space consuming.
Difficult to update.
Solution:
Reduce the dimension (Singular Value Decomposition).
Word Vectors-Word2vec
Word2vec: Predicts neighbor words.
Previous approach: capturing co-occurrence statistics.
Advantage:
Faster and can easily incorporate a new sentence/document or add a
word to the vocabulary.
Good representation. (Solve analogy by vector subtraction)
Word Vectors-Word2vec
Error function:
(where m is the window size and Ξ denotes all the parameters we want to optimize.)
Word Vectors-Word2vec
Problem:
With large vocabularies this objective function is not scalable and
would train too slowly (use GloVe instead).
Solution:
Negative sampling.
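A sketch of the negative-sampling objective for a single (center, context) pair (the vector dimension d, the number of negatives k and the random vectors are arbitrary):

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)
d, k = 50, 5
v_c   = rng.normal(size=d)        # center word vector
u_o   = rng.normal(size=d)        # observed (true) context word vector
u_neg = rng.normal(size=(k, d))   # k sampled negative context word vectors

# maximize log sigma(u_o . v_c) + sum_k log sigma(-u_k . v_c), i.e. minimize its negative
loss = -(np.log(sigmoid(u_o @ v_c)) + np.sum(np.log(sigmoid(-u_neg @ v_c))))
print(loss)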
Word Vectors-Word2vec
Word Vectors-Skip Gram Model
A big window size may hurt the syntactic accuracy.
Word Vectors-Continuous Bag of Words
Unlike the skip-gram model, this model tries to predict the center word from the
surrounding words.
The result will be slightly different from the skip-gram model; by averaging them we
can get a better result.
Word Vectors-Comparison
Word Vectors-Glove
Collect the co-occurrence statistics from the whole corpus instead
of going over one window at a time.
Then optimize the vectors using the following equation,
where Pij is the co-occurrence count from the matrix.
Fast training (does not have to go over every window), scalable to huge corpora,
good performance even with a small corpus.
Word Vectors-Glove Overview
Word-word Co-occurrence Matrix X: V × V (V is the vocab size)
Xij: the number of times word j occurs in the context of word i.
Xi = ÎŁk Xik: the number of times any word appears in the
context of word i.
Pij = P(j|i) = Xij/Xi: probability that word j appears in the context of word i.
P(solid | ice) = probability that the word “solid” appears in the context of the word “ice”.
GloVe: Global Vectors for Word Representation
Symmetric
Word Vectors-GloVe
Ratios of co-occurrence: define a function:
Linearity:
Vector:
Word Vectors-GloVe
Add a bias: log(Xi) is independent of k, so add it as bi
Squared error:
Word Vectors-GloVe: f(x) as weights for words
f(0)=0
monotonically increasing
for large x, f(x) should not be too large
Usually,
xmax = 100, α = 3/4
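A direct transcription of this weighting function with those default values:

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(0) = 0, monotonically increasing, capped at 1 for frequent co-occurrences
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

print(glove_weight([0, 1, 10, 100, 1000]))   # ≈ [0, 0.032, 0.178, 1, 1]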
Word Vectors-Evaluation Good
Recurrent Neural Network-Forward prop
Recurrent Neural Network-Forward prop
Recurrent Neural Network-Loss Function
The cross-entropy in one single time step:
Overall cross-entropy cost:
Where V is the vocabulary and T means the text.
Example:
yt = [0.3, 0.6, 0.1] over the words (the, a, movie)
y1 = [0.001, 0.009, 0.9]
y2 = [0.001, 0.299, 0.7]
y3 = [0.001, 0.9, 0.009]
J1 = yt · log(y1)
J2 = yt · log(y2)
J3 = yt · log(y3)
J3 > J2 > J1 because y3 is closer to
yt than the others (so its cross-entropy −J3 is the smallest).
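A small check of the example, using the standard cross-entropy J = −Σ yt·log(y), so the prediction closest to yt gets the smallest loss:

import numpy as np

yt = np.array([0.3, 0.6, 0.1])                      # target distribution over (the, a, movie)
preds = {"y1": np.array([0.001, 0.009, 0.9]),
         "y2": np.array([0.001, 0.299, 0.7]),
         "y3": np.array([0.001, 0.9, 0.009])}

for name, y in preds.items():
    ce = -np.sum(yt * np.log(y))                    # cross-entropy at one time step
    print(name, round(ce, 3))                       # y3 gives the smallest cross-entropy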
Recurrent Neural Network-Back Propagation
Recurrent Neural Network-Back Propagation
eg:
Recurrent Neural Network-Back Propagation
Recurrent Neural Network- Gradient Vanishing
Recurrent Neural Network-Clip gradient
SVD: Singular Value Decomposition
Starts from Matrix Multiplication: Y = A*B
Check codes and plots here.
Want to find the direction & extent:
SVD: Singular Value Decomposition
Eigenvalues & eigenvectors of A:
(There can be more than one pair.)
Import numpy to calculate. =>
But this only works for a square matrix!
SVD: Singular Value Decomposition
If A has n e-values:
Then:
So, AQ = QΣ, i.e. A = Q Σ Q⁻Âč
(Q: the matrix of eigenvectors, Σ: the diagonal matrix of eigenvalues)
SVD: Singular Value Decomposition
What about a non-square matrix A (m × n)?
Similar idea:
But how do we get the eigenvalues and eigenvectors?
SVD: Singular Value Decomposition
Find a square matrix!
Calculate
then:
let r << # of eigenvalues to represent A
O(n^3)
SVD: Application
Compression, with loss!
Reduce the parameter size!
If m = 1,000 and n = 1,000,
let r = 10:
10^6 reduced to 20,100 parameters (1,000×10 + 10×10 + 10×1,000)
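A numpy sketch of the rank-r approximation and the parameter count from this slide (the matrix here is random, just for illustration):

import numpy as np

m, n, r = 1000, 1000, 10
A = np.random.randn(m, n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U diag(s) Vt

A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]         # best rank-r approximation (lossy)
params_full = m * n                                  # 10^6
params_r = m * r + r * r + r * n                     # 1000*10 + 10*10 + 10*1000 = 20,100
print(params_full, params_r, np.linalg.norm(A - A_r) / np.linalg.norm(A))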
MF: Matrix Factorization in RecSys
Different from SVD: in MF, we let Y = U V, dividing Y into two matrices. We will
walk through it with a recommender system.
http://www.dataperspective.info/2014/05/basic-recommendation-engine-using-r.html
(Figure: ratings matrix, Users × Movies)
MF: Matrix Factorization in RecSys
Rating (user i, movie j) = User i Vector · Movie j Vector
So how to find the User Matrix and Movie Matrix?
(Figure: Ratings matrix (Users × Movies) = User Matrix × Movie Matrix;
entry Rij corresponds to user i and movie j)
MF: Matrix Factorization in RecSys
Predict rating of user i, movie j:
Loss function (error and L2 regularization):
So we want to minimize L and get U and M, using SGD:
argmin L
Once U and M are computed, we can predict any ratings!
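A compact SGD sketch of this loss (the latent dimension k, learning rate, regularization strength and toy ratings matrix are arbitrary choices):

import numpy as np

R = np.array([[5, 3, 0, 1],      # observed ratings, 0 = missing
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)
n_users, n_movies, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (n_users, k))      # user matrix
M = rng.normal(0, 0.1, (n_movies, k))     # movie matrix
lr, lam = 0.01, 0.02

for _ in range(5000):
    for i, j in zip(*R.nonzero()):        # only observed ratings contribute to the loss
        err = R[i, j] - U[i] @ M[j]       # prediction error r_ij - u_i . m_j
        Ui = U[i].copy()
        U[i] += lr * (err * M[j] - lam * U[i])   # SGD step with L2 regularization
        M[j] += lr * (err * Ui - lam * M[j])

print(np.round(U @ M.T, 2))               # predicted ratings, including the missing entries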
Takeaway(1): Recognize Algorithms!
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
Takeaway(2): Recognize Algorithms- KNN
Useful Links
Stanford CS224d: Deep Learning for Natural Language Processing
Stanford CS231n: Convolutional Neural Networks for Visual Recognition
SVM Notes :http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf
Nuts and Bolts of Applying Deep Learning (Andrew Ng) - YouTube


Editor's notes

  1. http://s1.daumcdn.net/editor/fp/service_nc/pencil/Pencil_chromestore.html
  2. TODO: reorganize the order and categorization
  3. ćć·ź-æ–čć·źæƒèĄĄ
  4. Li Hang, "Statistical Learning Methods"
  5. æ čæźć·ČçŸ„ć˜é‡ïŒŒæŽšæ–­ć‡șæœȘçŸ„ć˜é‡çš„ćˆ†ćžƒçš„èż‡çš‹ïŒŒInference。 Hidden Markov ModelïŒšćŠšæ€çš„BN
  6. LVQ: learning vector quantization; MoG: mixture of Gaussians; DBSCAN: Density-Based Spatial Clustering of Applications with Noise; AGNES: AGglomerative NESting; DIANA: DIvisive ANAlysis
  7. LVQ: learning vector quantization; MoG: mixture of Gaussians; DBSCAN: Density-Based Spatial Clustering of Applications with Noise; AGNES: AGglomerative NESting; DIANA: DIvisive ANAlysis
  8. Contrastive divergence: x = [ [0 1 0 0 0] [0 0 0 0 0] [0 0 1 0 0] ]; after loop 100: a, b, W; query(u1, i2) => a2
  9. Basic assumption: If u is a node at level t+1 and z is any node at level ≀t whose output is an input to u, then computing ∂u/∂z takes unit time on our computer.
  10. https://zhuanlan.zhihu.com/p/21242347
  11. \frac{ \partial h9}{ \partial h6} = \frac{ \partial h7}{ \partial h6}*\frac{ \partial h8}{ \partial h7}*\frac{ \partial h9}{ \partial h8}
  12. This could happen when t is a large number(layer of the neural network)
  13. http://xiahouzuoxin.github.io/notes/html/%E7%9F%A9%E9%98%B5%E7%89%B9%E5%BE%81%E5%80%BC%E5%88%86%E8%A7%A3%E4%B8%8E%E5%A5%87%E5%BC%82%E5%80%BC%E5%88%86%E8%A7%A3%E5%90%AB%E4%B9%89%E8%A7%A3%E6%9E%90%E5%8F%8A%E5%BA%94%E7%94%A8.html
  14. \widehat { r_{ i,j } } =m_{ j }^{ T }\cdot u_{ i }\\ \sum_{i,j} { ( r_{ i,j } -m_{ j }^{ T }\cdot u_{ i } )^2 + \lambda(||u_i||+||m_j|| )^2 } \\