SIMS 290-2:
Applied Natural Language Processing

Barbara Rosario
October 4, 2004

1
Today
Algorithms for Classification
Binary classification

Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods

Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor

2
Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent vs.
not urgent)
Information retrieval (relevant, not relevant)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multi-way
problem as a binary one: one class versus all the
others, for each class (as sketched below)
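
A minimal sketch of that one-vs-rest reduction (illustrative only; the helper train_binary is hypothetical, standing in for any of the binary learners discussed below):

```python
def train_one_vs_rest(X, labels, classes, train_binary):
    """train_binary(X, y) is assumed to return a scoring function score(x) -> float."""
    scorers = {}
    for c in classes:
        y = [1 if label == c else -1 for label in labels]   # this class vs. all the others
        scorers[c] = train_binary(X, y)
    return scorers

def predict_one_vs_rest(x, scorers):
    # pick the class whose binary scorer is most confident
    return max(scorers, key=lambda c: scorers[c](x))
```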

3
Binary Classification
Given: some data items that belong to a positive (+1) or a negative (-1) class
Task: Train the classifier and predict the class for a
new data item
Geometrically: find a separator

4
Linear versus Non Linear
algorithms
Linearly separable data: if all the data points can
be correctly classified by a linear (hyperplanar)
decision boundary

5
Linearly separable data

Linear Decision boundary

Class1
Class2
6
Non linearly separable data

Class1
Class2
7
Non linearly separable data

Non Linear Classifier

Class1
Class2
8
Linear versus Non Linear
algorithms
Linear or Non linear separable data?
We can find out only empirically

Linear algorithms (algorithms that find a linear decision
boundary)
When we think the data is linearly separable
Advantages
– Simpler, fewer parameters

Disadvantages
– High-dimensional data (as in NLP) is usually not linearly separable

Examples: Perceptron, Winnow, SVM
Note: we can use linear algorithms also for non linear problems
(see Kernel methods)

9
Linear versus Non Linear
algorithms
Non Linear
When the data is non linearly separable
Advantages
– More accurate

Disadvantages
– More complicated, more parameters

Example: Kernel methods

Note: the distinction between linear and non linear
applies also for multi-class classification (we’ll see
this later)
10
Simple linear algorithms
Perceptron and Winnow algorithms
Linear
Binary classification
Online (process data sequentially, one data point at a
time)
Mistake driven
Simple single layer Neural Networks

11
Linear binary classification
Data: {(xi, yi)} i = 1...n
x in R^d (x is a vector in d-dimensional space) → feature vector
y in {-1, +1} → label (class, category)

Question:
Design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such
that the classification rule associated with it has minimal probability of error
classification rule:
– y = sign(wx + b), which means:
– if wx + b > 0 then y = +1
– if wx + b < 0 then y = -1

From Gert Lanckriet, Statistical Learning Theory Tutorial

12
Linear binary classification
Find a good hyperplane (w, b) in R^(d+1)
that correctly classifies the data
points as well as possible
In online fashion: one data
point at a time, update the
weights as necessary

wx + b = 0
Classification Rule:
y = sign(wx + b)

From Gert Lanckriet, Statistical Learning Theory Tutorial

13
Perceptron algorithm
Initialize: w1 = 0
Updating rule: for each data point xi
If class(xi) != decision(xi, w)
then wk+1 ← wk + yi xi
     k ← k + 1
else wk+1 ← wk

Function decision(x, w)
If wx + b > 0 return +1
Else return -1

(Figure: the current boundary wk x + b = 0 and the updated boundary wk+1 x + b = 0, with the +1 and -1 regions.)

From Gert Lanckriet, Statistical Learning Theory Tutorial
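
A minimal Python sketch of this mistake-driven update (an illustration, not code from the slides; the bias b is folded in as an extra constant feature):

```python
def perceptron_train(data, n_features, epochs=10):
    """data: list of (x, y) pairs, x a list of n_features floats, y in {-1, +1}."""
    w = [0.0] * (n_features + 1)              # last weight acts as the bias b
    for _ in range(epochs):
        for x, y in data:
            xb = x + [1.0]                     # constant feature for the bias
            score = sum(wi * xi for wi, xi in zip(w, xb))
            prediction = 1 if score > 0 else -1
            if prediction != y:                # mistake driven: update only on errors
                w = [wi + y * xi for wi, xi in zip(w, xb)]   # w <- w + y * x
    return w

def decision(w, x):
    score = sum(wi * xi for wi, xi in zip(w, x + [1.0]))
    return 1 if score > 0 else -1
```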

14
Perceptron algorithm
Online: can adjust to changing target, over time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem
(convergence, global optimum)

Limitations
Only linear separations
Only converges for linearly separable data
Not really “efficient with many features”

From Gert Lanckriet, Statistical Learning Theory Tutorial

15
Winnow algorithm
Another online algorithm for learning perceptron
weights:
f(x) = sign(wx + b)
Linear, binary classification
Update-rule: again error-driven, but multiplicative
(instead of additive)

From Gert Lanckriet, Statistical Learning Theory Tutorial

16
Winnow algorithm
Initialize: w1 = 0
Updating rule: for each data point xi
If class(xi) != decision(xi, w)
then
  wk+1 ← wk + yi xi          (Perceptron)
  wk+1 ← wk * exp(yi xi)     (Winnow)
  k ← k + 1
else wk+1 ← wk

Function decision(x, w)
If wx + b > 0 return +1
Else return -1

(Figure: the current boundary wk x + b = 0 and the updated boundary wk+1 x + b = 0, with the +1 and -1 regions.)

From Gert Lanckriet, Statistical Learning Theory Tutorial
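
The same loop with the multiplicative update in place of the additive one (again only an illustrative sketch of the slide's wk+1 = wk * exp(yi xi) rule; note that a multiplicative update cannot start from all-zero weights, so this sketch starts from all-ones):

```python
import math

def winnow_style_train(data, n_features, b=0.0, epochs=10):
    """data: list of (x, y) pairs, x a list of n_features floats, y in {-1, +1}."""
    w = [1.0] * n_features                     # non-zero start for multiplicative updates
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            prediction = 1 if score > 0 else -1
            if prediction != y:
                # multiplicative, mistake-driven update: w <- w * exp(y * x)
                w = [wi * math.exp(y * xi) for wi, xi in zip(w, x)]
    return w
```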

17
Perceptron vs. Winnow
Assume
N available features
only K relevant features, with K << N

Perceptron: number of mistakes: O(K N)
Winnow: number of mistakes: O(K log N)
Winnow is more robust to high-dimensional feature spaces

From Gert Lanckriet, Statistical Learning Theory Tutorial

18
Perceptron vs. Winnow
Perceptron
Online: can adjust to changing
target, over time
Advantages

Simple and computationally
efficient
Guaranteed to learn a linearly
separable problem

Limitations

only linear separations
only converges for linearly
separable data
not really “efficient with many
features”

Winnow
Online: can adjust to changing
target, over time
Advantages

Simple and computationally
efficient
Guaranteed to learn a linearly
separable problem
Suitable for problems with
many irrelevant attributes

Limitations

only linear separations
only converges for linearly
separable data
not really “efficient with many
features”

Used in NLP

From Gert Lanckriet, Statistical Learning Theory Tutorial

19
Weka
Winnow in Weka

20
Large margin classifier
Another family of linear
algorithms
Intuition (Vapnik, 1965)
If the classes are linearly separable:
Separate the data
Place hyper-plane “far” from the
data: large margin
Statistical results guarantee
good generalization
(Figure: a hyperplane placed close to the data: BAD)

From Gert Lanckriet, Statistical Learning Theory Tutorial

21
Large margin classifier
Intuition (Vapnik, 1965) if linearly
separable:
Separate the data
Place hyperplane “far” from the
data: large margin
Statistical results guarantee
good generalization
(Figure: a hyperplane placed far from both classes: GOOD)

⇒ Maximal Margin Classifier

From Gert Lanckriet, Statistical Learning Theory Tutorial

22
Large margin classifier
If not linearly separable
Allow some errors
Still, try to place hyperplane
“far” from each class

From Gert Lanckriet, Statistical Learning Theory Tutorial

23
Large Margin Classifiers
Advantages
Theoretically better (better error bounds)

Limitations
Computationally more expensive, large quadratic
programming

24
Support Vector Machine (SVM)
Large Margin Classifier
Linearly separable case
Goal: find the hyperplane that maximizes the margin

(Figure: separating hyperplane wT x + b = 0 with margin M between the planes wT xa + b = 1 and wT xb + b = -1; the points lying on these two planes are the support vectors.)

From Gert Lanckriet, Statistical Learning Theory Tutorial
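
As an illustration (not part of the original slides), a linear SVM for such a separable case can be fit with scikit-learn, which exposes both the learned hyperplane and the support vectors:

```python
from sklearn import svm

# Toy 2-D data: two linearly separable classes
X = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.5]]
y = [-1, -1, 1, 1]

clf = svm.SVC(kernel="linear", C=1000.0)   # large C ~ hard margin on separable data
clf.fit(X, y)

print(clf.coef_, clf.intercept_)           # w and b of the hyperplane w.x + b = 0
print(clf.support_vectors_)                # the points that determine the margin
print(clf.predict([[2.0, 1.0]]))
```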

25
Support Vector Machine (SVM)
Text classification
Hand-writing recognition
Computational biology (e.g., micro-array data)
Face detection
Face expression recognition
Time series prediction

From Gert Lanckriet, Statistical Learning Theory Tutorial

26
Non Linear problem

27
Non Linear problem

28
Non Linear problem
Kernel methods
A family of non-linear algorithms
Transform the non-linear problem into a linear one (in a
different feature space)
Use linear algorithms to solve the linear problem in
the new space

From Gert Lanckriet, Statistical Learning Theory Tutorial

29
Main intuition of Kernel methods
(Copy here from black board)

30
Basic principle kernel methods
Φ : R^d → R^D   (D >> d)

X = [x z]
Φ(X) = [x^2 z^2 xz]

Linear classifier in the new space: wT Φ(x) + b = 0
f(x) = sign(w1 x^2 + w2 z^2 + w3 xz + b)

From Gert Lanckriet, Statistical Learning Theory Tutorial
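
A minimal sketch of this explicit mapping (illustrative only): each 2-D point [x, z] is mapped to [x², z², xz], after which any of the linear learners above (e.g. the perceptron sketch) can be run in the 3-D space:

```python
def phi(point):
    """Explicit feature map: Phi([x, z]) = [x^2, z^2, x*z]."""
    x, z = point
    return [x * x, z * z, x * z]

# Points near the origin vs. points far from it are not linearly separable in 2-D,
# but they become separable after the mapping.
raw = [([0.1, 0.2], -1), ([0.2, -0.1], -1), ([2.0, 1.5], 1), ([-2.5, 1.0], 1)]
mapped = [(phi(p), y) for p, y in raw]
# e.g. w = perceptron_train(mapped, 3) using the earlier perceptron sketch
```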

31
Basic principle kernel methods
Linear separability: more likely in high dimensions
Mapping: Φ maps the input into a high-dimensional
feature space
Classifier: construct a linear classifier in the high-dimensional feature space
Motivation: appropriate choice of Φ leads to linear
separability
We can do this efficiently!

From Gert Lanckriet, Statistical Learning Theory Tutorial

32
Basic principle kernel methods
We can use the linear algorithms seen before
(Perceptron, SVM) for classification in the higher
dimensional space

33
Multi-class classification
Given: some data items that belong to one of M
possible classes
Task: Train the classifier and predict the class for a
new data item
Geometrically: harder problem, no more simple
geometry

34
Multi-class classification

35
Multi-class classification: Examples
Author identification
Language identification
Text categorization (topics)

36
(Some) Algorithms for Multi-class
classification
Linear
Parallel class separators: Decision Trees
Non parallel class separators: Naïve Bayes

Non Linear
K-nearest neighbors

37
Linear, parallel class separators
(ex: Decision Trees)

38
Linear, NON parallel class separators
(ex: Naïve Bayes)

39
Non Linear (ex: k Nearest Neighbor)

40
Decision Trees
A decision tree is a classifier in the form of a tree
structure, where each node is either:
Leaf node - indicates the value of the target attribute (class)
of examples, or
Decision node - specifies some test to be carried out on a
single attribute-value, with one branch and sub-tree for each
possible outcome of the test.

A decision tree can be used to classify an example
by starting at the root of the tree and moving through
it until a leaf node, which provides the classification of
the instance.
http://dms.irb.hr/tutorial/tut_dtrees.php

41
Training Examples
Goal: learn when we can play Tennis and when we cannot
Day | Outlook  | Temp. | Humidity | Wind   | Play Tennis
D1  | Sunny    | Hot   | High     | Weak   | No
D2  | Sunny    | Hot   | High     | Strong | No
D3  | Overcast | Hot   | High     | Weak   | Yes
D4  | Rain     | Mild  | High     | Weak   | Yes
D5  | Rain     | Cool  | Normal   | Weak   | Yes
D6  | Rain     | Cool  | Normal   | Strong | No
D7  | Overcast | Cool  | Normal   | Weak   | Yes
D8  | Sunny    | Mild  | High     | Weak   | No
D9  | Sunny    | Cold  | Normal   | Weak   | Yes
D10 | Rain     | Mild  | Normal   | Strong | Yes
D11 | Sunny    | Mild  | Normal   | Strong | Yes
D12 | Overcast | Mild  | High     | Strong | Yes
D13 | Overcast | Hot   | Normal   | Weak   | Yes
D14 | Rain     | Mild  | High     | Strong | No
42
Decision Tree for PlayTennis
Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
43
Decision Tree for PlayTennis
(Same decision tree as on the previous slide.)

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
44
Decision Tree for PlayTennis
New example: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak
PlayTennis? → No

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp

45
Decision Tree for Reuter classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

46
Decision Tree for Reuter classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

47
Building Decision Trees
Given training data, how do we construct them?
The central focus of the decision tree growing
algorithm is selecting which attribute to test at each
node in the tree. The goal is to select the attribute
that is most useful for classifying examples.
Top-down, greedy search through the space of
possible decision trees.
That is, it picks the best attribute and never looks back to
reconsider earlier choices.

48
Building Decision Trees
Splitting criterion
Finding the features and the values to split on
– for example, why test first “cts” and not “vs”?
– Why test on “cts < 2” and not “cts < 5” ?

Split that gives us the maximum information gain (or the
maximum reduction of uncertainty)

Stopping criterion
When all the elements at one node have the same class,
no need to split further

In practice, one first builds a large tree and then one prunes it
back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing ,
Manning and Schuetze for a good introduction
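
A small sketch of the information-gain computation behind this splitting criterion (an illustration, not code from the book): entropy of the class labels before the split, minus the weighted entropy after grouping the examples by an attribute.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """rows: list of attribute-value tuples; labels: the class of each row."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

On the PlayTennis table above, for example, splitting on Outlook gives a gain of about 0.25 bits, more than Humidity, Wind, or Temperature, which is why Outlook ends up at the root of the tree.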

49
Decision Trees: Strengths
Decision trees are able to generate understandable
rules.
Decision trees perform classification without requiring
much computation.
Decision trees are able to handle both continuous
and categorical variables.
Decision trees provide a clear indication of which
features are most important for prediction or
classification.

http://dms.irb.hr/tutorial/tut_dtrees.php

50
Decision Trees: weaknesses
Decision trees are prone to errors in classification
problems with many classes and relatively small
number of training examples.
Decision tree can be computationally expensive to
train.
Need to compare all possible splits
Pruning is also expensive

Most decision-tree algorithms only examine a single
field at a time. This leads to rectangular classification
boxes that may not correspond well with the actual
distribution of records in the decision space.

http://dms.irb.hr/tutorial/tut_dtrees.php

51
Decision Trees
Decision Trees in Weka

52
Naïve Bayes
More powerful than Decision Trees

(Figure: decision boundaries of Decision Trees vs. Naïve Bayes)

53
Naïve Bayes Models
Graphical Models:
graph theory plus
probability theory
Nodes are variables
Edges are conditional
probabilities

(Figure: node A with edges to nodes B and C; parameters P(A), P(B|A), P(C|A))
54
Naïve Bayes Models
Graphical Models:
graph theory plus
probability theory
Nodes are variables
Edges are conditional
probabilities
Absence of an edge
between nodes implies
independence between
the variables of the
nodes

(Figure: node A with edges to nodes B and C; parameters P(A), P(B|A), P(C|A).
No edge between B and C, so P(C|A,B) = P(C|A).)
55
Naïve Bayes for text classification

Foundations of Statistical Natural Language Processing,
Manning and Schuetze

56
Naïve Bayes for text classification

(Figure: topic node "earn" with word nodes Shr, 34, cts, vs, per, shr)
57
Naïve Bayes for text classification
(Figure: Topic node with word nodes w1, w2, w3, w4, …, wn-1, wn)

The words depend on the topic: P(wi| Topic)
P(cts|earn) > P(tennis| earn)

Naïve Bayes assumption: all words are independent given the topic
From the training set we learn the probabilities P(wi|Topic) for each word
and for each topic
58
Naïve Bayes for text classification
(Figure: Topic node with word nodes w1, w2, w3, w4, …, wn-1, wn)

To classify a new example:
Calculate P(Topic | w1, w2, … wn) for each topic
Bayes decision rule:
Choose the topic T’ for which
P(T’ | w1, w2, … wn) > P(T | w1, w2, … wn) for each T≠ T’
59
Naïve Bayes: Math
Naïve Bayes defines a joint probability distribution:
P(Topic, w1, w2, … wn) = P(Topic) ∏ P(wi|Topic)
We learn P(Topic) and P(wi| Topic) in training
Test: we need P(Topic | w1, w2, … wn)
P(Topic | w1, w2, … wn) = P(Topic , w1, w2, … wn) / P(w1, w2, … wn)
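
A minimal sketch of this classification step (illustrative; it assumes the probabilities were already estimated on the training set and ignores smoothing): since P(w1, w2, … wn) is the same for every topic, it is enough to compare P(Topic) ∏ P(wi|Topic), usually as a sum of logs.

```python
import math

def classify(words, prior, likelihood):
    """prior: {topic: P(topic)}; likelihood: {topic: {word: P(word|topic)}}."""
    best_topic, best_score = None, float("-inf")
    for topic in prior:
        # log P(topic) + sum_i log P(w_i|topic); the shared denominator
        # P(w_1, ..., w_n) can be dropped when comparing topics.
        score = math.log(prior[topic]) + sum(
            math.log(likelihood[topic].get(w, 1e-10)) for w in words)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic
```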

60
Naïve Bayes: Strengths
Very simple model
Easy to understand
Very easy to implement

Very efficient, fast training and classification
Modest space storage
Widely used because it works really well for text
categorization
Linear, but non parallel decision boundaries

61
Naïve Bayes: weaknesses
Naïve Bayes independence assumption has two consequences:
The linear ordering of words is ignored (bag of words model)
The words are independent of each other given the class: False
– President is more likely to occur in a context that contains election than
in a context that contains poet

Naïve Bayes assumption is inappropriate if there are strong
conditional dependencies between the variables
(But even if the model is not “right”, Naïve Bayes models do well
in a surprisingly large number of cases because often we are
interested in classification accuracy and not in accurate
probability estimations)

62
Naïve Bayes
Naïve Bayes in Weka

63
k Nearest Neighbor Classification
Nearest Neighbor classification rule: to classify a new
object, find the object in the training set that is most
similar. Then assign the category of this nearest
neighbor
K Nearest Neighbor (KNN): consult k nearest
neighbors. Decision based on the majority category
of these neighbors. More robust than k = 1
An example of a similarity measure often used in NLP is cosine
similarity
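
A small sketch of this rule with cosine similarity as the similarity measure (illustrative only):

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(x, training, k=3):
    """training: list of (vector, category); returns the majority category
    among the k training items most cosine-similar to x."""
    neighbors = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    return Counter(category for _, category in neighbors).most_common(1)[0][0]
```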

64
1-Nearest Neighbor

65
1-Nearest Neighbor

66
3-Nearest Neighbor

67
3-Nearest Neighbor
But this is closer..
We can weight neighbors
according to their similarity

Assign the category of the majority of the neighbors
68
k Nearest Neighbor Classification
Strengths
Robust
Conceptually simple
Often works well
Powerful (arbitrary decision boundaries)

Weaknesses
Performance is very dependent on the similarity measure
used (and to a lesser extent on the number of neighbors k
used)
Finding a good similarity measure can be difficult
Computationally expensive
69
Summary
Algorithms for Classification
Linear versus non linear classification
Binary classification
Perceptron
Winnow
Support Vector Machines (SVM)
Kernel Methods

Multi-Class classification
Decision Trees
Naïve Bayes
K nearest neighbor

On Wednesday: Weka

70
