Machine Learning Extra : 1 – BMVA Summer School 2014
The bits the whirlwind tour left out ...
BMVA Summer School 2014 – extra background slides
Machine Learning Extra : 2 – BMVA Summer School 2014
Machine Learning
Definition:
– “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
[Mitchell, 1997]
Machine Learning Extra : 3 – BMVA Summer School 2014
Algorithm to construct decision trees ….
Machine Learning Extra : 4 – BMVA Summer School 2014
Building Decision Trees – ID3
• node = root of tree
• Main loop:
– A = “best” decision attribute for next node
– .....
But which attribute is best to split on ? (see the entropy / information-gain sketches below)
Machine Learning Extra : 5 – BMVA Summer School 2014
Entropy in machine learning
Entropy : a measure of impurity
– S is a sample of training examples
– p⊕ is the proportion of positive examples in S
– p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S:
Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
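As a concrete illustration (not from the slides), a minimal Python sketch of this entropy measure; the function name and the use of base-2 logarithms follow the usual ID3 convention and are our assumptions:

```python
import math

def entropy(class_counts):
    """Entropy (in bits) of a sample S summarised as per-class counts."""
    total = sum(class_counts)
    ent = 0.0
    for count in class_counts:
        if count == 0:
            continue  # by convention 0 * log(0) = 0
        p = count / total
        ent -= p * math.log2(p)
    return ent

# e.g. S with 9 positive and 5 negative examples:
print(entropy([9, 5]))  # ~0.940 bits
```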
Machine Learning Extra : 6 – BMVA Summer School 2014
Information Gain – reduction in Entropy
Gain(S,A) = expected reduction in entropy due to splitting on attribute A
– i.e. expected reduction in impurity in the data
– (improvement in consistent data sorting)
Machine Learning Extra : 7 – BMVA Summer School 2014
Information Gain – reduction in Entropy
– reduction in entropy in set of examples S if split on attribute A
– Sv = subset of S for which attribute A has value v
– Gain(S,A) = original entropy – SUM(entropy of sub-nodes if split on A):
Gain(S,A) = Entropy(S) − ∑v∈Values(A) ( |Sv| / |S| ) Entropy(Sv)
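A matching sketch of Gain(S, A), reusing the entropy() helper above; representing examples as dicts and the argument names are illustrative assumptions on our part:

```python
from collections import Counter

def information_gain(examples, attribute, target="label"):
    """Gain(S, A) = Entropy(S) - sum over values v of (|Sv|/|S|) * Entropy(Sv)."""
    def ent(rows):
        return entropy(list(Counter(r[target] for r in rows).values()))

    total = len(examples)
    gain = ent(examples)  # original entropy of S
    for v in {r[attribute] for r in examples}:
        s_v = [r for r in examples if r[attribute] == v]  # Sv
        gain -= (len(s_v) / total) * ent(s_v)             # weighted sub-node entropy
    return gain
```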
Machine Learning Extra : 8 – BMVA Summer School 2014
Information Gain – reduction in Entropy
Information Gain :
– “information provided about the target function given the value of some attribute A”
– How well does A sort the data into the required classes?
Generalise to c classes :
– (not just ⊕ or ⊖)
Entropy(S) = − ∑i=1…c pi log2 pi
Machine Learning Extra : 9 – BMVA Summer School 2014
Building Decision Trees
• Selecting the Next Attribute
– which attribute should we split on next? (see the selection sketch below)
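A minimal sketch of this selection step, and of the ID3 main loop it plugs into, using the helpers above; the data layout, function names and stopping rules follow the usual textbook formulation and are illustrative:

```python
from collections import Counter

def best_attribute(examples, attributes, target="label"):
    """The 'best' decision attribute = the one with highest information gain."""
    return max(attributes, key=lambda a: information_gain(examples, a, target))

def id3(examples, attributes, target="label"):
    """Recursive ID3 sketch: returns a class label, or {(attr, value): subtree}."""
    labels = [r[target] for r in examples]
    if len(set(labels)) == 1:
        return labels[0]  # node is pure: stop
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # out of attributes: majority
    a = best_attribute(examples, attributes, target)
    rest = [x for x in attributes if x != a]
    return {(a, v): id3([r for r in examples if r[a] == v], rest, target)
            for v in {r[a] for r in examples}}
```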
Machine Learning Extra : 11 – BMVA Summer School 2014
Backpropagation Algorithm ….
Machine Learning Extra : 12 – BMVA Summer School 2014
Backpropagation Algorithm
Assume we have:
– input examples d = {1...D}
• each is a pair {xd, td} = {input vector, target vector}
– node index n = {1 … N}
– weight wji connects node i → j
– input xji is the input on the connection node i → j
• corresponding weight = wji
– output error for node n is δn
• similar to (o – t)
[Figure: feed-forward network – input layer (input, x), hidden layer, output layer (output vector, Ok); node index {1 … N}]
Machine Learning Extra : 13 – BMVA Summer School 2014
Backpropagation Algorithm
(1) Input example d
(2) Output layer error, based on :
– difference between output and target (t - o)
– derivative of sigmoid function
i.e. for sigmoid units: δk = ok (1 − ok) (tk − ok)
(3) Hidden layer error, proportional to node contribution to output error:
δh = oh (1 − oh) ∑k whk δk
(4) Update weights: wji ← wji + η δj xji
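Pulling steps (1)–(4) together, a compact runnable Python sketch for one hidden layer of sigmoid units; the layer sizes, learning rate η = 0.5, weight initialisation and XOR toy data are illustrative assumptions, not values from the slides:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative sizes: 2 inputs, 2 hidden units, 1 output unit.
n_in, n_hid, n_out = 2, 2, 1
eta = 0.5  # learning rate (assumed)

# Index 0 of each weight row is a bias weight (bias input fixed at 1.0).
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]

def train_one(x, t):
    # (1) input example: forward pass
    xi = [1.0] + x
    o_h = [sigmoid(sum(w * v for w, v in zip(w_hid[h], xi))) for h in range(n_hid)]
    hi = [1.0] + o_h
    o_k = [sigmoid(sum(w * v for w, v in zip(w_out[k], hi))) for k in range(n_out)]
    # (2) output layer error: delta_k = o_k (1 - o_k) (t_k - o_k)
    d_k = [o * (1 - o) * (tk - o) for o, tk in zip(o_k, t)]
    # (3) hidden layer error: delta_h = o_h (1 - o_h) * sum_k w_hk delta_k
    d_h = [o_h[h] * (1 - o_h[h]) *
           sum(w_out[k][h + 1] * d_k[k] for k in range(n_out))
           for h in range(n_hid)]
    # (4) weight updates: w_ji <- w_ji + eta * delta_j * x_ji
    for k in range(n_out):
        for j in range(n_hid + 1):
            w_out[k][j] += eta * d_k[k] * hi[j]
    for h in range(n_hid):
        for i in range(n_in + 1):
            w_hid[h][i] += eta * d_h[h] * xi[i]

# Stochastic gradient descent: update after each single training sample.
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
for _ in range(5000):
    x, t = random.choice(data)
    train_one(x, t)
```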
Machine Learning Extra : 14 – BMVA Summer School 2014
Backpropagation
Termination criteria :
– number of iterations reached
– or error below suitable bound
[Equations: output layer error; hidden layer error; all weights updated using the relevant error]
Machine Learning Extra : 15 – BMVA Summer School 2014
Backpropagation
[Figure: input layer (input, x) → hidden layer, unit h → output layer, unit k → output vector, Ok]
Machine Learning Extra : 16 – BMVA Summer School 2014
Backpropagation
δh is expressed as a weighted sum of the output layer errors δk to which it contributes (i.e. whk > 0)
[Figure: input layer (input, x) → hidden layer, unit h → output layer, unit k → output vector, Ok]
Machine Learning Extra : 17 – BMVA Summer School 2014
Backpropagation
Error is propagated backwards from network output ....
to weights of output layer ....
to weights of the hidden layer …
Hence the name: backpropagation
[Figure: input layer (input, x) → hidden layer, unit h → output layer, unit k → output vector, Ok]
Machine Learning Extra : 18 – BMVA Summer School 2014
Backpropagation
Repeat these stages for every hidden layer in a multi-layer network:
(using error δi where xji > 0)
.......
[Figure: input layer (input, x) → hidden layer(s), unit h → output layer, unit k → output vector, Ok]
Machine Learning Extra : 19 – BMVA Summer School 2014
Backpropagation
Error is propagated backwards from network output ....
to weights of output layer ....
over weights of all N hidden layers …
Hence the name: backpropagation
.......
[Figure: input layer (input, x) → hidden layer(s), unit h → output layer, unit k → output vector, Ok]
Machine Learning Extra : 20 – BMVA Summer School 2014
Backpropagation
Will perform gradient descent over the weight space of {wji}, for all connections i → j in the network
Stochastic gradient descent
– as updates are based on training one sample at a time (as in the sketch above)
Machine Learning Extra : 21 – BMVA Summer School 2014
Understanding (and believing) the SVM stuff ….
Machine Learning Extra : 22 – BMVA Summer School 2014
Remedial Note: equations of 2D lines
Line: w⃗ · x⃗ + b = 0
where: w⃗ and x⃗ are 2D vectors
– b : offset from origin
– w⃗ : normal to line
2D LINES REMINDER
Machine Learning Extra : 23 – BMVA Summer School 2014
Remedial Note: equations of 2D lines
http://www.mathopenref.com/coordpointdisttrig.html
2D LINES REMINDER
Machine Learning Extra : 24 – BMVA Summer School 2014
Remedial Note: equations of 2D lines
For a defined line equation: w⃗ · x⃗ + b = 0 (w⃗, b fixed; w⃗ = normal to line)
Insert point p⃗ into the equation …...
– Result is +ve if the point is on one side of the line (i.e. > 0)
– Result is -ve if the point is on the other side of the line (i.e. < 0)
Result is the distance (+ve or -ve) of the point from the line, given by: w⃗ · p⃗ + b, for: ||w⃗|| = 1
2D LINES REMINDER
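A small sketch of this signed point-to-line test; the function name and the example line are ours, and for a general (unnormalised) w⃗ we divide by ||w⃗||:

```python
import math

def signed_distance(w, b, p):
    """Signed distance of 2D point p from the line w . x + b = 0:
    +ve on the side the normal w points towards, -ve on the other."""
    return (w[0] * p[0] + w[1] * p[1] + b) / math.hypot(w[0], w[1])

# Line x + y - 1 = 0: the origin lies on its negative side.
print(signed_distance((1.0, 1.0), -1.0, (0.0, 0.0)))  # ~ -0.707
```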
Machine Learning Extra : 25 – BMVA Summer School 2014
Linear Separator
• Instances (i.e. examples) {xi, yi}
– xi = point in instance space (Rn) made up of n attributes
– yi = class value for classification of xi
– classification of example: function f(x) = y = {+1, -1}, i.e. 2 classes
Want a linear separator. Can view this as a constraint satisfaction problem:
w⃗ · x⃗i + b ≥ +1 for yi = +1
w⃗ · x⃗i + b ≤ −1 for yi = -1
Equivalently, yi (w⃗ · x⃗i + b) ≥ 1
N.B. we have a vector of weights coefficients w⃗
Machine Learning Extra : 26 – BMVA Summer School 2014
Linear Separator
If we define the distance of the nearest point to the margin as 1
→ width of margin is 2 / ||w⃗|| (i.e. equal width each side)
We thus want to maximize: 2 / ||w⃗||
finding the parameters: {w⃗, b}
[Figure: maximum-margin boundary between the y = +1 and y = -1 examples; f(x) = y = {+1, -1}, i.e. 2 classes]
Machine Learning Extra : 27 – BMVA Summer School 2014
which is equivalent to minimizing: ½ ||w⃗||²
Machine Learning Extra : 28 – BMVA Summer School 2014
…............. back to main slides
Machine Learning Extra : 29 – BMVA Summer School 2014
So ….
Find the “hyperplane” (i.e. boundary) with:
a) maximum margin
b) minimum number of (training) examples on the wrong side of the chosen boundary
(i.e. minimal penalties due to C)
Solve via optimization (in polynomial time/complexity)
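As a hedged illustration of handing this optimization to an off-the-shelf solver: scikit-learn is our choice of library (not one named in the slides), and the toy data and C value are illustrative:

```python
# Minimal soft-margin SVM fit; scikit-learn is an assumed dependency.
from sklearn import svm

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # toy 2D points (illustrative)
y = [-1, -1, +1, +1]                  # two classes, as on the slide

clf = svm.SVC(kernel="linear", C=1.0)  # C penalises margin violations
clf.fit(X, y)
print(clf.predict([[0.9, 0.9]]))       # expected: [1]
```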
Machine Learning Extra : 30 – BMVA Summer School 2014
Example: non-linear separation (red / blue data items on 2D plane)
– Kernel projection to higher dimensional space
– Find hyperplane separator (plane in 3D) via optimization
– Non-linear boundary in original dimension (e.g. circle in 2D) defined by planar boundary (cut) in 3D
Machine Learning Extra : 31 – BMVA Summer School 2014
.... but it is all about the data!
Machine Learning Extra : 32 – BMVA Summer School 2014
Desirable Data Properties
Machine learning is a Data Driven Approach – the Data is important!
Ideally training/testing data used for learning must be:
– Unbiased
• towards any given subset of the space of examples ...
– Representative
• of the “real-world” data to be encountered in use/deployment
– Accurate
• inaccuracies in training/testing produce inaccurate results
– Available
• the more training/testing data available the better the results
• greater confidence in the results can be achieved
Machine Learning Extra : 33 – BMVA Summer School 2014
Data Training Methodologies
Simple approach : Data Splits
– split overall data set into separate training and test sets
• No established rule, but 80%:20%, 70%:30% or ⅔:⅓ training to testing splits are common
– Train on one, test on the other
– Test error = error on the test set
– Training error = error on training set
– Weakness: susceptible to bias in data sets or “over-fitting”
• Also less data available for training
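A minimal sketch of such a split; scikit-learn is an assumed dependency, the toy data is illustrative, and 70%:30% is just one of the common ratios above:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]    # toy feature vectors (illustrative)
y = [i % 2 for i in range(10)]  # toy class labels (illustrative)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # 70%:30% train:test
print(len(X_train), len(X_test))  # 7 3
```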
Machine Learning Extra : 34 – BMVA Summer School 2014
Data Training Methodologies
More advanced (and robust): K-fold Cross Validation
– Randomly split (all) the data into k subsets
– For i = 1 to k
• train using all the data not in the i-th subset
• test resulting learned [classifier|function …] using the i-th subset
– report mean error over all k tests
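A minimal sketch of this procedure; the choice of k = 5, the decision-tree learner (tying back to the earlier slides) and the toy data are illustrative assumptions, with scikit-learn again assumed as the dependency:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = [[i] for i in range(10)]    # toy feature vectors (illustrative)
y = [i % 2 for i in range(10)]  # toy class labels (illustrative)

# 5-fold cross validation: train on 4 subsets, test on the held-out one,
# rotating through all 5 folds; report the mean over all folds.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print(scores.mean())
```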
Machine Learning Extra : 35 – BMVA Summer School 2014
Key Summary Statistics #1
tp = true positive / tn = true negative
fp = false positive / fn = false negative
Often quoted or plotted when comparing ML techniques
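The slide's actual table of statistics is an image lost in extraction; below is a sketch of the standard measures built from these four counts (our reconstruction, so the slide may have shown a different selection):

```python
def summary_statistics(tp, tn, fp, fn):
    """Standard summary statistics from the four counts above."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),  # positive predictive value
        "recall":      tp / (tp + fn),  # sensitivity / true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

print(summary_statistics(tp=40, tn=45, fp=5, fn=10))
```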
Machine Learning Extra : 36 – BMVA Summer School 2014
Kappa Statistic
Measure of classification of “N items into C mutually exclusive categories”:
κ = ( Pr(a) − Pr(e) ) / ( 1 − Pr(e) )
Pr(a) = probability of success of classification ( = accuracy)
Pr(e) = probability of success due to chance
– e.g. 2 categories = 50% (0.5), 3 categories = 33% (0.33) ….. etc.
– Pr(e) can be replaced with Pr(b) to measure agreement between classifiers/techniques a and b
[Cohen, 1960]
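A one-line sketch of the statistic as defined above (the example values are illustrative):

```python
def kappa(pr_a, pr_e):
    """Cohen's kappa [Cohen, 1960]: chance-corrected classification success."""
    return (pr_a - pr_e) / (1.0 - pr_e)

# e.g. 80% accuracy on a 2-category problem (chance level 0.5):
print(kappa(0.80, 0.50))  # 0.6
```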