Classification and Information Retrieval metrics for machine learning

Evaluation Metrics for Classification and Information Retrieval

Who am I
● Katya
● Natural Language Processing
● CTO at Majio
● Sloth.Works - matches Candidates to Jobs
● Twitter - @kitkate87
● Medium - @ekaterinamihailova

Content
● Classification metrics
● Information Retrieval metrics
● Majio’s evaluation metrics
● Design your own metric

General ML flow
Define goals and metrics → Gather and clean data → Build ML model → Evaluate results → Analyze results

Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and non sloth - 50% sloth pictures

Confusion matrix
                         Truth: sloth           Truth: not a sloth
Algorithm: sloth         True Positive (TP)     False Positive (FP)
Algorithm: not a sloth   False Negative (FN)    True Negative (TN)

Accuracy
acc = T / (T + F)
    = (TP + TN) / ALL

● Setup
● Goals
○ Distinguish between a sloth and non sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and non sloth - 1% sloth pictures

Accuracy with 1% sloth pictures
Algorithm - always says it is not a sloth
acc = 99%

Accuracy per class
accP = TP / (TP + FN)
accN = TN / (TN + FP)
acc = (accP + accN) / 2

Accuracy per class with 1% sloth pictures
acc = 50%
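
The gap between the two numbers (99% versus 50%) can be reproduced with a minimal sketch. The data set below is hypothetical (1,000 pictures, 10 of them sloths), and the function names are illustrative, not taken from the talk.

```python
def confusion_counts(truth, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = sloth, 0 = not a sloth)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """acc = (TP + TN) / ALL"""
    return (tp + tn) / (tp + fp + fn + tn)


def accuracy_per_class(tp, fp, fn, tn):
    """acc = (accP + accN) / 2, treating an empty class as accuracy 0."""
    acc_p = tp / (tp + fn) if (tp + fn) else 0.0
    acc_n = tn / (tn + fp) if (tn + fp) else 0.0
    return (acc_p + acc_n) / 2


# Hypothetical data set: 1,000 pictures, 1% of them sloths.
truth = [1] * 10 + [0] * 990
# The algorithm that always says "not a sloth".
always_not_sloth = [0] * len(truth)

counts = confusion_counts(truth, always_not_sloth)
plain_acc = accuracy(*counts)
per_class_acc = accuracy_per_class(*counts)
print(f"accuracy = {plain_acc:.0%}, accuracy per class = {per_class_acc:.0%}")
```

Averaging the two per-class accuracies gives each class the same weight regardless of how many pictures it contains, which is why the trivial algorithm drops to 50%.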

● Setup
● Goals
○ Distinguish between a sloth and non sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and non sloth - 1% sloth pictures - accuracy per class
○ Distinguish between a sloth and non sloth and ask a person if not sure

● Setup
● Goals
○ Distinguish between a sloth and non sloth and ask a person if not sure - log loss
○ Camera in the forest - 1% sloth pictures

Precision
p = TP / (TP + FP)

Precision with 1% sloth pictures
Algorithm - correctly identifies exactly one sloth and says
everything else is not a sloth
p = 100%

Recall (True positive rate)
r = TP / (TP + FN)

Recall with 1% sloth pictures
Algorithm - always says it is a sloth
r = 100%

f1-measure with 1% sloth pictures
f1 = 2 * p * r / (p + r)
Algorithm - always says it is a sloth
f ≈ 2%
Algorithm - always says it is NOT a sloth except for 1
f ~ 0%
Algorithm has 30% precision and 70% recall
f = 42%
Algorithm has 50% precision and 50% recall
f = 50%
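
As a rough check of these cases, here is a small sketch that uses the standard definition f1 = 2 * p * r / (p + r). The counts are assumptions made for illustration (10,000 pictures, 100 of them sloths); the slides give only the 1% ratio.

```python
def precision(tp, fp):
    """p = TP / (TP + FP); taken as 0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """r = TP / (TP + FN); taken as 0 when there are no positives at all."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical forest-camera data set: 10,000 pictures, 100 of them sloths (1%).
# "Always says it is a sloth": every sloth is found, every non-sloth is a false alarm.
f1_always_sloth = f1(precision(tp=100, fp=9900), recall(tp=100, fn=0))
# "Says sloth exactly once, correctly": perfect precision, 1 of 100 sloths found.
f1_single_hit = f1(precision(tp=1, fp=0), recall(tp=1, fn=99))
# The case with 30% precision and 70% recall.
f1_30_70 = f1(0.3, 0.7)

print(f"{f1_always_sloth:.1%} {f1_single_hit:.1%} {f1_30_70:.1%}")
```

Under these assumed counts both degenerate algorithms land at roughly 2%, close to zero, because the harmonic mean is dominated by the smaller of the two values.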

Parametrized f-measure
f(b) = (1 + b) * p * r / (b * p + r)
(b plays the role of β² in the usual F-beta notation; b = 1 gives f1)

Parametrized f-measure with 1% sloth pictures
b = 3; f = 4 * p * r / (3 * p + r)
30% precision and 70% recall: f = 52.5%
50% precision and 50% recall: f = 50%
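
A minimal sketch of the parametrized f-measure as written on the slide, where the parameter b enters linearly (so it corresponds to β² in the more common F-beta formulation). The function name is illustrative.

```python
def f_measure(p, r, b=1.0):
    """Slide's form: f(b) = (1 + b) * p * r / (b * p + r).

    b > 1 weights recall more heavily; b = 1 reduces to the f1-measure.
    """
    denominator = b * p + r
    return (1 + b) * p * r / denominator if denominator else 0.0


f_recall_heavy = f_measure(0.3, 0.7, b=3)   # 30% precision, 70% recall
f_balanced = f_measure(0.5, 0.5, b=3)       # 50% precision, 50% recall
f_plain = f_measure(0.3, 0.7)               # b = 1, i.e. f1

print(f"{f_recall_heavy:.1%} {f_balanced:.1%} {f_plain:.1%}")
```

With b = 3 the algorithm with higher recall (70%) overtakes the balanced one, whereas with b = 1 it scores lower.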

● Setup
● Goals
○ Camera in the forest - 1% sloth pictures - f-measure
○ Search results for sloth and non sloth - 50% sloth pictures

False positive rate
fpr = FP / (FP + TN)

False Positive Rate with 1% sloth pictures
fpr = 0%

Search Results
Position: 1 2 3 4 5 6 7 8
Relevant: 1 0 0 1 1 1 1 1

TPR and FPR for different points
● At point 1 - TPR = 2%, FPR = 0%
● At point 25 - TPR = 40%, FPR = 10%
● At point 50 - TPR = 74%, FPR = 26%
● At point 75 - TPR = 96%, FPR = 54%
● At point 100 - TPR = 100%, FPR = 100%

ROC Curve and AUC (Area Under the Curve)
[Plot: True Positive Rate (y-axis) against False Positive Rate (x-axis)]
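
One way to sketch how such a curve and its area are obtained from a ranked result list: walk down the ranking, record (FPR, TPR) after each position, and integrate with the trapezoidal rule. The 100-result list behind the percentages quoted in the talk is not reproduced in the slides, so the short 8-result ranking 1 0 0 1 1 1 1 1 is used here instead. The function names are illustrative.

```python
def roc_points(ranked_relevance):
    """Return (FPR, TPR) pairs after each cut-off in a ranked list of 0/1 labels.

    Assumes the list contains at least one relevant and one non-relevant item.
    """
    positives = sum(ranked_relevance)
    negatives = len(ranked_relevance) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rel in ranked_relevance:
        if rel:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


ranking = [1, 0, 0, 1, 1, 1, 1, 1]
points = roc_points(ranking)
roc_auc = area_under_curve(points)
print(points)
print(f"AUC = {roc_auc:.3f}")
```

The AUC equals the fraction of (relevant, non-relevant) pairs that the ranking orders correctly, so a ranking that places most relevant results below the non-relevant ones scores well under 0.5.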

● Setup
● Goals
○ Search results for sloth and non sloth - 50% sloth pictures - AUC
○ Search results for sloth and non sloth - 1% sloth pictures

Precision and Recall at different points
● At point 1 - Recall = 2%, Precision = 100%

Precision - Recall curve
[Plot: Precision (y-axis) against Recall (x-axis)]
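
The points of a precision-recall curve can be computed by cutting the ranked list after each position. The sketch below uses the short ranking 1 0 0 1 1 1 1 1 rather than the 100-result list referred to in the talk, and the function name is illustrative.

```python
def precision_recall_points(ranked_relevance):
    """Return (recall, precision) after each cut-off in a ranked list of 0/1 labels.

    Assumes the list contains at least one relevant item.
    """
    total_relevant = sum(ranked_relevance)
    points = []
    hits = 0
    for position, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / position))
    return points


pr_points = precision_recall_points([1, 0, 0, 1, 1, 1, 1, 1])
for recall_at_k, precision_at_k in pr_points:
    print(f"recall = {recall_at_k:.2f}, precision = {precision_at_k:.2f}")
```

Unlike FPR, precision does not use the true negatives, so this curve stays informative when non-sloth pictures vastly outnumber sloth pictures.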

Average Precision
1 0 0 1 1 1 1 1

Average Precision
AP = (1/1 + 0 + 0 + 2/4 + 3/5 + 4/6 + 5/7 + 6/8) / 6 = 70.5%

Average Precision
0 0 1 1 1 1 1 1
AP = (0 + 0 + 1/3 + 2/4 + 3/5 + 4/6 + 5/7 + 6/8) / 6 = 59.4%

Average Precision
1 1 1 1 1 1 0 0
AP = (1/1 + 2/2 + 3/3 + 4/4 + 5/5 + 6/6 + 0 + 0) / 6 = 100%

Mean Average Precision
MAP = (70.5% + 59.4% + 100%) / 3 = 76.64%

Geometric Mean Average Precision
GMAP = ∛(70.5% * 59.4% * 100%) ≈ 74.82%
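
The three Average Precision values and their arithmetic and geometric means can be recomputed with a short sketch. The three rankings are read off the worked sums above; the function name is illustrative, and the AP here divides by the number of relevant results in the list, which matches the slides because all six relevant results are retrieved.

```python
import math


def average_precision(ranked_relevance):
    """Mean of precision@k over the positions k that hold a relevant result."""
    hits = 0
    precision_sum = 0.0
    for position, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


rankings = [
    [1, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 0],
]
aps = [average_precision(r) for r in rankings]

# Arithmetic mean (MAP) and geometric mean (GMAP) over the queries.
map_score = sum(aps) / len(aps)
gmap_score = math.prod(aps) ** (1 / len(aps))

print([f"{ap:.1%}" for ap in aps])
print(f"MAP = {map_score:.2%}, GMAP = {gmap_score:.2%}")
```

The geometric mean is pulled down more strongly by a single poorly ranked query, which is the usual reason for reporting GMAP alongside MAP.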

● Setup
● Goals
○ Search results for sloth and non sloth - 1% sloth pictures - MAP, GMAP
○ Create image search for sloths with different relevance

Cumulative Gain
2 0 1 2 2 1 1 2
CG = ∑ rel(i) = 11

Discounted Cumulative Gain
2 0 1 2 2 1 1 2
DCG = ∑ rel(i) / log2(i + 1)

Discounted Cumulative Gain
2 0 1/2 0.86 0.77 0.36 1/3 0.63
DCG = 5.46

Ideal Discounted Cumulative Gain
Ideal order: 2 2 2 2 1 1 1 0
2 1.26 1 0.86 0.39 0.36 1/3 0
IDCG = 6.20

Normalized Discounted Cumulative Gain
NDCG = DCG / IDCG = 5.46 / 6.20 = 0.88
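
The chain CG → DCG → IDCG → NDCG can be sketched directly from the formula DCG = ∑ rel(i) / log2(i + 1), using the graded relevance list 2 0 1 2 2 1 1 2 from the slides. The function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel(i) / log2(i + 1), positions from 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the same grades sorted best-first (the ideal order)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


relevances = [2, 0, 1, 2, 2, 1, 1, 2]
cg = sum(relevances)
dcg_value = dcg(relevances)
idcg_value = dcg(sorted(relevances, reverse=True))
ndcg_value = ndcg(relevances)

print(f"CG = {cg}, DCG = {dcg_value:.2f}, IDCG = {idcg_value:.2f}, NDCG = {ndcg_value:.2f}")
```

Because the ideal ordering is built from the same grades, NDCG always falls between 0 and 1 and can be compared across queries with different numbers of relevant results.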

● Setup
● Goals
○ Search results for sloth and non sloth - 1% sloth pictures - MAP, GMAP
○ Create image search for sloths with different relevance - NDCG

Majio Usecase
Matching Candidates to Job
1 2 3

Evaluating Matching Candidates to Job - 1
1 3 2 1 1 2 2 1 2 2
( TP/T - FP/T + 1 ) / 2

1 3 2 1 1 2 2 1 2 2
( 2/4 - 1/4 + 1 ) / 2
62.5%

3 1 2 2 2 2 2 2 2 2

3 1 2 2 2 2 2 2 2 2
( 0/1 - 1/1 + 1 ) / 2
0%
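
The slides do not spell out what TP, FP and T stand for in this metric. The sketch below is one reading that reproduces both worked examples, and every part of it is an assumption: grade 1 is taken to be a good candidate and grade 3 a bad one, T is the number of grade-1 candidates in the list, and TP and FP count the 1s and 3s among the top T positions. The function name is hypothetical.

```python
def top_t_score(ranked_grades, good=1, bad=3):
    """(TP/T - FP/T + 1) / 2 under the assumed reading described above.

    T  = number of `good` grades in the whole list,
    TP = `good` grades within the top T positions,
    FP = `bad` grades within the top T positions.
    """
    t = sum(1 for grade in ranked_grades if grade == good)
    if t == 0:
        raise ValueError("the score is undefined without any good candidates")
    top = ranked_grades[:t]
    tp = sum(1 for grade in top if grade == good)
    fp = sum(1 for grade in top if grade == bad)
    return (tp / t - fp / t + 1) / 2


score_first = top_t_score([1, 3, 2, 1, 1, 2, 2, 1, 2, 2])
score_second = top_t_score([3, 1, 2, 2, 2, 2, 2, 2, 2, 2])
print(f"{score_first:.1%} {score_second:.1%}")
```

Adding 1 and halving maps the raw difference TP/T - FP/T from the range [-1, 1] onto [0, 1], so a ranking that puts only bad candidates on top scores 0%.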

1 3 2 1 1 2 2 1 2 2
Normalized MAP at points 5, 10, 15

1 3 2 1 1 2 2 1 2 2
MAP = (3.3/5 + 5.7/10) / 2
31.8%

1 1 1 1 2 2 2 2 2 3
best MAP=(4.3/5 + 5.7/10) / 2
32.8%

3 2 2 2 2 2 1 1 1 1
worst MAP=(1.3/5 + 5.7/10) / 2
29.8%

1 3 2 1 1 2 2 1 2 2
normalized MAP = (MAP - wMAP) / (bMAP - wMAP)
= (31.8% - 29.8%) / (32.8% - 29.8%) = 66.7%

normalized MAP = (MAP - wMAP) / (bMAP - wMAP)
66.7%
40% 30% 20% 10% 9% 8% 7% 6% 5% 4%

40% 30% 20% 10% 9% 8% 7% 6% 5% 4%
0.8 * f1 + 0.2 * AP

Inter-annotator agreement
How often the annotators make the same decision for the
same search result.

The experiment
● 4 annotators
● 60 randomly generated search results (varying in order, percentage and cut-off line)
● The search results were evenly distributed across Majio scores between 1 and 100
● Annotators had to score each search result between 1 (perfect) and 4 (horrible)
● 2 of the search results appeared twice, in different contexts
● At least 3 out of 4 annotators had to agree on a ranking's score for it to be accepted
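
The acceptance rule in the last bullet can be sketched as a simple majority check. The annotation data and names below are invented for illustration; the talk does not include the raw scores.

```python
from collections import Counter


def agreed_score(scores, min_agreeing=3):
    """Return the score shared by at least `min_agreeing` annotators, else None."""
    score, count = Counter(scores).most_common(1)[0]
    return score if count >= min_agreeing else None


# Hypothetical scores from 4 annotators (1 = perfect, 4 = horrible).
annotations = {
    "ranking_a": [1, 1, 1, 2],
    "ranking_b": [1, 2, 3, 4],
    "ranking_c": [4, 4, 4, 4],
    "ranking_d": [2, 2, 3, 3],
}

accepted = {}
for name, scores in annotations.items():
    score = agreed_score(scores)
    if score is not None:
        accepted[name] = score

print(f"agreement on {len(accepted)} out of {len(annotations)} rankings: {accepted}")
```

A chance-corrected statistic such as Fleiss' kappa is a common alternative when raw agreement counts need to be compared across experiments; the rule above only mirrors the threshold described in the list.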

The results
● Inter-annotator agreement on 32 out of 60 rankings
● Two groups of annotators emerged: strict (no 1s may fall behind) and useful (can you make do with the number of good candidates we have sent you)
● 2 out of 4 annotators gave different scores to the trap rankings (the search results that appeared twice)
● On the rankings with inter-annotator agreement the scoring was consistent, which gave threshold values for good and bad rankings.

Conclusions
● There are a lot of Information Retrieval metrics in the world (only a chosen few were shown here)
● None is perfect but some are useful
● You can craft a metric yourself but then you have to check how good a metric it is
● People don’t generally agree on things in the beginning. Experiment until there is good enough agreement.
