A department, somewhere in the EU, depends on a steady input of 3,000 new textual documents per day, 365 days a year. The documents come from 10 different sources, and each arrives pre-classified into a single category of a large taxonomy. The department is unhappy: the accuracy of the incoming classifications seems low. Even after putting in an additional 800% FTE throughout the year to manually repair or discard wrongly classified documents, the accuracy still lags behind their targets. NIRI was hired to conduct research and develop an accurate document classifier. The plan was to use NIRI's classifier to replace the unreliable classes that come with the documents, and thus solve the problem of low accuracy as well as reduce the high cost of the 800% FTE. In this talk we share our experiences: the classification approach used to meet the needs of our client, the challenges in demonstrating progress during the project, and the approach used for the acceptance validation of our classifier.
13. The flow
Business Context
The Challenge
The Solution
Effectiveness
  Laboratory measurements
  Impact estimation
Reality
Wrap up
14. The Solution: NIRI will build you a better classifier
(Pipeline diagram: Vacancy → Aggregator and Classifier → NIRI Classifier → Publish, 2000-4000 per day)
15. Really?
How accurate will it be?
How will it fit our process?
Really. We will (try to):
- Reduce manual effort
- Increase volume
- Improve final accuracy
16. But you need to give us training data
> 1M vacancies:
- Verified: 74%
- Not verified: 14%
- No class: 12%
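The split above (verified, not verified, no class) can be sketched as a simple partition of the corpus. This is a minimal sketch only; the `code` and `verified` fields are hypothetical names, not the client's actual schema.

```python
def partition_vacancies(vacancies):
    """Split a vacancy corpus by label status.

    Each vacancy is a dict with hypothetical fields:
      'code'     - the taxonomy class it arrived with (None if missing)
      'verified' - whether a human has confirmed that class
    Returns (verified, not_verified, no_class) lists.
    """
    verified = [v for v in vacancies if v.get("code") and v.get("verified")]
    not_verified = [v for v in vacancies if v.get("code") and not v.get("verified")]
    no_class = [v for v in vacancies if not v.get("code")]
    return verified, not_verified, no_class
```

Only the verified slice (74% of the corpus) carries labels trustworthy enough to train and test on, which is what drives the caveats discussed later.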
21. The flow (agenda recap)
22. Measuring accuracy in the laboratory
Corpus: Verified 74%, Not verified 14%, No class 12%
Vacancy → Classifier; each output is scored as Correct, Incorrect, or No class.
Protocol: 80% train / 20% test split, repeated 5 times (5-fold cross-validation).
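The 80/20 × 5 protocol above is plain 5-fold cross-validation. A minimal stdlib-only sketch, assuming `train_fn` and `predict_fn` are placeholders for whatever model the reader plugs in:

```python
import random

def five_fold_accuracy(examples, train_fn, predict_fn, k=5, seed=0):
    """Estimate classifier accuracy with k-fold cross-validation:
    each round holds out 1/k (here 20%) of the data for testing
    and trains on the remaining 80%, then the k scores are averaged.

    examples   - list of (input, label) pairs
    train_fn   - callable(train_pairs) -> model
    predict_fn - callable(model, input) -> predicted label
    """
    data = examples[:]
    random.Random(seed).shuffle(data)          # fixed seed: reproducible folds
    folds = [data[i::k] for i in range(k)]     # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        correct = sum(predict_fn(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k
```

For example, a classifier that always predicts the true label scores 1.0, and any real model slots in via `train_fn`/`predict_fn` without changing the evaluation loop.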
23. Measuring accuracy in the laboratory
ONE OF MANY LABORATORY MEASUREMENTS

                  Correct  Incorrect  No class
Corpus              74%      14%        12%
Classifier          78%      13%         9%
Classifier 100      80%      12%         8%
Classifier 1000     85%      10%         5%
24. Measuring accuracy in the laboratory
Input corpus (original labels): Verified 74%, Not verified 14%, No class 12%
Vacancy → Classifier output: Correct 78%, Incorrect 13%, No class 9%
But this is not reality:
- Biased train/test set
- Accuracy of the test set is unknown
- Inability to test against the remaining 26% (not verified + no class)
25. The flow (agenda recap)
34. How was it built?
Check & Repair, applying the 4-eye principle.
Pipeline: Vacancy → Classifier → Published
For each vacancy, reviewers saw the original code and the top 5 VC codes.
Every single classification was marked as either
Correct, Acceptable, or Wrong.
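Tallying the golden-set verdicts into the headline percentages is straightforward. A minimal sketch, assuming each vacancy ends up with one agreed verdict string (the example counts below are illustrative, not the project's actual figures):

```python
from collections import Counter

def golden_set_scores(verdicts):
    """Aggregate reviewer verdicts from the golden test set.

    verdicts - one 'correct' | 'acceptable' | 'wrong' string per
               classified vacancy, agreed under the 4-eye principle.
    Returns the strict accuracy and the relaxed (correct-or-acceptable)
    accuracy as fractions.
    """
    counts = Counter(verdicts)
    total = sum(counts.values())
    correct = counts["correct"] / total
    acceptable = counts["acceptable"] / total
    return {"correct": correct,
            "correct_or_acceptable": correct + acceptable}
```

Reporting both numbers matters: "Acceptable" classifications need no manual repair, so the relaxed score is the one that predicts saved effort.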
35. Results
GOLDEN TEST SET RESULTS

                      Correct   Correct + Acceptable
CURRENT                63.05%         65.98%
NIRI VC                73.91%         77.56%
CURRENT (HQ source)    72.06%         76.25%
NIRI VC (HQ source)    74.38%         78.69%

HQ source = the highest-quality source (the one used for training).
36. The flow (agenda recap)
37. Wrap up
Clean semantic data is, in real life, a myth. We are looking into
data cleansing approaches.
Measuring usefulness can be hard and expensive, but …
… it can, and must, be monitored after the system is deployed.
It changes over time. Continuous learning, where possible, is a great thing.
1) Implementing a state-of-the-art machine learning algorithm is one thing.
2) Making it useful is another.
3) Explaining that to the end user is the third.
NIRI is a very cool company to work with!
I hope you liked the story, and I thank you for your attention.