Text is one of the most widely used forms of communication and is ubiquitous on the Internet. Social networks like Facebook and Twitter mainly contain unstructured text; the same is true for content-driven websites.
For humans it is easy to grasp the meaning of text; for computers it is much more difficult. Used correctly, computers can help humans tremendously in structuring and classifying huge amounts of text. This "symbiosis" can help humans work more efficiently, reduce repetitive work and make use of the uncovered structure.
Our talk starts with visualizations that give us ideas for how to automatically classify texts. We then demonstrate that manual intervention is sometimes necessary and how it can serve as a basis for machine learning. We introduce the concept of classification (taxonomies), explain how text classification works with machine learning (TF/IDF, bag of words, etc.) and show how we can deterministically extend training sets.
All our examples use data which is openly available and already pre-categorized. This reduces the amount of manual work, gives us more opportunities for experimenting and helps significantly in classifying more complicated cases.
As software tools we use R, Apache Solr, D3.js, and several NLP and ML tools from the ASF.
Classifying Unstructured Text - A Hybrid Deterministic/ML Approach
1. Hamburg München Berlin Köln Leipzig
Classifying unstructured text
Stephanie Fischer
Christian Winkler
DataWorks Summit Munich
2017-04-05
2. Unstructured content is everywhere.
Most of it exists in a vacuum and cannot be compared.
Unstructured means hardly comparable.
Let's find an efficient way of comparing different texts with each other.
3. Today we will develop a method to make different texts about similar content comparable
Fake news? Real news? Who knows in these times? It seems like everything is just a question of point of view and getting the audience's attention. The focus of the media impacts people's opinions. But what is the focus of the different media?
Comparing news headlines from Reuters and Al Jazeera
4. Compare word frequency of news by visualizing the data
Al Jazeera: 94,309 headlines, 8.5 years
Reuters World News: 163,919 headlines, 9 years
Visualizations created with Apache Solr and D3.js, see our talk from Apache Big Data Vancouver 2016 here: https://bigdata.mgm-tp.com/apache/
Result:
They look similar!
Step 1
5. [Bar charts: number of headlines per pre-defined category.
Reuters categories: World, US, Politics, Top News, Business News, Markets, Technology, Deals, Personal Finance, Business, Economy, Green Business, Bonds, Sports, Small Business.
Al Jazeera categories: news-middleeast, news-americas, news-europe, news-asia-pacific, news-africa, news-asia, news, indepth-opinion, indepth-features, indepth-inpictures, focus, blogs-americas, indepth-spotlight, blogs-asia, indepth-interactive.]
6. Use what's already there: categories
Compare & select pre-defined categories of Al Jazeera & Reuters
Step 2
7. We want: Reuters' text categorized according to Al Jazeera's logic.
We have: Al Jazeera's geo-localized categories.
We want: Al Jazeera's text classified according to Reuters' logic.
We have: Reuters' topics.
Transfer useful categories from one source to the other in order to
make them comparable
Step 3
8. Examples of category-specific keywords extracted from Al Jazeera news
There are specific keywords for Al Jazeera's geo-categories:
Europe (23 keywords): Ukraine, Spain, …; Paris, London, …; Merkel, Putin, …
Asia-Pacific (23 keywords): Taiwan, Thailand, …; Beijing, Bangkok, …; Thaksin, Typhoon, Kim
Americas (23 keywords): Cuba, Bolivia, …; Guantanamo, …; BP, Castro, Chavez, …
Africa (32 keywords): Kenya, Somalia, …; Darfur, …; Mandela, Mugabe, …
Middle East (25 keywords): Syria, Israel, …; Baghdad, Cairo, …; Mubarak, Olmert, …
Asia (23 keywords): Pakistan, Kyrgyzstan, …; Nepal, …; Musharraf, Karzai, …
*90% precision
Step 3.1
9. Naïve selection of category-specific keywords for deterministic classification leads to wrong results
"President Trump's apartment in New York": cities like "York": 1 (UK), names like "Trump": 1 (US). Result: europe + americas
"Yorkshires are world's most popular dog": cities like "York": 1 (UK). Result: europe
"Theresa May's press conference in York": cities like "York": 1 (UK), names like "Theresa May": 1 (UK). Result: europe
Step 3.2
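The failure mode above can be reproduced with a minimal sketch of the naive approach. The keyword lists below are invented and heavily shortened (the talk's real lists have 23-32 entries per geo-category):

```python
# Naive deterministic geo-classification by keyword matching, a sketch with
# made-up, shortened keyword lists (the talk's real lists are much longer).
KEYWORDS = {
    "europe": {"york", "london", "merkel", "theresa may"},
    "americas": {"trump", "cuba", "castro"},
}

def classify(headline):
    text = headline.lower()
    # a category fires as soon as any of its keywords appears in the text
    return sorted({cat for cat, words in KEYWORDS.items()
                   if any(w in text for w in words)})

print(classify("President Trump's apartment in New York"))  # both categories fire
print(classify("Yorkshires are world's most popular dog"))  # "york" matches inside "Yorkshires"
```

The second call shows why naive substring matching is flawed: "York" fires inside "Yorkshires", and a headline about a dog breed lands in the europe category.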
Next challenge: Categorize Reuters data with a deterministic scheme
12. Summary: what we have achieved so far (Steps 1-3)
We classified Reuters news by applying category-specific keywords for each geo-category from Al Jazeera.
Find rules for deterministic classification: find category-specific keywords
Apply the rules: categorize the data set with the keywords
Evaluate the results: done correctly? Where are the gaps?
Iterate & further develop the rules: more rules, synonyms, …
13. Transfer useful categories from one source to the other in order to make them comparable: Reuters' topics to Al Jazeera's text
We want: Al Jazeera's text classified according to Reuters' logic.
We have: Reuters' topics.
Step 4
14. Visualize word frequency within topics in order to get a first feeling for the content
Step 4
Technology | Business
15. Extract category-specific keywords within Al Jazeera's topics Business & Technology
Step 4.1
Result within section Business: not a single specific keyword → fail
Result within section Technology: only 9 category-specific keywords → fail (not enough)
If context is crucial and the text structure is more complex (e.g. multi-word expressions), the deterministic approach is flawed.
Let's try ML!
16. Use ML to categorize Al Jazeera's headlines with Reuters' topics & check the result
Step 4.2
Al Jazeera
Reuters
17. Summary: what we have achieved so far (Step 4)
We classified Al Jazeera news by training an ML algorithm with already categorized training sets from Reuters.
Find a source with interesting categories: a relevant topic and enough data examples
Extract a training set from the source: categories already classified (no manual work)
Train the ML algorithm: the training set needs to be representative
Classify new text with the trained ML algorithm: be careful with new words and changed TF/IDF
Evaluate a data sample: evaluate the classification result
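The steps above can be sketched in Python with scikit-learn and a handful of invented headlines (the talk itself used R, Apache Solr and ASF tooling, not this exact stack):

```python
# Train on pre-categorized headlines (Reuters-style topics), then classify
# unseen headlines. The data here is invented, not the talk's real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "Shares rally as central bank cuts rates",
    "Quarterly profits beat market expectations",
    "New smartphone chip doubles battery life",
    "Startup releases open source software library",
]
train_labels = ["Business", "Business", "Technology", "Technology"]

# Bag of words + TF/IDF turns text into the fixed numeric vectors the SVM needs
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

# Beware: words never seen during training are simply ignored at prediction time
print(model.predict(["New chip doubles performance",
                     "Central bank cuts interest rates"]))
```

The last comment is exactly the caveat on the slide: the TF/IDF vocabulary is fixed at training time, so new words in incoming text contribute nothing to the prediction.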
18. How does the ML algorithm work? Example: Support Vector Machine
Machine learning is linear algebra: fixed numeric values are necessary.
Categories are already discrete; text is complicated: a model is necessary, and there are different alternatives.
Many different learning models, e.g.:
Support Vector Machines (popular)
Neural networks
Random forests
Decision trees
1. Learn logic from the coding set: the SVM learns how to separate blue points from grey points
2. Classification of the total data set: the SVM applies its knowledge to the grey points, unknown so far
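The two steps on the slide can be sketched with a toy 2-D example (made-up coordinates and class names, using scikit-learn's SVC rather than the talk's actual setup):

```python
# Step 1: learn a separating line from labeled points.
# Step 2: classify unknown points. All values here are invented.
from sklearn.svm import SVC

labeled_points = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8]]
labels = ["blue", "blue", "grey", "grey"]

svm = SVC(kernel="linear")
svm.fit(labeled_points, labels)                # 1. learn from the coded set

print(svm.predict([[0.1, 0.0], [0.95, 0.9]]))  # 2. classify unknown points
```

Text classification works the same way once each document has been turned into a fixed-length numeric vector; the only difference is that the points live in a space with thousands of word dimensions instead of two.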
19. Let's take a step back and find out:
How can I MEASURE the classification results?
20. Quantify classification results with the metrics
precision & recall
Step 5
Example: we want to find all Reuters news which belong to the category Europe.
[Diagram: Europe headlines (France, Poland, Spain) vs. the headline "More and more Italian restaurants in China", which matches the keyword "Italian" but is not about Europe; recall measures how many Europe stories we find, precision how many of the found stories are really about Europe.]
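The two metrics are easy to compute once the relevant and retrieved sets are known. A sketch with invented headline sets for the Europe example:

```python
# Precision & recall for the Europe example, with made-up headline sets.
relevant = {"France elects president", "Poland passes reform", "Spain votes"}
retrieved = {"France elects president", "Spain votes",
             "More and more Italian restaurants in China"}

true_positives = relevant & retrieved
precision = len(true_positives) / len(retrieved)  # found items that are correct
recall = len(true_positives) / len(relevant)      # correct items that were found

print(round(precision, 2), round(recall, 2))
```

Here the Italian-restaurants headline is a false positive (it hurts precision), and the missed Poland headline is a false negative (it hurts recall).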
21. Back to reality:
How to handle classification projects with customer-specific categories
22. Typical project: classification of loads of data with non-standard categories
Step 6
Step 1: Find suitable categories (functional discussion with the project team; topic modelling)
Step 2: Verify categories (well-defined and reproducible, not necessarily mutually exclusive; ideally 100% coverage)
Step 3: Find pre-categorized data (saves a lot of work, but not always possible)
Step 4: Manual classification of the training set (very expensive; extensive QA necessary; a correct training set has a high impact on the quality of the final results)
Step 5: Training, QA and optimization (try different algorithms; cross-validation; iterate and improve)
Step 6: Classification (classify; manual QA)
23. Big Data: select a training set, e.g. 10,000 from 1,500,000 balls
Step 6.1
Challenge: choose the best training set for your problem
Image: Ursus Wehrli
24. Preparation of the training set
Step 6.2
1. Good situation: the manually classified data set contains all the words of the complete data set.
2. Not so good situation: the manually classified data contains only a fraction of all the words in the complete data set.
Select documents with the highest word variability:
– Word heterogeneity = number of distinct words in all documents (minus stopwords)
– Long-tail distribution (many, many words used infrequently)
– Even distribution
Complicated: a knapsack-like problem; use an approximate approach (like a genetic algorithm)
Crucial for all following tasks
[Figure: word grids (w01-w99) comparing word heterogeneity in the training set vs. the complete data set; common distribution vs. dictionary distribution.]
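The long-tail claim is easy to verify by counting word frequencies. A sketch with a few invented headlines:

```python
# Count word frequencies across headlines: a few words occur often, but most
# occur exactly once (the long tail). The headlines are invented examples.
from collections import Counter

headlines = ["EU summit opens", "EU leaders meet", "Earthquake hits Nepal",
             "Floods hit Spain", "EU summit ends"]
freq = Counter(word.lower() for h in headlines for word in h.split())

singletons = [w for w, c in freq.items() if c == 1]
print(freq["eu"], len(singletons))  # "eu" dominates; most words appear once
```

Even in this tiny corpus, 10 of the 12 distinct words occur exactly once, which is why covering the full vocabulary with a small training set is hard.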
25. Intelligently choose the training set
Step 6.3
Final data set available vs. final data set not available
Optimize for high variability and high usage: select this, don't select that
Choose the training set so as to create maximal word overlap with the complete data set:
WM = { words in training set }
WC = { words in complete set }
Find the maximum of |WC ∩ WM| = |WM|
Improved approach: choose the training set to minimize headlines with unknown words in the complete data set:
Find the minimum of |WC \ WM|
More complicated, but worth it
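Since the exact selection is knapsack-like, a greedy approximation is a common shortcut. Below is a sketch with an invented helper and data; it is not the talk's genetic-algorithm approach, just the simplest way to chase the same objective:

```python
# Greedily pick headlines that add the most not-yet-covered words, as an
# approximation of maximizing the word overlap |WC ∩ WM|.
def select_training_set(headlines, budget):
    covered, chosen = set(), []
    for _ in range(budget):
        # the headline contributing the most words we have not seen yet
        best = max(headlines,
                   key=lambda h: len(set(h.lower().split()) - covered))
        gain = set(best.lower().split()) - covered
        if not gain:          # nothing new left to cover
            break
        covered |= gain
        chosen.append(best)
    return chosen, covered

headlines = ["EU summit opens in Brussels", "EU summit closes",
             "Earthquake hits Nepal", "Brussels hosts EU summit"]
chosen, covered = select_training_set(headlines, budget=2)
print(chosen)
```

With a budget of two, the greedy pick skips the near-duplicate EU headlines and takes the Nepal one instead, because it contributes three entirely new words.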
27. Summary: our learnings
Focus on cost-efficiency of your classification result:
Get more pre-categorized data via categories from other sources, NLP (e.g. FB pre-trained word vectors) & semantic extraction
Choose no more documents than necessary for manual training-set classification
Have the courage to admit when it's best to finish: don't get lost in the long tail
Focus on high quality of your classification result:
Choose the right training set for ML
Choose the best algorithm for your specific problem
Optimize the chosen algorithm
Image: Ursus Wehrli
28. Classifying unstructured text
Dr. Christian Winkler
Enterprise Architect
Big Data, Data Science
mgm technology partners
https://www.linkedin.com/in/drchristianwinkler/
Stephanie Fischer
Product Owner Text Analytics
mgm consulting partners
https://www.linkedin.com/in/steffifischer/
Image: Ursus Wehrli
Editor's notes
Steffi
Welcome to our talk about “Classifying unstructured text with deterministic and ML approaches“!
We transferred the pre-categorized scheme from Al Jazeera to Reuters
Wow, now the data is comparable!
General procedure: Use pre-categorized data & transfer logic deterministically to other texts wherever possible
Of 4,235 documents in the category "Technology", 2,369 can be found with 25 category-specific keywords; only about half. Recall ca. 50%, precision only 75%
But: How can you be sure? How can you measure result?
World: 91% precision, 90% recall
US: 88% precision, 80% recall
Take Reuters categories TECHNOLOGY and BUSINESS as training set
Categorize Al Jazeera
Explanation of how ML works
Transition: quantity is not everything. What about the quality of the content discussed? Next slide: sentiment analysis.
Steffi
I did all the QA. I tried to verify 200 documents and 80% of them were wrong! We can‘t give this to our customer. How can this happen? What parameters can we adjust to improve the result?
Christian
Let's assume you have headlines with 5 common words or 3 random strings:
U.S. election takes place November
adfpoi4r afdafp23 sad234
Italian earthquake destroys many villages
4234asdas oirutmbs rieo234
Then we get the highest variability by selecting 8,000 headlines with only random words, but that is of no use.
We must select headlines with the most common words (but each only once), as they give us the highest chance of finding them again.
Christian
Steffi
Our talk is designed around the classification of real data: we took one million headlines from the online news archive of the British newspaper the Telegraph.
Before diving into the actual text classification, we will do some data preparation: text statistics and finding relevant categories.
The main part of our talk will be a detailed description of the text classification, both from a functional and a technical perspective.
We will finish with our top 10 lessons learned and give you some ideas on how you can use the knowledge from this talk for your own projects.