Text is one of the most widely used forms of communication and is ubiquitous on the Internet. Social networks like Facebook and Twitter mainly contain unstructured text; the same is true for content-driven websites.
For humans it is easy to grasp the meaning of text; for computers it is much more difficult. Used correctly, computers can help humans tremendously in structuring and classifying huge amounts of text. This "symbiosis" can help humans work more efficiently, reduce repetitive work and make use of the uncovered structure.
Our talk starts with visualizations that give us ideas for how to automatically classify texts. We then demonstrate that manual intervention is sometimes necessary and how it can serve as a basis for machine learning. We introduce the concept of classification (taxonomies), explain how text classification works with machine learning (TF/IDF, bag of words, etc.) and show how we can deterministically extend training sets.
All our examples use data which is openly available and already pre-categorized. This reduces the amount of manual work, gives us more opportunities for experimenting and helps significantly in classifying more complicated cases.
As software tools we use R, Apache Solr, D3.js, and several NLP and ML tools from the ASF.
Classifying Unstructured Text - A Hybrid Deterministic/ML Approach
1. Hamburg München Berlin Köln Leipzig
Classifying unstructured text
Stephanie Fischer
Christian Winkler
DataWorks Summit Munich
2017-04-05
2. Unstructured content is everywhere.
Most of it exists in a vacuum and cannot be compared.
Unstructured means hardly comparable.
Let's find an efficient way of comparing different texts with each other.
3. Today we will develop a method to make different texts about similar content comparable
Fake news? Real news? Who knows in these times? It seems like everything is just a question of point of view and getting the audience's attention. The focus of the media impacts people's opinions. But what is the focus of the different media?
Comparing news headlines from Reuters and Al Jazeera
4. Compare word frequency of news by visualizing the data
Al Jazeera: 94,309 headlines, 8.5 years
Reuters World News: 163,919 headlines, 9 years
Visualizations created with Apache Solr and D3.js, see our talk from Apache Big Data Vancouver 2016 here: https://bigdata.mgm-tp.com/apache/
Result:
They look similar!
Step 1
5. [Bar charts: number of headlines per pre-defined category.
Reuters categories: World, US, Politics, Top News, Business News, Markets, Technology, Deals, Personal Finance, Business, Economy, Green Business, Bonds, Sports, Small Business.
Al Jazeera categories: news-middleeast, news-americas, news-europe, news-asia-pacific, news-africa, news-asia, news, indepth-opinion, indepth-features, indepth-inpictures, focus, blogs-americas, indepth-spotlight, blogs-asia, indepth-interactive.]
6. Use what's already there: categories
Compare & select pre-defined categories of Al Jazeera & Reuters
Step 2
7. We want: Reuters' text categorized according to Al Jazeera's logic.
We have: Al Jazeera's geo-localized categories.
We want: Al Jazeera's text classified according to Reuters' logic.
We have: Reuters' topics.
Transfer useful categories from one source to the other in order to
make them comparable
Step 3
8. Examples of category-specific keywords extracted from Al Jazeera news
There are specific keywords for Al Jazeera's geo-categories:
Europe (23 keywords): Ukraine, Spain, …; Paris, London, …; Merkel, Putin, …
Asia-Pacific (23 keywords): Taiwan, Thailand, …; Beijing, Bangkok, …; Thaksin, Typhoon, Kim
Americas (23 keywords): Cuba, Bolivia, …; Guantanamo, …; BP, Castro, Chavez, …
Africa (32 keywords): Kenya, Somalia, …; Darfur, …; Mandela, Mugabe, …
Middle East (25 keywords): Syria, Israel, …; Baghdad, Cairo, …; Mubarak, Olmert, …
Asia (23 keywords): Pakistan, Kyrgyzstan, …; Nepal, …; Musharraf, Karzai, …
*90% precision
Step 3.1
9. Naïve selection of category-specific keywords for deterministic classification leads to wrong results
"President Trump's apartment in New York": cities like "York": 1 (UK), names like "Trump": 1 (US). Result: europe + americas
"Yorkshires are world's most popular dog": cities like "York": 1 (UK). Result: europe
"Theresa May's press conference in York": cities like "York": 1 (UK), names like "Theresa May": 1 (UK). Result: europe
Step 3.2
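The failure mode above can be reproduced with a minimal sketch of the naive approach. The keyword lists below are invented and heavily shortened (the talk's real lists have 23-32 entries per geo-category):

```python
# Naive deterministic geo-classification by keyword matching, a sketch with
# made-up, shortened keyword lists (the talk's real lists are much longer).
KEYWORDS = {
    "europe": {"york", "london", "merkel", "theresa may"},
    "americas": {"trump", "cuba", "castro"},
}

def classify(headline):
    text = headline.lower()
    # a category fires as soon as any of its keywords appears in the text
    return sorted({cat for cat, words in KEYWORDS.items()
                   if any(w in text for w in words)})

print(classify("President Trump's apartment in New York"))  # both categories fire
print(classify("Yorkshires are world's most popular dog"))  # "york" matches inside "Yorkshires"
```

The second call shows why naive substring matching is flawed: "York" fires inside "Yorkshires", and a headline about a dog breed lands in the europe category.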
Next challenge: Categorize Reuters data with a deterministic scheme
12. Summary: what we have achieved so far (Steps 1-3)
We classified Reuters news by applying category-specific keywords for each geo-category from Al Jazeera.
Find rules for deterministic classification: find category-specific keywords
Apply the rules: categorize the data set with the keywords
Evaluate the results: done correctly? Where are the gaps?
Iterate & further develop the rules: more rules, synonyms, …
13. Transfer useful categories from one source to the other in order to make them comparable: Reuters' topics to Al Jazeera's text
We want: Al Jazeera's text classified according to Reuters' logic.
We have: Reuters' topics.
Step 4
14. Visualize word frequency within topics in order to get a first feeling for the content
Step 4
Technology | Business
15. Extract category-specific keywords within Al Jazeera's topics Business & Technology
Step 4.1
Result within section Business: not a single specific keyword → fail
Result within section Technology: only 9 category-specific keywords → fail (not enough)
If context is crucial and the text structure is more complex (e.g. multi-word expressions), the deterministic approach is flawed.
Let's try ML!
16. Use ML to categorize Al Jazeera's headlines with Reuters' topics & check the result
Step 4.2
Al Jazeera
Reuters
17. Summary: what we have achieved so far (Step 4)
We classified Al Jazeera news by training an ML algorithm with already categorized training sets from Reuters.
Find a source with interesting categories: a relevant topic and enough data examples
Extract a training set from the source: categories already classified (no manual work)
Train the ML algorithm: the training set needs to be representative
Classify new text with the trained ML algorithm: be careful with new words and changed TF/IDF
Evaluate a data sample: evaluate the classification result
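The steps above can be sketched in Python with scikit-learn and a handful of invented headlines (the talk itself used R, Apache Solr and ASF tooling, not this exact stack):

```python
# Train on pre-categorized headlines (Reuters-style topics), then classify
# unseen headlines. The data here is invented, not the talk's real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "Shares rally as central bank cuts rates",
    "Quarterly profits beat market expectations",
    "New smartphone chip doubles battery life",
    "Startup releases open source software library",
]
train_labels = ["Business", "Business", "Technology", "Technology"]

# Bag of words + TF/IDF turns text into the fixed numeric vectors the SVM needs
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

# Beware: words never seen during training are simply ignored at prediction time
print(model.predict(["New chip doubles performance",
                     "Central bank cuts interest rates"]))
```

The last comment is exactly the caveat on the slide: the TF/IDF vocabulary is fixed at training time, so new words in incoming text contribute nothing to the prediction.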
18. How does the ML algorithm work? Example: Support Vector Machine
Machine learning is linear algebra: fixed numeric values are necessary.
Categories are already discrete; text is complicated: a model is necessary, and there are different alternatives.
Many different learning models, e.g.:
Support Vector Machines (popular)
Neural networks
Random forests
Decision trees
1. Learn logic from the coding set: the SVM learns how to separate blue points from grey points
2. Classification of the total data set: the SVM applies its knowledge to the grey points, unknown so far
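The two steps on the slide can be sketched with a toy 2-D example (made-up coordinates and class names, using scikit-learn's SVC rather than the talk's actual setup):

```python
# Step 1: learn a separating line from labeled points.
# Step 2: classify unknown points. All values here are invented.
from sklearn.svm import SVC

labeled_points = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8]]
labels = ["blue", "blue", "grey", "grey"]

svm = SVC(kernel="linear")
svm.fit(labeled_points, labels)                # 1. learn from the coded set

print(svm.predict([[0.1, 0.0], [0.95, 0.9]]))  # 2. classify unknown points
```

Text classification works the same way once each document has been turned into a fixed-length numeric vector; the only difference is that the points live in a space with thousands of word dimensions instead of two.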
19. Let's take a step back and find out:
How can I MEASURE the classification results?
20. Quantify classification results with the metrics
precision & recall
Step 5
Example: we want to find all Reuters news which belong to the category Europe.
[Diagram: Europe headlines (France, Poland, Spain) vs. the headline "More and more Italian restaurants in China", which matches the keyword "Italian" but is not about Europe; recall measures how many Europe stories we find, precision how many of the found stories are really about Europe.]
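The two metrics are easy to compute once the relevant and retrieved sets are known. A sketch with invented headline sets for the Europe example:

```python
# Precision & recall for the Europe example, with made-up headline sets.
relevant = {"France elects president", "Poland passes reform", "Spain votes"}
retrieved = {"France elects president", "Spain votes",
             "More and more Italian restaurants in China"}

true_positives = relevant & retrieved
precision = len(true_positives) / len(retrieved)  # found items that are correct
recall = len(true_positives) / len(relevant)      # correct items that were found

print(round(precision, 2), round(recall, 2))
```

Here the Italian-restaurants headline is a false positive (it hurts precision), and the missed Poland headline is a false negative (it hurts recall).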
21. Back to reality:
How to handle classification projects with customer-specific categories
22. Typical project: classification of loads of data with non-standard categories
Step 6
Step 1: Find suitable categories (functional discussion with the project team; topic modelling)
Step 2: Verify categories (well-defined and reproducible, not necessarily mutually exclusive; ideally 100% coverage)
Step 3: Find pre-categorized data (saves a lot of work, but not always possible)
Step 4: Manual classification of the training set (very expensive; extensive QA necessary; a correct training set has a high impact on the quality of the final results)
Step 5: Training, QA and optimization (try different algorithms; cross-validation; iterate and improve)
Step 6: Classification (classify; manual QA)
23. Big Data: select a training set, e.g. 10,000 from 1,500,000 balls
Step 6.1
Challenge: choose the best training set for your problem
Image: Ursus Wehrli
24. Preparation of the training set
Step 6.2
1. Good situation: the manually classified data set contains all the words of the complete data set.
2. Not so good situation: the manually classified data contains only a fraction of all the words in the complete data set.
Select documents with the highest word variability:
– Word heterogeneity = number of distinct words in all documents (minus stopwords)
– Long-tail distribution (many, many words used infrequently)
– Even distribution
Complicated: a knapsack-like problem; use an approximate approach (like a genetic algorithm)
Crucial for all following tasks
[Figure: word grids (w01-w99) comparing word heterogeneity in the training set vs. the complete data set; common distribution vs. dictionary distribution.]
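The long-tail claim is easy to verify by counting word frequencies. A sketch with a few invented headlines:

```python
# Count word frequencies across headlines: a few words occur often, but most
# occur exactly once (the long tail). The headlines are invented examples.
from collections import Counter

headlines = ["EU summit opens", "EU leaders meet", "Earthquake hits Nepal",
             "Floods hit Spain", "EU summit ends"]
freq = Counter(word.lower() for h in headlines for word in h.split())

singletons = [w for w, c in freq.items() if c == 1]
print(freq["eu"], len(singletons))  # "eu" dominates; most words appear once
```

Even in this tiny corpus, 10 of the 12 distinct words occur exactly once, which is why covering the full vocabulary with a small training set is hard.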
25. Intelligently choose the training set
Step 6.3
Final data set available vs. final data set not available
Optimize for high variability and high usage: select this, don't select that
Choose the training set so as to create maximal word overlap with the complete data set:
WM = { words in training set }
WC = { words in complete set }
Find the maximum of |WC ∩ WM| = |WM|
Improved approach: choose the training set to minimize headlines with unknown words in the complete data set:
Find the minimum of |WC \ WM|
More complicated, but worth it
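Since the exact selection is knapsack-like, a greedy approximation is a common shortcut. Below is a sketch with an invented helper and data; it is not the talk's genetic-algorithm approach, just the simplest way to chase the same objective:

```python
# Greedily pick headlines that add the most not-yet-covered words, as an
# approximation of maximizing the word overlap |WC ∩ WM|.
def select_training_set(headlines, budget):
    covered, chosen = set(), []
    for _ in range(budget):
        # the headline contributing the most words we have not seen yet
        best = max(headlines,
                   key=lambda h: len(set(h.lower().split()) - covered))
        gain = set(best.lower().split()) - covered
        if not gain:          # nothing new left to cover
            break
        covered |= gain
        chosen.append(best)
    return chosen, covered

headlines = ["EU summit opens in Brussels", "EU summit closes",
             "Earthquake hits Nepal", "Brussels hosts EU summit"]
chosen, covered = select_training_set(headlines, budget=2)
print(chosen)
```

With a budget of two, the greedy pick skips the near-duplicate EU headlines and takes the Nepal one instead, because it contributes three entirely new words.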
27. Summary: our learnings
Focus on cost-efficiency of your classification result:
Get more pre-categorized data via categories from other sources, NLP (e.g. FB pre-trained word vectors) & semantic extraction
Choose no more documents than necessary for manual training-set classification
Have the courage to admit when it's best to finish: don't get lost in the long tail
Focus on high quality of your classification result:
Choose the right training set for ML
Choose the best algorithm for your specific problem
Optimize the chosen algorithm
Image: Ursus Wehrli
28. Classifying unstructured text
Dr. Christian Winkler
Enterprise Architect
Big Data, Data Science
mgm technology partners
https://www.linkedin.com/in/drchristianwinkler/
Stephanie Fischer
Product Owner Text Analytics
mgm consulting partners
https://www.linkedin.com/in/steffifischer/
Image: Ursus Wehrli
Editor's notes
Steffi
Welcome to our talk about “Classifying unstructured text with deterministic and ML approaches“!
We transferred the pre-categorized scheme from Al Jazeera to Reuters
Wow, now the data is comparable!
General procedure: Use pre-categorized data & transfer logic deterministically to other texts wherever possible
Of 4,235 documents in the category "Technology", 2,369 can be found with 25 category-specific keywords; only about half. Recall ca. 50%, precision only 75%
But: How can you be sure? How can you measure result?
World: 91% precision, 90% recall
US: 88% precision, 80% recall
Take Reuters categories TECHNOLOGY and BUSINESS as training set
Categorize Al Jazeera
Explanation of how ML works
Transition: quantity is not everything. What about the quality of the content discussed? Next slide: sentiment analysis.
Steffi
I did all the QA. I tried to verify 200 documents and 80% of them were wrong! We can‘t give this to our customer. How can this happen? What parameters can we adjust to improve the result?
Christian
Let's assume you have headlines with 5 common words or 3 random strings:
U.S. election takes place November
adfpoi4r afdafp23 sad234
Italian earthquake destroys many villages
4234asdas oirutmbs rieo234
Then we get the highest variability by selecting 8,000 headlines with only random words, but that is of no use.
We must select headlines with the most common words (but each only once), as they give us the highest chance of finding them again.
Christian
Steffi
Our talk is designed around the classification of real data: we took one million headlines from the online news archive of the British newspaper the Telegraph.
Before diving into the actual text classification, we will do some data preparation: text statistics and finding relevant categories.
The main part of our talk will be a detailed description of the text classification, both from a functional and a technical perspective.
We will finish with our top 10 lessons learned and give you some ideas on how you can use the knowledge from this talk for your own projects.