Hamburg München Berlin Köln Leipzig
Classifying unstructured text
Stephanie Fischer
Christian Winkler
DataWorks Summit Munich
2017-04-05
Unstructured content is everywhere.
Most of it exists in a vacuum and
cannot be compared with each other.
Unstructured means hardly comparable.
Let's find an efficient way of comparing
different texts with each other.
Today we will develop a method for making different texts about similar
content comparable.
Fake news? Real news? Who knows in these times? It seems like everything is just
a question of point of view and getting the audience's attention. The focus of the
media impacts people's opinions. But what's the focus of the different media?
Comparing news headlines
from Reuters and Al Jazeera
Compare word frequency of news by visualizing its data
Al Jazeera: 94,309 headlines over 8.5 years
Reuters World News: 163,919 headlines over 9 years
Visualizations created with Apache Solr and D3.js, see our talk from Apache Big Data Vancouver 2016 here: https://bigdata.mgm-tp.com/apache/
Result:
They look similar!
Step 1
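The word-frequency comparison behind Step 1 can be sketched in a few lines. This is a minimal illustration only: the talk used Apache Solr and D3.js, whereas the stopword list, the sample headlines and the function name `word_frequencies` below are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stopword list and sample headlines, for illustration only.
STOPWORDS = {"the", "in", "of", "to", "a", "and", "on", "for"}

def word_frequencies(headlines):
    """Count lower-cased words across headlines, skipping stopwords."""
    counts = Counter()
    for headline in headlines:
        for word in re.findall(r"[a-z]+", headline.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

source_a = ["Talks on Syria resume in Geneva", "Syria ceasefire holds"]
source_b = ["Markets rally on Syria talks", "Oil prices fall"]

freq_a = word_frequencies(source_a)
freq_b = word_frequencies(source_b)

# Words used by both sources are the starting point for a comparison.
shared = sorted(set(freq_a) & set(freq_b))
print(freq_a.most_common(3))
print(shared)
```

Comparing the two counters shows which words both sources use and how often, which is the information a word cloud renders visually.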
[Bar chart, Reuters: headlines per category, axis 0 to 80,000. Categories: World, US, Politics, Top News, Business News, Markets, Technology, Deals, Personal Finance, Business, Economy, Green Business, Bonds, Sports, Small Business]
[Bar chart, Al Jazeera: headlines per category, axis 0 to 20,000. Categories: news-middleeast, news-americas, news-europe, news-asia-pacific, news-africa, news-asia, news, indepth-opinion, indepth-features, indepth-inpictures, focus, blogs-americas, indepth-spotlight, blogs-asia, indepth-interactive]
Use what's already there: Categories
Compare & select pre-defined categories of Al Jazeera & Reuters
Step 2
Extracting: news-middleeast, news-americas, news-europe, news-asia-pacific, news-africa, Technology, Business
We have: Al Jazeera's geo-localized categories
We want: Reuters' text categorized according to Al Jazeera's logic
We have: Reuters' topics
We want: Al Jazeera's text classified according to Reuters' logic
Transfer useful categories from one source to the other in order to
make them comparable
Step 3
Examples for category-specific keywords extracted from Al Jazeera news
There are specific keywords for Al Jazeera's geo-categories*
Europe (23 keywords): Ukraine, Spain, …; Paris, London, …; Merkel, Putin, …
Asia-Pacific (23 keywords): Taiwan, Thailand, …; Beijing, Bangkok, …; Thaksin, Typhoon, Kim
Americas (23 keywords): Cuba, Bolivia, …; Guantanamo, …; bp, Castro, Chavez, …
Africa (32 keywords): Kenya, Somalia, …; Darfur, …; Mandela, Mugabe, …
Middle-East (25 keywords): Syria, Israel, …; Baghdad, Cairo, …; Mubarak, Olmert, …
Asia (23 keywords): Pakistan, Kyrgyzstan, …; Nepal, …; Musharraf, Karzai, …
*90% precision
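One way to arrive at such keyword lists is sketched below. The slide does not say how the 90% figure was computed, so this assumes it means: a word is a keyword for a category if at least 90% of the headlines containing it belong to that category. The function name, the `min_count` threshold and the sample headlines are hypothetical.

```python
import re
from collections import Counter, defaultdict

def category_keywords(labelled_headlines, min_precision=0.9, min_count=2):
    """Return {category: sorted keywords}.

    Assumption: a word is a keyword if at least `min_precision` of the
    headlines containing it carry the same category, and it occurs in at
    least `min_count` headlines.
    """
    per_word = defaultdict(Counter)  # word -> Counter of categories
    for category, headline in labelled_headlines:
        for word in set(re.findall(r"[a-z]+", headline.lower())):
            per_word[word][category] += 1

    keywords = defaultdict(list)
    for word, categories in per_word.items():
        total = sum(categories.values())
        category, hits = categories.most_common(1)[0]
        if total >= min_count and hits / total >= min_precision:
            keywords[category].append(word)
    return {category: sorted(ws) for category, ws in keywords.items()}

# Hypothetical labelled headlines.
sample = [
    ("europe", "Merkel meets Putin in Paris"),
    ("europe", "Paris talks on Ukraine"),
    ("europe", "Ukraine talks stall"),
    ("africa", "Kenya election talks"),
    ("africa", "Kenya and Somalia sign deal"),
]
print(category_keywords(sample))
```

In this toy data "talks" occurs in both categories and is therefore rejected, while "paris", "ukraine" and "kenya" each occur in one category only.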
Step 3.1
Naïve selection of category-specific keywords for deterministic
classification leads to wrong results
Headline: "President Trump's apartment in New York"
Matches: cities like "York" (UK): 1, names like "Trump" (US): 1
Result: europe + americas
Headline: "Yorkshires are world's most popular dog"
Matches: cities like "York" (UK): 1
Result: europe
Headline: "Theresa May's press conference in York"
Matches: cities like "York" (UK): 1, names like "Theresa May" (UK): 1
Result: europe
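The three examples can be reproduced with a deliberately naive matcher. This is a sketch under the assumption that keywords are matched as plain substrings, which is what the "York" in "New York" and "Yorkshires" results suggest; the keyword table is reduced to the three entries shown on the slide.

```python
# Hypothetical keyword table, reduced to the entries used on the slide.
KEYWORDS = {
    "York": "europe",         # city (UK)
    "Theresa May": "europe",  # name (UK)
    "Trump": "americas",      # name (US)
}

def classify_naive(headline):
    """Return the sorted categories whose keywords occur as substrings."""
    return sorted({cat for kw, cat in KEYWORDS.items() if kw in headline})

examples = [
    "President Trump's apartment in New York",
    "Yorkshires are world's most popular dog",
    "Theresa May's press conference in York",
]
for text in examples:
    print(text, "->", classify_naive(text))
```

Only the third headline is classified correctly. Matching whole words would fix "Yorkshires", but "New York" shows that even exact word matches need context.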
Step 3.2
Next challenge: Categorize Reuters data with a deterministic scheme
Categorize Reuters' headlines with Al Jazeera's
geo-categories & check result
[Figure panels: Al Jazeera, Reuters (det.)]
Compare deterministic results with ML
[Figure panels: Reuters (det.), Reuters (ML)]
Summary
Steps 1-3: What we have achieved so far
We classified Reuters news by applying category-specific keywords for each
geo-category from Al Jazeera
Find rules for deterministic classification: find category-specific keywords
Apply rules: categorize data set with keywords
Evaluate results: Done correctly? Where are gaps?
Iterate & further develop rules: more rules, synonyms, …
Transfer useful categories from one source to the other in order
to make them comparable: Reuters' topics to Al Jazeera's text
We have: Reuters' topics
We want: Al Jazeera's text classified according to Reuters' logic
Step 4
Visualize word frequency within topics in order to get a first
feeling for content
[Figure panels: Technology, Business]
Extract category-specific keywords within Al Jazeera's topics
Business & Technology
Step 4.1
Result within section Business: not one category-specific keyword (fail)
Result within section Technology: 9 category-specific keywords, not enough (fail)
If context is crucial or the text structure is more complex
(e.g. multi-word terms), the deterministic approach is flawed.
Let's try ML!
Use ML to categorize Al Jazeera's headlines with Reuters' topics
& check result
Step 4.2
[Figure panels: Al Jazeera, Reuters]
Summary
Step 4: What we have achieved so far
We classified Al Jazeera news by training an ML algorithm with already
categorized training sets from Reuters
Find source with interesting categories: relevant topic and enough data examples
Extract training set from source: categories already classified (no manual work)
Train ML algorithm: training set needs to be representative
Classify new text with trained ML algorithm: be careful with new words and changed TF/IDF
Evaluate data sample: evaluate classification result
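The warning about new words and changed TF/IDF can be made concrete with a small sketch. It assumes the common definition idf(w) = log(N / df(w)); the talk does not state which variant was used, and all function names and sample texts here are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def fit_idf(documents):
    """Inverse document frequency idf(w) = log(N / df(w)) for a corpus."""
    n = len(documents)
    df = Counter(w for doc in documents for w in set(tokenize(doc)))
    return {w: math.log(n / count) for w, count in df.items()}

def tfidf(text, idf):
    """TF-IDF weights for one text, plus the words the model has never seen."""
    counts = Counter(tokenize(text))
    known = {w: c * idf[w] for w, c in counts.items() if w in idf}
    unknown = sorted(w for w in counts if w not in idf)
    return known, unknown

training = ["oil prices rise", "oil markets fall", "election results"]
idf = fit_idf(training)

# A new headline contains a word that never occurred in the training set.
weights, unknown = tfidf("bitcoin prices rise", idf)
print(weights, unknown)

# Refitting on a larger corpus changes the weight of words already known.
idf_larger = fit_idf(training + ["oil exports grow"])
print(idf["oil"], idf_larger["oil"])
```

The unknown word contributes nothing to the feature vector, and the weight of "oil" drops once it appears in more documents. Both effects can shift classification results when a trained model meets a different corpus.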
How does the ML algorithm work?
Example Support Vector Machine
Machine learning is linear algebra
Fixed values necessary
- Categories are already discrete
Complicated for text
- Model necessary
- Different alternatives
Many different learning models, e.g.
- Support Vector Machines (popular)
- Neural networks
- Random forest
- Decision trees
1. Learn logic from coding set:
SVM learns how to separate blue points from grey points
2. Classification of total data set:
SVM applies its knowledge to the grey points, unknown so far
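The two phases, learning from the coding set and then classifying unknown points, can be sketched with a tiny linear SVM. This is an illustration only: it trains by plain subgradient descent on the regularized hinge loss, which is not necessarily the solver used in the talk or in any SVM library, and the 2D points, learning rate and regularization strength are assumptions.

```python
def train_linear_svm(points, labels, epochs=50, lr=0.1, lam=0.01):
    """Fit weights w and bias b of a linear SVM by subgradient descent
    on the regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            # The regularization term shrinks w on every step.
            w = [w[0] * (1 - lr * lam), w[1] * (1 - lr * lam)]
            if margin < 1:
                # Point is misclassified or inside the margin: correct for it.
                w = [w[0] + lr * y * x1, w[1] + lr * y * x2]
                b += lr * y
    return w, b

def predict(w, b, point):
    """Return +1 or -1 depending on the side of the separating line."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1

# 1. Learn from the coding set (hypothetical, centered 2D points).
blue = [(-1.0, -1.0), (-2.0, -1.0), (-1.0, -2.0)]   # label +1
other = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]        # label -1
w, b = train_linear_svm(blue + other, [1, 1, 1, -1, -1, -1])

# 2. Classify points that were not part of the coding set.
print(predict(w, b, (-1.5, -1.2)), predict(w, b, (1.5, 1.2)))
```

For text, each headline would first be turned into a vector (for example TF-IDF weights), so the same separation happens in a space with one dimension per word.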
Let's take a step back and find out:
How can I MEASURE the classification results?
Quantify classification results with the metrics
precision & recall
Step 5
Example: We want to find all Reuters news which belong to category Europe
[Figure: keywords for category Europe: France, Poland, Spain, Italian. The headline "More and more Italian restaurants in China" matches the keyword "Italian". Labels: Recall, Precision]
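Precision is the share of retrieved headlines that really belong to the category, and recall is the share of the category's headlines that were retrieved. The sketch below shows how adding the keyword "Italian" trades one for the other. Only the China headline is from the slide; the other headlines and their labels are hypothetical.

```python
def precision_recall(predicted, actual):
    """predicted: ids assigned to the category; actual: ids truly in it."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# id -> (headline, truly belongs to category Europe?)
headlines = {
    1: ("France votes in runoff", True),
    2: ("Poland rejects EU quota", True),
    3: ("Italian banks seek rescue", True),
    4: ("More and more Italian restaurants in China", False),
    5: ("Storm hits Lisbon", True),
}
actual = {i for i, (_, is_europe) in headlines.items() if is_europe}

def matches(keywords):
    """Ids of all headlines containing at least one keyword."""
    return {i for i, (text, _) in headlines.items()
            if any(kw in text for kw in keywords)}

narrow = matches(["France", "Poland", "Spain"])
broad = matches(["France", "Poland", "Spain", "Italian"])

print(precision_recall(narrow, actual))  # fewer hits, all correct
print(precision_recall(broad, actual))   # more hits, one of them wrong
```

With the extra keyword, recall rises because the Italian banks headline is now found, while precision falls because the restaurant headline is found as well.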
Back to reality:
How to handle classification projects with customer-specific categories
Typical project:
Classification of loads of data with non-standard categories
Step 6
Step 1: Find suitable categories
- Functional discussion with project team
- Topic modelling
Step 2: Verify categories
- Well-defined and reproducible
(not necessarily mutually exclusive)
- Ideally 100% coverage
Step 3: Find pre-categorized data
- Saves a lot of work but not always possible
Step 4: Manual classification of training set
- Very expensive
- Extensive QA necessary
- Correct training set has high impact on quality
of final results
Step 5: Training, QA and optimization
- Try different algorithms
- Crossfolding (cross-validation)
- Iterate and improve
Step 6: Classification
- Classify
- Manual QA
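"Crossfolding" in Step 5 presumably refers to k-fold cross-validation: the labelled data is split into k folds, and each fold serves once as test set while the rest is used for training. A minimal sketch follows, with a trivial majority-class model standing in for a real classifier. All names and data are hypothetical, and a real project would usually shuffle and stratify the folds.

```python
from collections import Counter

def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate(items, labels, k, train_fn, predict_fn):
    """Average accuracy over k folds for any train/predict pair."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

# Stand-in model: always predict the most frequent training label.
def train_majority(items, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, item):
    return model

texts = ["t0", "t1", "t2", "t3", "t4", "t5"]
labels = ["x", "x", "x", "x", "y", "x"]
score = cross_validate(texts, labels, 3, train_majority, predict_majority)
print(round(score, 3))
```

Swapping in different `train_fn` and `predict_fn` pairs gives a like-for-like comparison of algorithms on the same folds.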
Big Data: Select training set, e.g. 10,000 from 1,500,000 balls
Step 6.1
Challenge
Choose the best training set for your problem
Ursus Wehrlis
Preparation of training set
Step 6.2
1. Good situation:
The manually classified data set contains all the words of
the complete data set.
2. Not so good situation:
The manually classified data contains only a fraction of all
the words in the complete data set
- Select documents with highest word variability
  - Word heterogeneity = number of words in all documents (excluding stopwords)
  - Long tail distribution (many, many words are used infrequently)
  - Even distribution
- Complicated: knapsack-like problem
- Use an approximate approach (like a genetic algorithm)
- Crucial for all following tasks
[Figure: word heterogeneity in training set (words w01 to w18) compared with word heterogeneity in complete data set (words w01 to w99). Panel labels: common distribution, dictionary distribution]
Intelligently choose training set
Step 6.3
Final data set available:
- Choose the training set so that its word overlap with the complete data set is maximal
- WM = { words in training set }, WC = { words in complete set }
  find the maximum of | WC ∩ WM | = | WM |
- Improved approach: choose the training set to minimize the number of headlines
  with unknown words in the complete data set
- With C = complete data set: find the minimum of | { h ∈ C : h contains a word not in WM } |
- More complicated, but worth it
Final data set not available:
- Optimize for high variability and high usage
[Figure labels: Select this, Don't select that]
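Both objectives can be illustrated with a simple greedy selection. The slides suggest an approximate approach such as a genetic algorithm, so the greedy heuristic below is a swapped-in simplification, and the headlines and function names are hypothetical. It picks headlines that add the most new words (maximizing |WM|) and then counts the headlines of the complete set that still contain unknown words.

```python
def words(headline):
    return set(headline.lower().split())

def select_training_set(headlines, budget):
    """Greedily pick `budget` headlines, each time taking the one that
    adds the most words not yet covered. Returns (indices, covered words)."""
    covered = set()
    chosen = []
    remaining = list(range(len(headlines)))
    for _ in range(min(budget, len(headlines))):
        best = max(remaining, key=lambda i: len(words(headlines[i]) - covered))
        chosen.append(best)
        covered |= words(headlines[best])
        remaining.remove(best)
    return chosen, covered

def headlines_with_unknown_words(headlines, covered):
    """Count headlines containing at least one word outside the covered set."""
    return sum(1 for h in headlines if not words(h) <= covered)

complete = [
    "oil prices rise",
    "oil prices fall",
    "election results in kenya",
    "kenya election delayed",
    "oil exports rise",
]
chosen, wm = select_training_set(complete, budget=2)
print(chosen, len(wm), headlines_with_unknown_words(complete, wm))
```

A greedy pass like this is fast but can miss better combinations, which is why the knapsack-like nature of the problem motivates heuristic search on large data sets.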
Result
Optimized training set
Summary: our learnings
Focus on cost-efficiency of your classification result
- Get more pre-categorized data by
  - Categories from other sources
  - NLP (e.g. FB pre-trained word vectors) & semantic extraction
- Choose no more documents than necessary for manual training set classification
- Take courage and admit when it's best to finish: Don't get lost in the long tail
Focus on high quality of your classification result
- Choose the right training set for ML
- Choose the best algorithm for your specific problem
- Optimize the chosen algorithm
Classifying unstructured text
Dr. Christian Winkler
Enterprise Architect
Big Data, Data Science
mgm technology partners
https://www.linkedin.com/in/drchristianwinkler/
Stephanie Fischer
Product Owner Text Analytics
mgm consulting partners
https://www.linkedin.com/in/steffifischer/
Ursus Wehrlis

Weitere ähnliche Inhalte

Was ist angesagt?

Manipulating data with Talend. Learn how?
Manipulating data with Talend. Learn how?Manipulating data with Talend. Learn how?
Manipulating data with Talend. Learn how?Edureka!
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHDataiku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages JaunesDataiku
 
Data for Action Talk - 2016-02-22
Data for Action Talk - 2016-02-22Data for Action Talk - 2016-02-22
Data for Action Talk - 2016-02-22David E Drummond
 
Is Hadoop a Necessity for Data Science
Is Hadoop a Necessity for Data ScienceIs Hadoop a Necessity for Data Science
Is Hadoop a Necessity for Data ScienceEdureka!
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...Dataiku
 
Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku  - for Data Geek Paris@Criteo - Close the Data CircleDataiku  - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku - for Data Geek Paris@Criteo - Close the Data CircleDataiku
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabeDataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 
H2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaH2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaSri Ambati
 
Data Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerData Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerDataWorks Summit
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013Dataiku
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Josh Patterson
 

Was ist angesagt? (20)

Manipulating data with Talend. Learn how?
Manipulating data with Talend. Learn how?Manipulating data with Talend. Learn how?
Manipulating data with Talend. Learn how?
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
Data for Action Talk - 2016-02-22
Data for Action Talk - 2016-02-22Data for Action Talk - 2016-02-22
Data for Action Talk - 2016-02-22
 
Is Hadoop a Necessity for Data Science
Is Hadoop a Necessity for Data ScienceIs Hadoop a Necessity for Data Science
Is Hadoop a Necessity for Data Science
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 
Using hadoop for big data
Using hadoop for big dataUsing hadoop for big data
Using hadoop for big data
 
Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku  - for Data Geek Paris@Criteo - Close the Data CircleDataiku  - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
 
Dataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine LearningDataiku - From Big Data To Machine Learning
Dataiku - From Big Data To Machine Learning
 
Eat whatever you can with PyBabe
Eat whatever you can with PyBabeEat whatever you can with PyBabe
Eat whatever you can with PyBabe
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
H2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaH2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral Bajaria
 
00 hadoop welcome_transcript
00 hadoop welcome_transcript00 hadoop welcome_transcript
00 hadoop welcome_transcript
 
Data Science with Hadoop: A Primer
Data Science with Hadoop: A PrimerData Science with Hadoop: A Primer
Data Science with Hadoop: A Primer
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
Before Kaggle
Before KaggleBefore Kaggle
Before Kaggle
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 

Ähnlich wie Classifying Unstructured Text - A Hybrid Deterministic/ML Approach

Binary search query classifier
Binary search query classifierBinary search query classifier
Binary search query classifierEsteban Ribero
 
DataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdfDataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdfJedha Bootcamp
 
Data Science Workshop - day 2
Data Science Workshop - day 2Data Science Workshop - day 2
Data Science Workshop - day 2Aseel Addawood
 
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language ModelsDataScienceConferenc1
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Mathieu DESPRIEE
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloOCTO Technology
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Data Science in Industry - Applying Machine Learning to Real-world Challenges
Data Science in Industry - Applying Machine Learning to Real-world ChallengesData Science in Industry - Applying Machine Learning to Real-world Challenges
Data Science in Industry - Applying Machine Learning to Real-world ChallengesYuchen Zhao
 
Brand Strategy and Super Bowl Twitter AnalyticsImage Sou.docx
Brand Strategy and Super Bowl Twitter AnalyticsImage Sou.docxBrand Strategy and Super Bowl Twitter AnalyticsImage Sou.docx
Brand Strategy and Super Bowl Twitter AnalyticsImage Sou.docxAASTHA76
 
How to be a Good Machine Learning PM by Google Product Manager
How to be a Good Machine Learning PM by Google Product ManagerHow to be a Good Machine Learning PM by Google Product Manager
How to be a Good Machine Learning PM by Google Product ManagerProduct School
 
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...Michael Mortenson
 
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...Michael Mortenson
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dmsumit621
 
DMDW Lesson 05 + 06 + 07 - Data Mining Applied
DMDW Lesson 05 + 06 + 07 - Data Mining AppliedDMDW Lesson 05 + 06 + 07 - Data Mining Applied
DMDW Lesson 05 + 06 + 07 - Data Mining AppliedJohannes Hoppe
 
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesIntroduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesQuestionPro
 

Ähnlich wie Classifying Unstructured Text - A Hybrid Deterministic/ML Approach (20)

Binary search query classifier
Binary search query classifierBinary search query classifier
Binary search query classifier
 
DataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdfDataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdf
 
Data Science Workshop - day 2
Data Science Workshop - day 2Data Science Workshop - day 2
Data Science Workshop - day 2
 
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
[DSC Europe 23] Dmitry Ustalov - Design and Evaluation of Large Language Models
 
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
Big Data & Machine Learning - TDC2013 São Paulo - 12/0713
 
Big Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao PauloBig Data & Machine Learning - TDC2013 Sao Paulo
Big Data & Machine Learning - TDC2013 Sao Paulo
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 
Bayesian reasoning
Bayesian reasoningBayesian reasoning
Bayesian reasoning
 
Data Science in Industry - Applying Machine Learning to Real-world Challenges
Data Science in Industry - Applying Machine Learning to Real-world ChallengesData Science in Industry - Applying Machine Learning to Real-world Challenges
Data Science in Industry - Applying Machine Learning to Real-world Challenges
 
Brand Strategy and Super Bowl Twitter AnalyticsImage Sou.docx
Brand Strategy and Super Bowl Twitter AnalyticsImage Sou.docxBrand Strategy and Super Bowl Twitter AnalyticsImage Sou.docx
Brand Strategy and Super Bowl Twitter AnalyticsImage Sou.docx
 
Introduction to spss
Introduction to spssIntroduction to spss
Introduction to spss
 
How to be a Good Machine Learning PM by Google Product Manager
How to be a Good Machine Learning PM by Google Product ManagerHow to be a Good Machine Learning PM by Google Product Manager
How to be a Good Machine Learning PM by Google Product Manager
 
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
A Topic Model of Analytics Job Adverts (Operational Research Society Annual C...
 
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
A Topic Model of Analytics Job Adverts (The Operational Research Society 55th...
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
 
BD-ACA Week8a
BD-ACA Week8aBD-ACA Week8a
BD-ACA Week8a
 
DMDW Lesson 05 + 06 + 07 - Data Mining Applied
DMDW Lesson 05 + 06 + 07 - Data Mining AppliedDMDW Lesson 05 + 06 + 07 - Data Mining Applied
DMDW Lesson 05 + 06 + 07 - Data Mining Applied
 
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesIntroduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
 
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
 

Mehr von DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

Mehr von DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin


Classifying Unstructured Text - A Hybrid Deterministic/ML Approach

  • 1. Classifying unstructured text. Stephanie Fischer, Christian Winkler. DataWorks Summit Munich, 2017-04-05. (Hamburg, München, Berlin, Köln, Leipzig)
  • 2. Unstructured content is everywhere. Most of it exists in a vacuum, so individual texts cannot be compared with each other. Unstructured means hardly comparable. Let's find an efficient way of comparing different texts with each other.
  • 3. Today we will develop a method for making different texts about similar content comparable. Fake news? Real news? Who knows in these times? It seems like everything is just a question of point of view and of getting the audience's attention. The focus of the media shapes people's opinions. But what is the focus of the different media? Example: comparing news headlines from Reuters and Al Jazeera.
  • 4. Step 1: Compare word frequency of news by visualizing its data. Al Jazeera: 94,309 headlines over 8.5 years. Reuters World News: 163,919 headlines over 9 years. Visualizations created with Apache Solr and D3.js; see our talk from Apache Big Data Vancouver 2016 here: https://bigdata.mgm-tp.com/apache/ Result: They look similar!
  • 5. Step 2: Use what's already there: categories. Compare and select pre-defined categories of Al Jazeera and Reuters. [Bar charts by category. Reuters (axis 0 to 80,000): World, US, Politics, Top News, Business News, Markets, Technology, Deals, Personal Finance, Business, Economy, Green Business, Bonds, Sports, Small Business. Al Jazeera (axis 0 to 20,000): news-middleeast, news-americas, news-europe, news-asia-pacific, news-africa, news-asia, news, indepth-opinion, indepth-features, indepth-inpictures, focus, blogs-americas, indepth-spotlight, blogs-asia, indepth-interactive.]
  • 7. Step 3: Transfer useful categories from one source to the other in order to make them comparable. We have: Al Jazeera's geo-localized categories. We want: Reuters' text categorized according to Al Jazeera's logic. We have: Reuters' topics. We want: Al Jazeera's text classified according to Reuters' logic.
  • 8. Step 3.1: Examples of category-specific keywords extracted from Al Jazeera news. There are specific keywords for Al Jazeera's geo-categories (*90% precision):
      - Europe, 23 keywords: Ukraine, Spain, …; Paris, London, …; Merkel, Putin, …
      - Asia-Pacific, 23 keywords: Taiwan, Thailand, …; Beijing, Bangkok, …; Thaksin, Typhoon, Kim
      - Americas, 23 keywords: Cuba, Bolivia, …; Guantanamo, …; bp, Castro, Chavez, …
      - Africa, 32 keywords: Kenya, Somalia, …; Darfur, …; Mandela, Mugabe, …
      - Middle-East, 25 keywords: Syria, Israel, …; Baghdad, Cairo, …; Mubarak, Olmert, …
      - Asia, 23 keywords: Pakistan, Kyrgyzstan, ...; Nepal, …; Musharraf, Karzai, …
  • 9. Step 3.2: Naïve selection of category-specific keywords for deterministic classification leads to wrong results.
      - "President Trump's apartment in New York": cities like "York" (1, UK), names like "Trump" (1, US). Result: europe + americas.
      - "Yorkshires are world's most popular dog": cities like "York" (1, UK). Result: europe.
      - "Theresa May's press conference in York": cities like "York" (1, UK), names like "Theresa May" (1, UK). Result: europe.
    Next challenge: Categorize Reuters data with a deterministic scheme.
  • 10. Step 3.2: Categorize Reuters' headlines with Al Jazeera's geo-categories and check the result. [Charts: Al Jazeera vs. Reuters (det.)]
  • 11. Step 3.2: Compare deterministic results with ML. [Charts: Reuters (det.) vs. Reuters (ML)]
  • 12. Summary of steps 1-3: What we have achieved so far. We classified Reuters news by applying category-specific keywords for each geo-category from Al Jazeera.
      - Find rules for deterministic classification: find category-specific keywords.
      - Apply rules: categorize the data set with keywords.
      - Evaluate results: Done correctly? Where are gaps?
      - Iterate and further develop rules: more rules, synonyms, …
  • 13. Step 4: Transfer useful categories from one source to the other in order to make them comparable: Reuters' topics to Al Jazeera's text. We have: Reuters' topics. We want: Al Jazeera's text classified according to Reuters' logic. (Previous direction: We have: Al Jazeera's geo-localized categories. We want: Reuters' text categorized according to Al Jazeera's logic.)
  • 14. Step 4: Visualize word frequency within topics in order to get a first feeling for the content. [Word-frequency visualizations: Technology, Business]
  • 15. Step 4.1: Extract category-specific keywords within Al Jazeera's topics Business & Technology. Result within section Business: not one specific word. Fail. Result within section Technology: 9 category-specific keywords. Fail, not enough. If context is crucial or the text structure is more complex (e.g. multi-word), the deterministic approach is flawed. Let's try ML!
  • 16. Step 4.2: Use ML to categorize Al Jazeera's headlines with Reuters' topics and check the result. [Charts: Al Jazeera, Reuters]
  • 17. Summary of step 4: What we have achieved so far. We classified Al Jazeera news by training an ML algorithm with already categorized training sets from Reuters.
      - Find a source with interesting categories: relevant topic and enough data examples.
      - Extract the training set from the source: categories already classified (no manual work).
      - Train the ML algorithm: the training set needs to be representative.
      - Classify new text with the trained ML algorithm: be careful with new words and changed TF/IDF.
      - Evaluate a data sample: evaluate the classification result.
  • 18. How does the ML algorithm work? Example: Support Vector Machine. Machine learning is linear algebra. Fixed values are necessary: categories are already discrete, but text is complicated, so a model is necessary, and there are different alternatives. Many different learning models exist, e.g. Support Vector Machines (popular), neural networks, random forests, decision trees. (1) Learn logic from the coding set: the SVM learns how to separate blue points from grey points. (2) Classification of the total data set: the SVM applies its knowledge to the grey points, unknown so far.
  • 19. Let's take a step back and find out: How can I MEASURE the classification results?
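  • 20. Step 5: Quantify classification results with the metrics precision and recall. Precision is the share of retrieved items that really belong to the category; recall is the share of the category's items that were retrieved. Example: We want to find all Reuters news which belong to the category Europe, using the keywords Europe, France, Poland, Spain, Italian. The keyword "Italian" also matches "More and more Italian restaurants in China", which lowers precision. [Diagram labels: Recall, Precision]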
  • 21. Back to reality: How to handle classification projects with customer-specific categories
  • 22. Step 6: Typical project: classification of loads of data with non-standard categories.
      - Step 1: Find suitable categories. Functional discussion with the project team; topic modelling.
      - Step 2: Verify categories. Well-defined and reproducible (not necessarily mutually exclusive); ideally 100% coverage.
      - Step 3: Find pre-categorized data. Saves a lot of work but is not always possible.
      - Step 4: Manual classification of the training set. Very expensive; extensive QA necessary; a correct training set has a high impact on the quality of the final results.
      - Step 5: Training, QA and optimization. Try different algorithms; crossfolding (cross-validation); iterate and improve.
      - Step 6: Classification. Classify; manual QA.
  • 23. Step 6.1: Big Data: Select a training set, e.g. 10,000 from 1,500,000 balls. Challenge: Choose the best training set for your problem. [Image credit: Ursus Wehrlis]
  • 24. Step 6.2: Preparation of the training set.
      - 1. Good situation: The manually classified data set contains all the words of the complete data set.
      - 2. Not so good situation: The manually classified data contains only a fraction of all the words in the complete data set.
      - Select documents with the highest word variability. Word heterogeneity = number of words in all documents (without stopwords). Long-tail distribution (many, many words used infrequently). Even distribution.
      - Complicated: knapsack-like problem. Use an approximate approach (like a genetic algorithm).
      - Crucial for all following tasks.
    [Diagram: word heterogeneity in the training set (w01 to w18) vs. word heterogeneity in the complete data set (w01 to w99); labels: complete set, common distribution, dictionary distribution]
  • 25. Step 6.3: Intelligently choose the training set.
      - Final data set not available: optimize for high variability and high usage. [Diagram labels: Select this; Don't select that]
      - Final data set available: choose the training set so as to create maximal word overlap with the complete data set. With WM = { words in training set } and WC = { words in complete set }, find the maximum of |WC ∩ WM| = |WM|.
      - Improved approach: choose the training set to minimize the headlines with unknown words in the complete data set. Find the minimum of |C ∩ WM|. More complicated, but worth it.
  • 27. Summary: Our learnings.
      - Focus on the cost-efficiency of your classification result. Get more pre-categorized data through categories from other sources and through NLP (e.g. FB pre-trained word vectors) and semantic extraction. Choose no more documents than necessary for manual training set classification. Take courage and admit when it's best to finish: don't get lost in the long tail.
      - Focus on the high quality of your classification result. Choose the right training set for ML. Choose the best algorithm for your specific problem. Optimize the chosen algorithm.
    [Image credit: Ursus Wehrlis]
  • 28. Classifying unstructured text. Dr. Christian Winkler, Enterprise Architect Big Data, Data Science, mgm technology partners (https://www.linkedin.com/in/drchristianwinkler/). Stephanie Fischer, Product Owner Text Analytics, mgm consulting partners (https://www.linkedin.com/in/steffifischer/). [Image credit: Ursus Wehrlis]
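
The deterministic scheme from slides 8 and 9 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the keyword lists are hypothetical (a handful of names from the slides, with "York", "Theresa May" and "Trump" taken from the slide 9 example), and `classify_with_rules` with its `blocked_phrases` parameter is an assumed example of the "more rules" iteration mentioned on slide 12.

```python
import re

# Hypothetical keyword lists. The names appear on slides 8 and 9, but the
# full lists used in the talk are not part of the transcript.
KEYWORDS = {
    "europe": ["York", "Theresa May", "Ukraine", "Spain", "Paris", "London", "Merkel", "Putin"],
    "americas": ["Trump", "Cuba", "Bolivia", "Guantanamo", "Castro", "Chavez"],
}


def classify_naive(headline):
    """Assign every category with a keyword occurring anywhere in the text."""
    text = headline.lower()
    return sorted(cat for cat, words in KEYWORDS.items()
                  if any(word.lower() in text for word in words))


def classify_with_rules(headline, blocked_phrases=("New York",)):
    """Assumed refinement: mask misleading phrases, then match whole words only."""
    text = headline
    for phrase in blocked_phrases:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    return sorted(
        cat for cat, words in KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(word) + r"\b", text, flags=re.IGNORECASE)
               for word in words)
    )


for headline in ("President Trump's apartment in New York",
                 "Yorkshires are world's most popular dog",
                 "Theresa May's press conference in York"):
    print(headline, classify_naive(headline), classify_with_rules(headline))
```

Plain substring matching reproduces the wrong results from slide 9: "New York" and "Yorkshires" both trigger the keyword "York". Matching whole words and masking known misleading phrases removes these two errors, but each such rule has to be found and maintained by hand, which is the iteration cost the talk points to.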

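Precision and recall, as introduced on slide 20, can be computed from two sets: the headlines a keyword rule retrieves and the headlines that truly belong to the category. The sketch below uses the keywords listed on slide 20; the sample headlines and their labels are hypothetical, except for the "Italian restaurants" headline, which is the slide's own counterexample.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved ids versus truly relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Keywords as listed on slide 20.
EUROPE_KEYWORDS = ["europe", "france", "poland", "spain", "italian"]

# (headline, belongs to Europe?): hypothetical sample; only the third
# headline appears on the slide.
SAMPLE = [
    ("France votes in first round of election", True),
    ("Poland and Spain sign trade deal", True),
    ("More and more Italian restaurants in China", False),
    ("Merkel meets Putin in Berlin", True),
    ("Ukraine ceasefire talks resume", True),
    ("Typhoon hits Taiwan", False),
]

retrieved = {i for i, (text, _) in enumerate(SAMPLE)
             if any(keyword in text.lower() for keyword in EUROPE_KEYWORDS)}
relevant = {i for i, (_, is_europe) in enumerate(SAMPLE) if is_europe}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this sample the rule retrieves three headlines, two of them correctly, and misses two European headlines, so precision is 2/3 and recall is 1/2. Applied to the figures in the speaker notes (2,369 of 4,235 "Technology" documents found), the same recall formula gives 2369 / 4235 ≈ 0.56, which the notes round to "ca. 50%".
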
Editor's notes

  1. Steffi: Welcome to our talk about “Classifying unstructured text with deterministic and ML approaches”!
  2. We transferred the pre-categorized scheme from Al Jazeera to Reuters. Wow, now the data is comparable! General procedure: Use pre-categorized data and transfer the logic deterministically to other texts wherever possible.
  3. 4,235 documents in category “Technology”: 2,369 can be found with 25 category-specific keywords. Only about half, so recall is ca. 50%. Precision is only 75%.
  4. But: How can you be sure? How can you measure the result? World: 91% precision, 90% recall. US: 88% precision, 80% recall.
  5. Take the Reuters categories TECHNOLOGY and BUSINESS as the training set. Categorize Al Jazeera. Explanation of how ML works.
  6. Transition: Quantity is not everything. What about the quality of the content discussed? Next slide: sentiment analysis.
  7. Steffi: I did all the QA. I tried to verify 200 documents and 80% of them were wrong! We can't give this to our customer. How can this happen? What parameters can we adjust to improve the result?
  9. Christian: Let's assume you have headlines with either 5 common words or 3 random strings: "U.S. election takes place November"; "adfpoi4r afdafp23 sad234"; "Italian earthquake destroys many villages"; "4234asdas oirutmbs rieo234". Then we get the highest variability by selecting 8,000 headlines with only random words, but that is of no use. We must select headlines with the most common words (but each only once), as they give us the highest chance of finding them again.
  14. Steffi: Our talk is designed around the classification of real data: We took 1 million headlines from the online news archive of the British newspaper Telegraph. Before diving into the actual text classification, we will do some data preparation: text statistics and finding relevant categories. The main part of our talk will be a detailed description of the text classification, from both a functional and a technical perspective. We will finish with our top 10 lessons learned and give you some nice ideas on how you can use the knowledge from this talk for your own projects.
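
The training-set selection idea from slide 25 and note 9 can be illustrated with a small greedy heuristic. This is a sketch under stated assumptions, not the method used in the talk: slide 24 suggests an approximate approach such as a genetic algorithm, whereas the code below swaps in a simple greedy selection. It picks headlines whose not-yet-covered words are most common in the complete set, counting each word only once. The function names and the last two sample headlines are hypothetical; the first four headlines come from note 9.

```python
from collections import Counter

# The first four headlines are the examples from note 9; the last two are
# hypothetical additions so that common words actually recur.
HEADLINES = [
    "U.S. election takes place November",
    "adfpoi4r afdafp23 sad234",
    "Italian earthquake destroys many villages",
    "4234asdas oirutmbs rieo234",
    "U.S. election takes place",
    "Earthquake destroys Italian villages",
]


def tokenize(headline):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return headline.lower().split()


def select_training_set(headlines, k):
    """Greedily pick k headlines; each newly covered word counts once,
    weighted by the number of headlines in the complete set containing it."""
    doc_freq = Counter(w for h in headlines for w in set(tokenize(h)))
    covered, chosen = set(), []
    candidates = list(range(len(headlines)))
    for _ in range(min(k, len(headlines))):
        best = max(candidates,
                   key=lambda i: sum(doc_freq[w]
                                     for w in set(tokenize(headlines[i])) - covered))
        chosen.append(best)
        covered |= set(tokenize(headlines[best]))
        candidates.remove(best)
    return chosen, covered


def headlines_with_unknown_words(headlines, known_words):
    """Count headlines containing at least one word outside known_words."""
    return sum(1 for h in headlines if not set(tokenize(h)) <= known_words)


chosen, covered = select_training_set(HEADLINES, 2)
# Pure "variability" would prefer the two random-string headlines (note 9).
random_words = set(tokenize(HEADLINES[1])) | set(tokenize(HEADLINES[3]))
print(chosen, headlines_with_unknown_words(HEADLINES, covered),
      headlines_with_unknown_words(HEADLINES, random_words))
```

On this toy corpus the greedy choice leaves two headlines with unknown words (the random strings), while a training set made of the two random-string headlines leaves four. That is the effect note 9 describes: rare words maximize variability but are unlikely to be seen again.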