Eurostat
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Text Mining & Natural
Language Processing
Ali Hürriyetoglu, Piet Daas
Outline
• Introduction
• Background
• Basic steps
• Use cases
• Machine learning for text mining
Introduction
What can you do with text mining?
• Named entity recognition
• Sentiment analysis
• Topic detection
• Information extraction
• Trend detection
• Clustering similar documents
• Automatic summarisation
Ingredients of text mining
• Text analytics is a function of:
• The amount and type of text you have
• The task you want to achieve
• The precision and recall you want to get
• The time you can spend
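Precision and recall can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the example sets are assumptions, not part of the slides. Precision is the share of returned items that are correct; recall is the share of correct items that were returned.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for predicted items against the relevant ones."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: entities found by a system vs. entities marked by an annotator.
found = {"Sherlock Holmes", "Baker Street", "Good Morning"}
gold = {"Sherlock Holmes", "Baker Street", "New York", "United States"}
precision, recall = precision_recall(found, gold)
print(precision, recall)
```

Here two of the three returned items are correct, and two of the four annotated items were found.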
Text types
• Semi-structured language use: addresses, phone
numbers, named entities, etc.
• Standard text: news articles, books, etc.
• User-generated text: social media, comments
Background
Text
• Text is a rich combination of symbols that form a
structure whose interpretation depends on
context.
• Symbols: character, word, punctuation, digit, emoticon
• Structure: tokens, links, user names, hashtags, noun,
verb, named entity, emoticon, phrases, codes, etc.
• Context: writer, genre, platform, social environment,
time, geographic location, etc.
• Interpretation: sense, meaning, …
Symbols
• Letters: A B Ç X
• Digits: 1 5 3 2
• Punctuation: . , ! ?
• Emoticons: :-) :-(
• Special characters: ^ # &
Structure
• Tokens: any space-separated symbol sequence
(for European languages).
• Numbers: 6, 123, …
• Web specific tokens: user names, hashtags, URLs, …
• Abbreviations: vs., etc., ...
• Syntactic interpretation: noun, verb, adjective, ...
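As a rough illustration of these token types, the sketch below splits a text on whitespace and assigns each token a coarse label. The patterns and label names are assumptions chosen for this example; real user names, hashtags and URLs are more varied.

```python
import re

# Illustrative patterns (assumptions, not from the slides); checked in this order.
TOKEN_PATTERNS = [
    ("url", re.compile(r"https?://\S+")),
    ("mention", re.compile(r"@\w+")),
    ("hashtag", re.compile(r"#\w+")),
    ("number", re.compile(r"\d+([.,]\d+)*")),
]


def classify_tokens(text):
    """Split on whitespace and label each token with a coarse type."""
    labelled = []
    for token in text.split():
        label = "word"
        for name, pattern in TOKEN_PATTERNS:
            if pattern.fullmatch(token):
                label = name
                break
        labelled.append((token, label))
    return labelled


example = "@eurostat published 123 tables #opendata https://example.org/data"
tokens = classify_tokens(example)
print(tokens)
```

Whitespace splitting leaves punctuation attached to words ("tables," stays one token), which is one reason a dedicated tokenizer is usually preferred.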
Context
• Anything about the use of a token may have a
significant effect on its interpretation:
• The person who uses it
• The aim of the phrase
• Time and place of the language use
• Preceding and following expressions
• ...
Interpretation
• Tokens and phrases may have one or more
interpretations.
• Ambiguity: the lexical meaning may differ
• Named entities: the same entity name may refer to
different real-world entities
• Genre: orders, compliments, statements, instructions,
etc.
• Usernames: interpreted differently on different
platforms
Basic steps
Basic steps and tools
• You need some combination of:
• Language identification
• Sentence splitting
• Tokenization
• Lemmatization
• Anaphora resolution
• Regular expressions
• POS tagging
• Named entity recognition
• Parsing, e.g. with Pyparsing
• Language resources: stop words, a sentiment lexicon, multi-word
expressions, ontology, etc.
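A few of these steps can be chained using only the Python standard library. The sketch below covers sentence splitting, tokenization and stop word removal; the stop word list and the splitting rule are deliberately naive assumptions, not a recommended configuration.

```python
import re

# Tiny illustrative stop word list; real lists are longer and language specific.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "to", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def preprocess(text):
    """Return one list of content tokens per sentence."""
    return [
        [token for token in tokenize(sentence) if token not in STOP_WORDS]
        for sentence in split_sentences(text)
    ]


sample = "Prices are rising in the city. Is it a trend? The data says yes!"
processed = preprocess(sample)
print(processed)
```

The splitter will wrongly break after abbreviations such as "vs." or "etc.", which is why sentence splitting is listed as a step of its own.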
Use cases
Named entities
• Problem: You want to know which named entities
occur in a text. You do not have much time or many
resources. An approximate result is sufficient.
• Solution: Find and count all proper-cased token
sequences: ([A-Z][a-z]+(\s[A-Z][a-z]+)+)
• ('Sherlock Holmes', 90),
• ('United States', 71),
• ('New York', 54),
• ('New England', 46),
• ('Baker Street', 29),
• …
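A minimal sketch of this counting approach, using Python's `re` module and `collections.Counter`. The pattern is the one from the slide with `\s` as the whitespace class and a non-capturing inner group, so that `findall` returns whole matches; the sample text is made up for illustration.

```python
import re
from collections import Counter

# Two or more capitalised words, each separated by one whitespace character.
PROPER_SEQUENCE = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")


def count_proper_sequences(text):
    """Count every capitalised multi-word sequence in the text."""
    return Counter(PROPER_SEQUENCE.findall(text))


sample = (
    "Sherlock Holmes lived in Baker Street. "
    "A visitor from New York asked for Sherlock Holmes."
)
counts = count_proper_sequences(sample)
print(counts.most_common())
```

The result is approximate by design: a capitalised sentence-initial word followed by a name (for example "Then Holmes") is also counted, and single-word entities are missed.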
Street names
• Problem: You have a set of crime reports.
You wonder which street names are mentioned
most often.
• Solution: Write a more specific regular
expression: [A-Z][a-z]+ [sS]treet
• ('Baker Street', 29),
• ('Leadenhall Street', 5),
• ('Fresno Street', 2),
• ('Fenchurch Street', 2),
• ('Bow Street', 2),
• ('Oxford Street', 2),
• …
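The street name pattern can be applied in the same way. In this sketch the reports are invented, and normalising matches with `str.title()` is an added assumption so that "Oxford street" and "Oxford Street" are counted together.

```python
import re
from collections import Counter

# A capitalised word followed by "Street" or "street".
STREET = re.compile(r"[A-Z][a-z]+ [sS]treet")

# Hypothetical reports for illustration.
reports = [
    "A burglary was reported in Baker Street.",
    "Witnesses saw the suspect near Oxford street and later in Baker Street.",
]

street_counts = Counter(
    match.title() for report in reports for match in STREET.findall(report)
)
print(street_counts.most_common())
```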
Detect economic indicators
• Problem: You want to detect and track price
changes. You want to be precise. You know what
you are looking for and can spend some time
specifying it.
• Solution: Parse text with Pyparsing*
• action = oneOf(["lower","increase","decrease"], caseless=True)
• econ = oneOf(["prices","expense","cost","price"], caseless=True)
• item = Word(alphas)
• economy_grammar = action("action")+item("item")+econ
• economy_grammar2 = econ + Literal("of") + item + action
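The grammar fragments above can be assembled into a runnable sketch as follows. It assumes the third-party `pyparsing` package is installed and uses its camelCase names (`oneOf`, `searchString`) as on the slide. The helper `find_price_changes`, the extra results names and the sample sentences are assumptions added for illustration.

```python
from pyparsing import Literal, Word, alphas, oneOf

# Vocabulary from the slide; caseless matching returns the spelling listed here.
action = oneOf(["lower", "increase", "decrease"], caseless=True)
econ = oneOf(["prices", "expense", "cost", "price"], caseless=True)
item = Word(alphas)

# "<action> <item> <econ>", e.g. "increase fuel prices"
economy_grammar = action("action") + item("item") + econ("econ")
# "<econ> of <item> <action>", e.g. "price of milk decreased"
economy_grammar2 = econ("econ") + Literal("of") + item("item") + action("action")


def find_price_changes(text):
    """Return (action, item) pairs found by either grammar (hypothetical helper)."""
    pairs = []
    for grammar in (economy_grammar, economy_grammar2):
        for match in grammar.searchString(text):
            pairs.append((match["action"], match["item"]))
    return pairs


# Hypothetical sentences for illustration.
sample = "Retailers plan to increase fuel prices. The price of milk decreased last week."
changes = find_price_changes(sample)
print(changes)
```

Because the literals match without word boundaries, "decreased" is recognised through its prefix "decrease". The same behaviour would also match "lower" inside "flower", so a stricter grammar may be needed in practice.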
*For R use tm package
Sentiment Analysis
• Problem: You want to understand how people
feel about a certain issue or entity.
• Solution 1: Create or use an available sentiment
lexicon. Count the number of occurrences of the
lexicon entries.
• Solution 2: Detailed syntactic and semantic
analysis.
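Solution 1 can be sketched in a few lines. The lexicon below is a made-up toy list, and the score (positive hits minus negative hits) is one simple choice among several possible ones.

```python
import re

# Hypothetical mini lexicon; a real one would hold many more entries.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "poor", "sad", "hate"}


def sentiment_score(text):
    """Return the number of positive minus negative lexicon hits in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


texts = (
    "I love the new service, great work!",
    "Bad connection and poor support.",
    "The office opens at nine.",
)
scores = [sentiment_score(text) for text in texts]
print(scores)
```

A count like this ignores negation ("not good") and irony, which is where the more detailed analysis of Solution 2 becomes relevant.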
Wordclouds
• Problem: You have text and want quick
insight into what it mostly contains.
• Solution: Word cloud, streamgraph, t-SNE, …
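A word cloud is essentially a drawing of word frequencies, so the underlying counts can be sketched without any plotting library. The stop word list and the sample text below are assumptions for illustration; a rendering library can then scale each word by its count.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "we"}  # illustrative only


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop-words with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS).most_common(top_n)


sample = "We the people of the states, in order to form a union of states and people"
top_words = word_frequencies(sample, top_n=2)
print(top_words)
```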
Example word cloud. Source: https://github.com/amueller/word_cloud/blob/master/examples/constitution.png
Track co-evolution of language use
Source: https://blog.twitter.com/2010/the-2010-world-cup-a-global-conversation
Topic modelling
• Problem: You need a detailed analysis of the
topics in a text collection (corpus).
• Solution: Topic modelling
Source: http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html
Machine learning
Machine Learning
• You can attempt to solve almost any text mining
task with machine learning approaches. The
outcome will depend on:
• Feature extraction and selection
• Amount of labeled data in the case of supervised learning
• Time you have to analyze the output in unsupervised
learning
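Feature extraction and selection can be illustrated with a bag-of-words representation. In this sketch the document-frequency threshold, the function names and the example documents are all assumptions; the resulting count vectors are what a classifier or clustering method would take as input.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents, min_doc_freq=2):
    """Feature selection: keep words that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for document in documents:
        doc_freq.update(set(tokenize(document)))
    return sorted(word for word, freq in doc_freq.items() if freq >= min_doc_freq)


def vectorize(document, vocabulary):
    """Feature extraction: one count per vocabulary word."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


docs = [
    "prices increase again",
    "fuel prices increase",
    "holiday photos from the beach",
]
vocabulary = build_vocabulary(docs)
vectors = [vectorize(doc, vocabulary) for doc in docs]
print(vocabulary, vectors)
```

With so little data the third document ends up with an all-zero vector, a small reminder that the outcome depends heavily on the amount of data and on the features kept.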
Thanks for listening!
Any questions or comments?
Exercises
• 6) Search for key terms on Twitter and collect n tweets (n = 200)
• 7) Determine the most frequent hashtags, links and mentions
• 8) Create a word cloud of these tweets
• 9) Topic detection from tweets (either a user's tweets or a key term search
result)
• 10) Sentiment analysis: create your own list of 10 positive and 10
negative words and calculate a count-based score
• 11) Look for an online classifier (for the language of your tweets),
get an access key and test it (watch the rate limit)
• E.g. MonkeyLearn
• 12) Study emoticons as an example of basic emotions
Additional exercises
• Additional tasks:
• 13) Detect place names, person names, organisation names,
numbers and dates; extract geolocation/temporal characteristics;
find similar tweets
• 14) Apply the t-distributed stochastic neighbour embedding (t-SNE)
visualization technique to the tweets