6_Big Data Sources part3-Day 3_A_text_mining.pptx

Eurostat
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Text Mining & Natural Language Processing
Ali Hürriyetoglu, Piet Daas

Outline
• Introduction
• Background
• Basic steps
• Use cases
• Machine learning for text mining

What can you do with text mining?
• Named entity recognition
• Sentiment analysis
• Topic detection
• Information extraction
• Trend detection
• Clustering similar documents
• Automatic summarisation

Ingredients of text mining
• Text analytics is a function of:
• The amount and type of text you have
• The task you want to achieve
• The precision and recall you want to get
• The time you can spend
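Precision and recall can be made concrete with a small sketch. The function name and the example sets are illustrative assumptions, not part of the course material: precision is the share of returned items that are correct, recall is the share of correct items that were returned.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for two collections of items."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: entities a system found vs. entities actually present.
found = {"Sherlock Holmes", "Baker Street", "Good Morning"}
actual = {"Sherlock Holmes", "Baker Street", "New York", "United States"}
precision, recall = precision_recall(found, actual)
print(precision, recall)
```

Here two of the three found items are correct, and two of the four actual entities were found.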

Text types
• Semi-structured language use: addresses, phone numbers, named entities, etc.
• Standard text: News articles, books, etc.
• User generated text: social media, comments

Text
• Text is a rich combination of symbols that form a structure whose interpretation depends on context.
• Symbols: character, word, punctuation, digit, emoticon
• Structure: tokens, links, user names, hashtags, nouns, verbs, named entities, emoticons, phrases, codes, etc.
• Context: writer, genre, platform, social environment, time, geographic location, etc.
• Interpretation: sense, meaning, …

Symbols
• Letters: A B Ç X
• Digits: 1 5 3 2
• Punctuation: . , ! ?
• Emoticons:  
• Special characters: ^ # &

Structure
• Tokens: any space-separated symbol sequence (for European languages).
• Numbers: 6, 123, …
• Web specific tokens: user names, hashtags, URLs, …
• Abbreviations: vs., etc., ...
• Syntactic interpretation: noun, verb, adjective, ...
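The token types listed above can be sketched as a whitespace tokenizer that attaches a coarse label to each token. The patterns, the label names and the sample sentence are illustrative assumptions; real tokenizers handle many more cases (punctuation attached to words, for instance).

```python
import re

# Patterns are tried in order; the first full match decides the token type.
TOKEN_TYPES = [
    ("url", re.compile(r"https?://\S+")),
    ("user_name", re.compile(r"@\w+")),
    ("hashtag", re.compile(r"#\w+")),
    ("number", re.compile(r"\d+(?:[.,]\d+)*")),
    ("abbreviation", re.compile(r"(?:vs|etc|e\.g|i\.e)\.", re.IGNORECASE)),
]

def classify_tokens(text):
    """Split on whitespace and label each token with a coarse type."""
    labelled = []
    for token in text.split():
        for name, pattern in TOKEN_TYPES:
            if pattern.fullmatch(token):
                labelled.append((token, name))
                break
        else:
            # No special pattern matched: treat it as an ordinary word.
            labelled.append((token, "word"))
    return labelled

tokens = classify_tokens(
    "@eurostat shared 123 charts #bigdata https://example.org etc."
)
for token, label in tokens:
    print(f"{token}\t{label}")
```

Syntactic labels such as noun or verb need a part-of-speech tagger and are outside the scope of this sketch.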

Context
• Anything about the use of a token may have a significant effect on its interpretation:
• The person who uses it
• The aim of the phrase
• Time and place of the language use
• Preceding and following expressions
• ...

Interpretation
• Tokens and phrases may have one or more interpretations.
• Ambiguity: lexical meaning may differ
• Named entities: the same entity name may refer to different real-world entities
• Genre: orders, compliments, statements, instructions, etc.
• Usernames: are interpreted differently on different platforms

Basic steps and tools
• You need some combination of:
• Language identification
• Sentence splitting
• Tokenization
• Lemmatization
• Anaphora resolution
• Regular expressions
• POS tagging
• Named entity recognition
• Parsing, e.g. with Pyparsing
• Language resources: stop words, a sentiment lexicon, multi-word expressions, an ontology, etc.
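Three of the steps above (sentence splitting, tokenization and stop word filtering) can be chained as in the following minimal sketch. It uses only the Python standard library; the splitting rule, the tiny stop word list and the sample text are simplifying assumptions, and dedicated NLP toolkits handle these steps far more robustly.

```python
import re

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to", "it"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lower-case the sentence and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

text = "Text mining is fun. It is a mix of tools and resources!"
processed = [remove_stop_words(tokenize(s)) for s in split_sentences(text)]
print(processed)
```

The naive splitter breaks on abbreviations such as "etc." in the middle of a sentence, which is one reason sentence splitting is listed as a step of its own.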

Named entities
• Problem: You want to know which named entities occur in a text. You do not have much time or many resources. An approximate result is sufficient.
• Solution: Find and count all sequences of capitalised tokens: ([A-Z][a-z]+(\s[A-Z][a-z]+)+)
• ('Sherlock Holmes', 90),
• ('United States', 71),
• ('New York', 54),
• ('New England', 46),
• ('Baker Street', 29),
• …
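A minimal sketch of this counting approach with Python's re module and collections.Counter follows. The sample text is made up for illustration, and the inner group is written as non-capturing so that each match is the whole phrase.

```python
import re
from collections import Counter

# Two or more capitalised words in a row, separated by one whitespace character.
PROPER_SEQUENCE = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

def count_proper_sequences(text):
    """Count every run of capitalised words found in the text."""
    return Counter(m.group(0) for m in PROPER_SEQUENCE.finditer(text))

sample = ("Sherlock Holmes lived in Baker Street. "
          "Dr Watson visited Sherlock Holmes before leaving for New York.")
counts = count_proper_sequences(sample)
print(counts.most_common(3))
```

The result is approximate by design: single-word names are missed, and a capitalised sentence-initial word followed by a name is counted as part of the entity.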

Street names
• Problem: You have a set of crime reports. You wonder which street names are mentioned most often.
• Solution: Write a more specific regular expression: [A-Z][a-z]+ [sS]treet
• ('Baker Street', 29),
• ('Leadenhall Street', 5),
• ('Fresno Street', 2),
• ('Fenchurch Street', 2),
• ('Bow Street', 2),
• ('Oxford Street', 2),
• …
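The street name expression can be applied to a collection of reports as sketched below. The two reports are invented for illustration.

```python
import re
from collections import Counter

# One capitalised word followed by "street" or "Street".
STREET = re.compile(r"[A-Z][a-z]+ [sS]treet")

# Hypothetical reports; in practice these would be read from files.
reports = [
    "A burglary was reported in Baker Street last night.",
    "Witnesses saw the suspect near Baker Street and Oxford street.",
]

street_counts = Counter(
    match for report in reports for match in STREET.findall(report)
)
print(street_counts.most_common())
```

Spelling variants such as "Oxford street" and "Oxford Street" are counted separately here; normalising the case before counting would merge them.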

Detect economic indicators
• Problem: You want to detect and track price changes. You want to be precise. You know what you are looking for and can spend some time specifying it.
• Solution: Parse text with Pyparsing*
• from pyparsing import oneOf, Word, alphas, Literal
• action = oneOf(["lower","increase","decrease"], caseless=True)
• econ = oneOf(["prices","expense","cost","price"], caseless=True)
• item = Word(alphas)
• economy_grammar = action("action")+item("item")+econ
• economy_grammar2 = econ + Literal("of") + item + action
*For R, use the tm package

Sentiment Analysis
• Problem: You want to understand how people
feel about a certain issue or entity.
• Solution 1: Create a sentiment lexicon or use an available one. Count the occurrences of its entries in the text.
• Solution 2: Perform a detailed syntactic and semantic analysis.
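Solution 1 can be sketched as a count-based score: the number of positive lexicon hits minus the number of negative ones. The two word sets below are a hypothetical mini-lexicon chosen for illustration; a real lexicon would contain thousands of entries.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "sad", "hate", "poor"}

def sentiment_score(text):
    """Count-based score: positive hits minus negative hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives

positive_example = sentiment_score("I love this great service")
negative_example = sentiment_score("Terrible delays and poor support")
print(positive_example, negative_example)
```

A pure count ignores negation ("not good" still scores as positive), which is the kind of case that the deeper analysis of Solution 2 is meant to handle.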

Wordclouds
• Problem: You have text and want quick insight into what it mostly contains.
• Solution: Word cloud, streamgraph, t-SNE, …
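A word cloud is essentially a drawing of word frequencies, so the underlying computation can be sketched without any plotting library. The stop word list and the sample text are illustrative assumptions; a word cloud library would then scale each word by its count.

```python
import re
from collections import Counter

# Small illustrative stop word list; frequent function words would
# otherwise dominate the picture.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "is", "for"}

def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop words with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

text = ("Statistics for the people. Official statistics and big data: "
        "data sources for statistics.")
top_words = word_frequencies(text)
print(top_words)
```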

Example word cloud: https://github.com/amueller/word_cloud/blob/master/examples/constitution.png

Track co-evolution of language use
https://blog.twitter.com/2010/the-2010-world-cup-a-global-conversation

Topic modelling
• Problem: You need a detailed analysis of the topics in a text collection (corpus).
• Solution: Topic modelling

http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html

Machine Learning
• You can attempt to solve almost any text mining task with machine learning approaches. The outcome will depend on:
• Feature extraction and selection
• The amount of labelled data, in the case of supervised learning
• The time you have to analyse the output, in the case of unsupervised learning
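Feature extraction, the first factor above, can be illustrated with a bag-of-words representation, in which each document becomes a vector of token counts. This is a minimal standard-library sketch with invented example documents; machine learning libraries provide more complete vectorizers.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct lower-cased token to a column index."""
    vocabulary = {}
    for document in documents:
        for token in document.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_vector(document, vocabulary):
    """Count how often each vocabulary token occurs in the document."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector

# Hypothetical documents, for illustration only.
docs = ["prices rise again", "prices fall", "fuel prices rise"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Vectors of this kind are what a supervised classifier or a clustering algorithm would take as input; tokens outside the vocabulary are simply ignored.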

Thanks for listening!
Any questions or comments?

Exercises
• 6) Search for key terms on Twitter and collect n tweets (n = 200)
• 7) Determine the most frequent hashtags, links and mentions
• 8) Create a word cloud of these tweets
• 9) Detect topics in tweets (from either a user or a key-term search result)
• 10) Sentiment analysis: create your own list of 10 positive and 10 negative words and calculate a count-based score
• 11) Look for an online classifier (for the language of your tweets), get an access key and test it (watch the rate limit)
• E.g. MonkeyLearn
• 12) Study emoticons as an example of basic emotions

Additional exercises
• Additional tasks:
• 13) Detect place names, person names, organisation names, numbers and dates; determine geolocation and temporal characteristics; find similar tweets
• 14) Apply the t-distributed stochastic neighbour embedding (t-SNE) visualisation technique to tweets
