1. Eurostat
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Text Mining & Natural Language Processing
Ali Hürriyetoglu, Piet Daas
4. Eurostat
What can you do with text mining?
• Named entity recognition
• Sentiment analysis
• Topic detection
• Information extraction
• Trend detection
• Clustering similar documents
• Automatic summarisation
5. Eurostat
Ingredients of text mining
• Text analytics is a function of:
• The amount and type of text you have
• The task you want to achieve
• The precision and recall you want to get
• The time you can spend
6. Eurostat
Text types
• Semi-structured text: addresses, phone numbers,
named entities, etc.
• Standard text: News articles, books, etc.
• User generated text: social media, comments
8. Eurostat
Text
• Text is a rich combination of symbols that form
a structure whose interpretation depends on
context.
• Symbols: character, word, punctuation, digit, emoticon
• Structure: tokens, links, user names, hashtags, noun,
verb, named entity, emoticon, phrases, codes, etc.
• Context: writer, genre, platform, social environment,
time, geographic location, etc.
• Interpretation: sense, meaning, …
9. Eurostat
Symbols
• Letters: A B Ç X
• Digits: 1 5 3 2
• Punctuation: . , ! ?
• Emoticons:
• Special characters: ^ # &
10. Eurostat
Structure
• Tokens: Any space-separated sequence of symbols
(for European languages).
• Numbers: 6, 123, …
• Web specific tokens: user names, hashtags, URLs, …
• Abbreviations: vs., etc., ...
• Syntactic interpretation: noun, verb, adjective, ...
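The token types listed on this slide can be told apart with a handful of patterns. The following is a minimal sketch, not a complete tokeniser: the pattern list, the label names and the function name `classify_tokens` are illustrative assumptions, and syntactic labels such as noun or verb are out of scope because they require a part-of-speech tagger.

```python
import re

# Illustrative patterns only (assumptions); real tokenisers cover many more cases.
TOKEN_TYPES = [
    ("url", re.compile(r"^https?://\S+$")),
    ("hashtag", re.compile(r"^#\w+$")),
    ("user_name", re.compile(r"^@\w+$")),
    ("number", re.compile(r"^\d+([.,]\d+)*$")),
    ("abbreviation", re.compile(r"^(vs|etc|e\.g|i\.e)\.$", re.IGNORECASE)),
]

def classify_tokens(text):
    """Split on whitespace and label each token with the first matching type."""
    labelled = []
    for token in text.split():
        label = "word"  # default when no pattern matches
        for name, pattern in TOKEN_TYPES:
            if pattern.match(token):
                label = name
                break
        labelled.append((token, label))
    return labelled

print(classify_tokens("@eurostat prices up 123 vs. 2015 #inflation https://example.org/x"))
```

Splitting on whitespace leaves punctuation attached to words ("prices," stays one token), which is one reason practical tokenisers go beyond this definition.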
11. Eurostat
Context
• Anything about the use of a token may have a
significant effect:
• The person who uses it
• The aim of the phrase
• Time and place of the language use
• Preceding and following expressions
• ...
12. Eurostat
Interpretation
• Tokens and phrases may have one or more
interpretations.
• Ambiguity: the lexical meaning may differ
• Named entities: the same entity name may refer to
different real-world entities
• Genre: Orders, compliments, statements, instructions,
etc.
• Usernames: will be interpreted differently on different
platforms
16. Eurostat
Named entities
• Problem: You want to know which named entities occur
in a text. You do not have much time or many
resources. An approximate result is sufficient.
• Solution: Find and count all proper-cased token
sequences: ([A-Z][a-z]+(\s[A-Z][a-z]+)+)
• ('Sherlock Holmes', 90),
• ('United States', 71),
• ('New York', 54),
• ('New England', 46),
• ('Baker Street', 29),
• …
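A minimal sketch of this counting approach with Python's standard library. The sample text and the function name `count_proper_cased` are invented for illustration, and the inner group is made non-capturing so that `findall` returns the whole match.

```python
import re
from collections import Counter

# Two or more consecutive capitalised words, e.g. "Sherlock Holmes".
# The inner group is non-capturing so findall returns the full sequence.
PROPER_CASED = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

def count_proper_cased(text):
    """Count every proper-cased token sequence found in the text."""
    return Counter(PROPER_CASED.findall(text))

# Invented sample text, for illustration only.
sample = (
    "Sherlock Holmes lived in Baker Street. "
    "A visitor from New York asked for Sherlock Holmes."
)
print(count_proper_cased(sample).most_common(3))
```

The result is approximate by design: single-word names are missed, and a capitalised sentence-initial word is absorbed into the match, so "The United States" would be counted as one sequence.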
17. Eurostat
Street names
• Problem: You have a set of crime reports.
You wonder which street names are mentioned
most often.
• Solution: Write a more specific regular
expression: [A-Z][a-z]+ [sS]treet
• ('Baker Street', 29),
• ('Leadenhall Street', 5),
• ('Fresno Street', 2),
• ('Fenchurch Street', 2),
• ('Bow Street', 2),
• ('Oxford Street', 2),
• …
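The street-name pattern can be applied to a collection of documents in the same way. This sketch assumes the reports are available as a list of strings; the two sample reports and the function name `count_streets` are invented. Normalising with `title()` is an added assumption so that "Oxford street" and "Oxford Street" are counted together.

```python
import re
from collections import Counter

# A capitalised word followed by "street" or "Street".
STREET = re.compile(r"[A-Z][a-z]+ [sS]treet")

def count_streets(reports):
    """Count street-name mentions across all reports."""
    counts = Counter()
    for report in reports:
        # title() merges the spelling variants "street" and "Street".
        counts.update(match.title() for match in STREET.findall(report))
    return counts

# Invented sample reports, for illustration only.
reports = [
    "The theft took place in Baker Street.",
    "Witnesses in Baker Street and Oxford street saw nothing.",
]
print(count_streets(reports).most_common())
```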
18. Eurostat
Detect economic indicators
• Problem: You want to detect and track price
changes. You want to be precise. You know what
you are looking for and can spend some time
specifying it.
• Solution: Parse text with Pyparsing*
• from pyparsing import Literal, Word, alphas, oneOf
• action = oneOf(["lower","increase","decrease"], caseless=True)
• econ = oneOf(["prices","expense","cost","price"], caseless=True)
• item = Word(alphas)
• economy_grammar = action("action")+item("item")+econ
• economy_grammar2 = econ + Literal("of") + item + action
*For R use tm package
19. Eurostat
Sentiment Analysis
• Problem: You want to understand how people
feel about a certain issue or entity.
• Solution 1: Create or use an available sentiment
lexicon. Count the occurrences of the lexicon
entries in the text.
• Solution 2: Detailed syntactic and semantic
analysis.
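Solution 1 can be sketched in a few lines. The two word sets below are a hypothetical toy lexicon, not a published resource, and the score (positive hits minus negative hits) is one simple choice among several.

```python
import re

# Hypothetical toy lexicon (an assumption); a real lexicon would be far larger.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "hate", "terrible"}

def sentiment_score(text):
    """Return the number of positive tokens minus the number of negative tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives

print(sentiment_score("Great service, I love it"))
print(sentiment_score("Terrible delays and poor support"))
```

Counting ignores negation and irony: "not good" still scores as positive here, which is the kind of case the syntactic and semantic analysis of Solution 2 is meant to handle.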
20. Eurostat
Wordclouds
• Problem: You have text, and want quick insight
into what it mostly contains.
• Solution: Word cloud, streamgraph, t-SNE, …
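A word cloud is driven by word frequencies, so the underlying computation can be sketched without any plotting library. The stop word list and the function name `word_weights` are illustrative assumptions; rendering the cloud itself would need a third-party package and is left out.

```python
import re
from collections import Counter

# Small illustrative stop word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are", "for", "on"}

def word_weights(text, top_n=10):
    """Return the top_n (word, relative frequency) pairs, ignoring stop words."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(top_n)]

print(word_weights("the price of oil and the price of gas", top_n=2))
```

The relative frequencies are what a word cloud would map to font size.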
26. Eurostat
Machine Learning
• You can attempt to solve almost any text mining
task with machine learning approaches. The
outcome will depend on:
• Feature extraction and selection
• Amount of labeled data in the case of supervised learning
• Time you have to analyze the output in unsupervised
learning
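As an illustration of the first point, the sketch below turns text into bag-of-words count vectors, with a frequency threshold as a very simple form of feature selection. This is one common choice and not necessarily the approach the authors have in mind; the function names, the `min_count` parameter and the sample documents are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents, min_count=1):
    """Feature selection: keep tokens seen at least min_count times overall."""
    counts = Counter(token for doc in documents for token in tokenize(doc))
    kept = sorted(token for token, count in counts.items() if count >= min_count)
    return {token: index for index, token in enumerate(kept)}

def vectorize(text, vocabulary):
    """Feature extraction: one count per vocabulary entry."""
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

# Invented sample documents, for illustration only.
docs = ["prices increase again", "prices decrease", "weather is nice"]
vocab = build_vocabulary(docs, min_count=2)
print(vocab, vectorize("prices and prices", vocab))
```

Vectors like these are the input a supervised classifier or a clustering algorithm would then work with.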
28. Eurostat
Exercises
• 6) Search for key terms on Twitter and collect n tweets (n = 200)
• 7) Determine most frequent hashtags, links, mentions
• 8) Create a word cloud of these tweets
• 9) Topic detection from tweets (either a user's tweets or a key-term
search result)
• 10) Sentiment analysis: create your own list of 10 positive and 10
negative words and calculate a count-based score
• 11) Look for an online classifier (for the language of your tweets),
get access key and test it (watch the rate limit)
• E.g. MonkeyLearn
• 12) Study emoticons as an example for basic emotions
29. Eurostat
Additional exercises
• Additional tasks:
• 13) Recognise place names, person names, organisation names,
numbers and dates; detect geolocation and temporal characteristics;
find similar tweets
• 14) Apply the t-distributed stochastic neighbour embedding (t-SNE)
visualisation technique to tweets