Meltwater Budapest, April 2016
The importance of entities
Babak Rasolzadeh, Director of Data Science Research
1. Company background
2. Data Science @ Meltwater
3. Challenges with NLP at large scale
4. Entities, entities, entities
a. Social NER
b. ELS
c. Knowledge Graph
What is Meltwater?
● A business intelligence company → Providing insights from data outside the firewall (news, blogs, social media, etc.)
● Founded in Oslo in 2001.
● Founder and CEO: JĂžrn Lyseggen
● www.meltwater.com
● 30K+ clients all over the world.
● 1000+ employees
● 60+ offices around the world, mostly sales.
● Tech offices: USA, Germany, Sweden, Hungary, India.
Why?
own brand
competitors
leads
partners
product reviews
own industry
What?
Example client use cases:
● Uses Meltwater to find out about new instances of vandalism and break-ins. Often, the victim is in need of services.
● Uses Meltwater to help determine how public perception of certain ingredient chemicals will influence adoption & sales.
● Uses Meltwater to be alerted when certain patents will expire in target markets.
● Uses Meltwater to monitor the performance and popularity of news anchors and programs.
● Uses Meltwater social listening to estimate and prevent infrastructure attacks.
How?
NLP & Data Science at Meltwater
[Architecture diagram labels: Unstructured Document Stream; Pipeline (Enrichments); Search/Storage (Enriched Documents, High Performance Indexes); Processing Services; API Layer; Apps; Backup Storage (Raw Documents)]
15 supported languages in pipeline (EN, DE, SV, NO, FI, ZH, JP, FR, ES, DA, NL, PT, AR, IT, HI)
Typical enrichments
○ Sentiment analysis
○ Thematic analysis
○ Categorization
○ Keyphrase extraction
○ Named Entity Recognition
○ Named Entity Disambiguation
What other than NLP?
● Recommendation Engines
[Figure: documents (DOC3, DOC8) feeding a realtime recommender engine]
● Correlation and predictive pattern recognition
● Word2vec techniques
[Figure labels: concept 1, concept 2, concept 3]
"British American Tobacco" or "British American Tobbaco" or (BAT near tobacco) or "英矎煙草" or (("Lucky Strike" or "Dunhill" or "Pall Mall") near/15 cigarette*)
Machine Learning Terminology
Challenges with Data Science (NLP) at scale
● High DPS (~2000 documents per second) and a lot to do (tokenization, lemmatization, stemming, POS tagging, categorization, sentiment, NER, ...), with race conditions!
[Diagram: pipeline enrichments (POS, NER) per language (SV, EN, DE)]
● Training data labelling is costly, and the cost multiplies across 15 languages!
● Contextual information is computationally expensive.
● Noise, missing data, variation (e.g. slang), data types, ...
Knowledge Base Strategy
Entities, entities, entities
don - July 2015
What are Named Entities (NE)?
● Non-linguistic definition
○ Referable entities
○ Usually Proper Names
○ Single or multi-word
→ I know this man. He might be Charles.
→ He lives in Stockholm. He is Swedish.
What is Named Entity Recognition (NER)?
1. Extracting NEs from a text.
2. Categorizing NEs into a set of predefined categories.
John lives in Stockholm. He works at Ericsson.
Categories of {PER, LOC, ORG, MISC, PROD}
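The two steps above (extract, then categorize) are often represented with per-token BIO tags. A minimal sketch, assuming the tags are already given by some tagger; the tag sequence below is written by hand for the slide's example sentence and the function name is hypothetical:

```python
from typing import List, Tuple

def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, category) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new entity starts here; close the previous one if any.
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # "O": outside any entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], ""
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["John", "lives", "in", "Stockholm", ".", "He", "works", "at", "Ericsson", "."]
tags = ["B-PER", "O", "O", "B-LOC", "O", "O", "O", "O", "B-ORG", "O"]
entities = bio_to_entities(tokens, tags)
print(entities)  # [('John', 'PER'), ('Stockholm', 'LOC'), ('Ericsson', 'ORG')]
```

Note that "He" stays untagged: linking it back to John would be co-reference resolution, which is outside NER.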
What is NER not?
● NER is not event recognition.
● NER recognises entities in text, and classifies them in some way, but it
does not create templates, nor does it perform co-reference or entity
linking.
● NER is not just matching text strings with pre-defined lists of names. It only
recognises entities which are being used as entities in a given context.
(i.e. not easy!)
Why NER?
● Key part of an information extraction system
● Robust handling of proper names is essential for many applications
● Pre-processing for different classification levels
● Information filtering
● Information linking
● Entity-level sentiment
● Knowledge graph
Why NER?
Pepsi spooks Coke with this Halloween-themed ad.
Entity-specific sentiment analysis, a.k.a. ELS
So what about Social?
How to do NER? (state-of-the-art)
Supervised Learning
❏ Hidden Markov Model (HMM): Freitag and McCallum, 1999; Leek, 1997.
❏ Conditional Markov Model (CMM): Borthwick, 1999; McCallum et al., 2000.
✓ Conditional Random Field (CRF): Lafferty et al., 2001; Ratinov and Roth, 2009.
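A CRF tagger scores label sequences from features computed for each token and its neighbours. A minimal sketch of such a feature function; the feature names and the choice of features are illustrative assumptions, not the feature set used in the talk:

```python
from typing import Dict, List

def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for token i, including a one-token context window."""
    word = tokens[i]
    features: Dict[str, object] = {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:],
    }
    # Neighbouring words, with markers at the sentence boundaries.
    features["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    features["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return features

sentence = ["He", "works", "at", "Ericsson", "."]
feature_sequence = [token_features(sentence, i) for i in range(len(sentence))]
print(feature_sequence[3])
```

A sequence of such dicts, paired with gold labels, is the usual training input for a CRF toolkit.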
Training data
● Ground truth data collection for NER is very expensive
● Solutions:
○ Automatic NER annotation using Wikipedia data
○ Applying Latent Dirichlet Allocation (LDA) based NER detection using gazetteer data.
NER pipeline
Gazetteers help
Extensive lists of names for a specific category
● PER
○ First names (male/female) and surnames, with their frequency
● LOC
○ Cities, countries
○ Population
● ORG
○ Names of companies from the Yellow Pages.
Disadvantages
○ Difficult to create and maintain (or expensive if commercial)
○ Usefulness varies depending on category
○ Ambiguity
○ The same word can occur in several lists of different types (PER, LOC, FAC, ...)
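The ambiguity problem is easy to reproduce with a plain list lookup. A minimal sketch with tiny, made-up gazetteers (real lists hold many thousands of names); the dictionary contents and function name are assumptions for illustration:

```python
from typing import Dict, List, Set

# Hypothetical, tiny gazetteers keyed by entity category.
GAZETTEERS: Dict[str, Set[str]] = {
    "PER": {"john", "charles", "washington"},
    "LOC": {"stockholm", "london", "washington"},
    "ORG": {"ericsson", "meltwater"},
}

def gazetteer_categories(token: str) -> List[str]:
    """Return every category whose list contains the token."""
    key = token.lower()
    return sorted(cat for cat, names in GAZETTEERS.items() if key in names)

print(gazetteer_categories("Ericsson"))    # ['ORG']
print(gazetteer_categories("Washington"))  # ['LOC', 'PER']
```

The lookup result is therefore better used as one feature among many than as the final decision.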
Brown clustering - motivation
Let’s say we want to estimate the likelihood of the bi-gram "to Shanghai", without having seen this in a training set.
The system can obtain a good estimate if it can cluster "Shanghai" with other city names (like "London", "Beijing"), then make its estimate based on the likelihood of phrases such as "to London", "to Beijing" and "to Denver".
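The "to Shanghai" idea can be sketched as a class-based bi-gram estimate, P(w2 | w1) ≈ P(c2 | c1) · P(w2 | c2), where c1 and c2 are the clusters of the two words. The counts and the clustering below are made-up toy data, and word frequencies are approximated from second-position counts only:

```python
from collections import Counter
from typing import Dict

# Hypothetical bi-gram counts: "to Shanghai" itself never occurs.
bigram_counts: Counter = Counter({
    ("to", "London"): 6, ("to", "Beijing"): 3, ("to", "Denver"): 1,
    ("in", "Shanghai"): 2,
})
# Hypothetical hand-made clustering (Brown clustering would learn this).
cluster_of: Dict[str, str] = {
    "to": "PREP", "in": "PREP",
    "London": "CITY", "Beijing": "CITY", "Denver": "CITY", "Shanghai": "CITY",
}

def class_bigram_prob(w1: str, w2: str) -> float:
    """Estimate P(w2 | w1) as P(c2 | c1) * P(w2 | c2)."""
    c1, c2 = cluster_of[w1], cluster_of[w2]
    class_bigrams: Counter = Counter()
    first_class: Counter = Counter()
    second_word: Counter = Counter()
    for (a, b), n in bigram_counts.items():
        class_bigrams[(cluster_of[a], cluster_of[b])] += n
        first_class[cluster_of[a]] += n
        second_word[b] += n
    class_total = sum(n for w, n in second_word.items() if cluster_of[w] == c2)
    p_class = class_bigrams[(c1, c2)] / first_class[c1]   # P(c2 | c1)
    p_word = second_word[w2] / class_total                # P(w2 | c2)
    return p_class * p_word

print(bigram_counts[("to", "Shanghai")])       # 0: unseen bi-gram
print(class_bigram_prob("to", "Shanghai"))     # non-zero estimate via the CITY cluster
```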
Brown clustering (1)
● Proposed by Brown et al. (1992) (a.k.a. “IBM clustering”)
● Hierarchical class-based labeling method.
● Bottom-up
● Unsupervised learning
○ Doesn't need labeled data, only a large set of raw text.
● Greedy technique to maximize bi-gram mutual information (MI).
● Merge words by contextual similarity.
Brown clustering (2)
● Large amount of data
○ Similar words appear in similar contexts.
○ More precisely: similar words have similar distributions of words to their immediate left and right.
● Example: “the” and “a” are both determiners.
○ Frequency of immediate words on their left and right:
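A minimal sketch of how such left and right neighbour counts could be collected. The one-line corpus is made up for illustration, and the function name is hypothetical:

```python
from collections import Counter
from typing import List, Tuple

def context_counts(tokens: List[str], word: str) -> Tuple[Counter, Counter]:
    """Count the words seen immediately left and right of `word`."""
    left: Counter = Counter()
    right: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i < len(tokens) - 1:
            right[tokens[i + 1]] += 1
    return left, right

corpus = "he saw the dog and a cat then the cat saw a dog".split()
the_left, the_right = context_counts(corpus, "the")
a_left, a_right = context_counts(corpus, "a")
print(the_right)  # neighbours to the right of "the"
print(a_right)    # neighbours to the right of "a"
```

In this toy corpus "the" and "a" have identical right-hand neighbours (dog, cat), which is the kind of evidence that pushes them into the same cluster.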
Brown clustering (3)
Hmm...easy?
● What are the challenges in real applications?
● What about moving to other languages?
● What about moving to social domain?
Disambiguation
What is the entity category of “Washington”?
Different languages
● Tokenization
○ Chinese & Japanese: Words not separated
● Part of speech
○ Nouns
■ English: only number inflection
■ German: number, gender and case inflection
○ Verbs
■ English: regular verb 4, irregular verb up to 8 distinct forms
■ Finnish: more than 10,000 forms
● NER: Shape feature
○ English: Only proper nouns capitalized
○ German: All nouns capitalized
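The shape feature can be sketched as a mapping from characters to a compact pattern. The encoding below (X, x, d, with repeats collapsed) is one common convention and is an assumption here, not necessarily the one used in the talk:

```python
def word_shape(token: str) -> str:
    """Map characters to X (upper), x (lower), d (digit); collapse repeats."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # keep punctuation as-is
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)

english = [word_shape(t) for t in "The dog lives in Stockholm".split()]
german = [word_shape(t) for t in "Der Hund lebt in Stockholm".split()]
print(english)  # ['Xx', 'x', 'x', 'x', 'Xx']
print(german)   # ['Xx', 'Xx', 'x', 'x', 'Xx']
```

In the German sentence the common noun "Hund" gets the same shape as "Stockholm", so the feature separates names from ordinary nouns much less well than in English.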
Different languages
Studying the linguistic properties of a language is important!
Editorial vs. Social
Challenges in Social NER
● The performance of “off-the-shelf” NER methods degrades severely when applied to Twitter data
● Tweets
○ are short: 140 character limit.
○ cover a wide range of topics.
○ are written in grammatically broken language.
○ are written fast and posted from anywhere: a lot of misspellings.
→ We need a solution that considers the social characteristics of the text
Challenges in Social NER
Examples of noisy data
● Jaguar's gonna like this episode of #MadMen even less than last week's, I bet.
● Dane Bowers is in Asda I cant believe.it luckiest girl in the world omf i cant believe it omg
● A feel good story RT @DailyBreezeNews: Santa Claus arrives by helicopter at LAX to greet local school
Solution (1)
Adapting existing features to social properties
(The POS tagger of editorial NER performs really poorly when it comes to social documents.)
Solution (2)
Weight (importance) of each CRF feature
Results
● Training Data
○ ~76K tweets labeled by a human annotator.
● Inter-annotator agreement of two annotators.
● Test Data
○ ~9.1K tweets labeled by a human annotator.
● Improvement compared to the state-of-the-art method
Ritter, A. et al. Named entity recognition in tweets: An experimental study. EMNLP ’11, pages 1524–1534.
What about sentiment?
Document Level Sentiment - how it works
Inter-annotator agreement ~80%*
* http://bit.ly/human-sentiment
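A minimal sketch of how agreement between two annotators could be measured. The labels are made up so that raw agreement comes out at 80%; Cohen's kappa is included only as a common chance-corrected alternative and is not something the slide reports:

```python
from collections import Counter
from typing import List

def observed_agreement(a: List[str], b: List[str]) -> float:
    """Fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: List[str], b: List[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical sentiment labels from two annotators on the same 10 documents.
annotator_1 = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu"]
annotator_2 = ["pos", "neg", "neg", "neu", "neg", "pos", "pos", "neg", "pos", "neu"]

print(observed_agreement(annotator_1, annotator_2))
print(cohens_kappa(annotator_1, annotator_2))
```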
Document Level Sentiment - how it works
Machine Learning Magic
Supervised learning
Naive Bayes - BernoulliNB, GaussianNB, MultinomialNB
Support Vector Machines - LinearSVM, RbfSVM
Maximum Entropy Model - GIS, IIS, MEGAM, TADM
MLP - RecurrentNN
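To make the first family concrete, here is a minimal multinomial Naive Bayes sentiment classifier with add-one smoothing, written in plain Python. The four training sentences and the function names are made up for illustration; this is a sketch, not the production model:

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def train_nb(docs: List[Tuple[str, str]]):
    """Count documents per class and words per class."""
    class_docs: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def predict_nb(model, text: str) -> str:
    """Pick the class with the highest log prior + log likelihood."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = "", -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("great product love it", "positive"),
    ("love the great service", "positive"),
    ("terrible product hate it", "negative"),
    ("hate the awful service", "negative"),
]
model = train_nb(training)
print(predict_nb(model, "love this great phone"))      # positive
print(predict_nb(model, "awful terrible experience"))  # negative
```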
Document Level Sentiment - current status
~60-70% (depending on language)
Not too terrible, considering that human
performance is at best ~80%...
...but why is it so hard?
Document Level Sentiment - how it’s used
Document Level Sentiment - the problem
Negative
Neutral
Document Level Sentiment - the problem
“Those numbers underline a growing gap between McDonald's and today's fast-food customers. It will only get wider with another year's worth of the same uninspired fare that has made McDonald's customers easy pickings for Panera Bread, Chick-fil-A, Chipotle Mexican Grill and others.”
Negative
Positive
Does not make sense for our industry!
Entity Level Sentiment (ELS)
Entity Level Sentiment - motivation
● Document Level Sentiment (DLS) is imprecise and wrong for our customers
● Entities are of main importance for our customers
● We already have NER (Named Entity Recognition) technology
Idea:
Identify the sentiment towards each particular entity in a text!
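A deliberately naive sketch of this idea: score each sentence with a small polarity lexicon and credit the score to the entities mentioned in that sentence. The lexicon, the example text and the function name are assumptions, and the entity list stands in for the output of an NER step:

```python
from typing import Dict, List

# Hypothetical mini-lexicon; a real system would learn polarity from labeled data.
POLARITY: Dict[str, int] = {
    "great": 1, "reliable": 1, "love": 1,
    "poor": -1, "recall": -1, "faulty": -1,
}

def entity_sentiment(text: str, entities: List[str]) -> Dict[str, str]:
    """Label each entity by the summed polarity of the sentences mentioning it."""
    scores = {entity: 0 for entity in entities}
    for sentence in text.split("."):
        words = [w.strip(",!?;:") for w in sentence.lower().split()]
        sentence_score = sum(POLARITY.get(w, 0) for w in words)
        for entity in entities:
            if entity.lower() in words:
                scores[entity] += sentence_score
    labels = {}
    for entity, score in scores.items():
        labels[entity] = "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"
    return labels

text = ("BMW makes great and reliable cars. Mercedes opened a new plant. "
        "Toyota announced a recall of faulty airbags.")
result = entity_sentiment(text, ["BMW", "Mercedes", "Toyota"])
print(result)  # {'BMW': 'Positive', 'Mercedes': 'Neutral', 'Toyota': 'Negative'}
```

The sketch breaks as soon as one sentence praises one entity and criticizes another, as in the McDonald's example, which is part of why ELS is hard.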
Entity Level Sentiment - how it works
[Diagram: NER finds the entities in each document, and each entity gets its own sentiment label, e.g. BMW: Positive, Mercedes: Neutral, Toyota: Negative (generically Entity1: Positive, Entity2: Neutral, Entity3: Negative)]
Entity Level Sentiment - use case
Entity Level Sentiment - current status
● ELS is considered a very tough problem in NLP/ML
● The accuracy of state-of-the-art ELS is currently very low (~45%)
The holy grail: The Graph Knowledge Base
Entities + Relationships
As the types of entities and their relationships grow, so does the capacity to infer insights that depend on connectivity, and eventually one can answer questions that would otherwise not be possible with only separate datasets!
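A minimal sketch of a question that needs connectivity: "which products are made by competitors of Alice's employer?" No single dataset below answers it, but a walk over typed edges does. All entity and relation names are hypothetical:

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical facts, each of which could come from a separate dataset.
edges: List[Tuple[str, str, str]] = [
    ("Alice", "works_at", "AcmeCorp"),
    ("AcmeCorp", "competes_with", "Globex"),
    ("Globex", "makes", "WidgetX"),
    ("AcmeCorp", "makes", "GadgetY"),
]

graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
for source, relation, target in edges:
    graph[source].append((relation, target))

def follow(start: str, relations: List[str]) -> Set[str]:
    """Follow a chain of relation types from a start node."""
    frontier = {start}
    for relation in relations:
        frontier = {
            target
            for node in frontier
            for rel, target in graph.get(node, [])
            if rel == relation
        }
    return frontier

answer = follow("Alice", ["works_at", "competes_with", "makes"])
print(answer)  # {'WidgetX'}
```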
KB Architecture
[Architecture diagram labels: Unstructured Document Stream; Pipeline (Enrichments); Graph, Search (Enriched Documents, High Performance Indexes); Processing Services; API Layer; Knowledge Base (Graph) I/O; External Data Providers (Updates/subscriptions, Lookups); Apps; Backup Storage (Raw Documents)]
Why is it hard?
Composing the KB
Data Acquisition trade-offs
Goals: high volume, high quality, cheap
● Manual data acquisition → low volume
● Special crawlers, smart algorithms → low quality
● Acquisitions, partnerships → expensive
Composing the KB - Scalability
Scalability Requirements - next steps
Companies ~ 100 million worldwide
People ~ 500 million (including media influencers)
Products ~ 500 million
~1 billion entities + all the connections between them
→ billions of nodes, trillions of edges!
Composing the KB - New features
Improve entity search - company NED
Improve entity search - person NED
Robert Gates
22nd Secretary of Defense
William Henry Gates III
former CEO & cofounder of Microsoft
“Who is Mr. Gates?”
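A minimal sketch of person disambiguation by context overlap: each candidate has a small keyword profile, and the candidate sharing the most words with the mention's context wins. The profiles extend the two descriptions above with a few assumed keywords, and the function name is hypothetical:

```python
from typing import Dict, Set

# Hypothetical keyword profiles for the two candidates.
CANDIDATES: Dict[str, Set[str]] = {
    "Robert Gates": {"secretary", "defense", "pentagon", "military"},
    "William Henry Gates III": {"microsoft", "ceo", "cofounder", "software"},
}

def disambiguate(context: str) -> str:
    """Return the candidate whose profile overlaps most with the context."""
    words = {w.strip(".,!?\"").lower() for w in context.split()}
    # On a tie, max() keeps the first candidate in the dictionary.
    return max(CANDIDATES, key=lambda name: len(CANDIDATES[name] & words))

print(disambiguate("Mr. Gates spoke about Microsoft and its software business."))
print(disambiguate("Mr. Gates briefed the Pentagon as Secretary of Defense."))
```

With a knowledge graph, the profile would come from each candidate's neighbouring nodes rather than from a hand-written keyword set.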
Emerging competition
Map influencer network
Influencer score ~ e.g. PageRank
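A minimal PageRank sketch by power iteration, as one way such an influencer score could be computed. The "who mentions whom" network, the damping factor of 0.85 and the fixed iteration count are assumptions for illustration:

```python
from typing import Dict, List

def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iteratively redistribute rank along outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:  # dangling node: spread its rank evenly over all nodes
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical network: an edge means "mentions" or "retweets".
mentions = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol", "alice"],
}
scores = pagerank(mentions)
top_influencer = max(scores, key=scores.get)
print(top_influencer)  # carol
```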
Suggested reading
● Ratinov and Roth 2009 (challenges in NER): paper.
● ArkCMU (social): paper, code.
● Ritter et al. (social): paper, code.
● Stanford NLP NER (editorial): paper, code.
● Brown clustering
○ Brown clustering: video
○ Word Representations and N-grams: video
● Transforming Wikipedia into Named Entity Training Data: paper.

Babak Rasolzadeh: The importance of entities

  • 1. Meltwater Budapest, April 2016 The importance of entities Babak Rasolzadeh, Director of Data Science Research
  • 2. 1. Company background 2. Data Science @ Meltwater 3. Challenges with NLP at Large scale 4. Entities, entities, entities a. Social NER b. ELS c. Knowledge Graph
  • 3. 3 What is Meltwater? ● A business intelligence company → Providing insights from data outside the firewall (news, blogs, social media, etc.) ● Born in Oslo, in 2001. ● Founder and CEO: Jorn Lyssegen ● www.meltwater.com ● 30K+ clients all over the World. ● 1000+ employees ● 60+ offices around the world, mostly sale. ● Tech offices: USA, Germany, Sweden, Hungary, India.
  • 5. 5 What? Uses Meltwater to find out about new instances of vandalism and break-ins. Often, the victim is in need of services Uses Meltwater to help determine how public perception of certain ingredient chemicals will influence adoption & sales Uses Meltwater to be alerted of when certain patent will expire in target markets Uses Meltwater to monitor the performance and popularity of news anchors and programs Uses Meltwater social listening to estimate and prevent infrastructure attacks
  • 7. NLP & Data Science at Meltwater Unstructured Document Stream Pipeline Enrichments Search /Storage Enriched Documents High Performance Indexes Processing Services API Layer APPS Backup Storage Raw Documents 15 supported languages in pipeline (EN, DE, SV, NO, FI, ZH, JP, FR, ES, DA, NL, PT, AR, IT, HI) Typical enrichments ○ Sentiment analysis ○ Thematic analysis ○ Categorization ○ Keyphrase extraction ○ Named Entity Recognition ○ Named Entity Disambiguation
  • 8. What other than NLP? ● Recommendation Engines DOC3 DOC3 DOC3 DOC3 DOC3 DOC8 Realtime recommender engine ● Correlation and predictive pattern recognition ● Word2vec techniques concept 3 concept 1 concept 2 "British American Tobacco" or "British American Tobbaco" or (BAT near tobacco) or "英美煙草" or (("Lucky Strike" or "Dunhill" or "Pall Mall") near/15 cigarette*)
  • 10. Challenges with Data Science (NLP) at scale ● High DPS (~2000) and a lot to do (tokenization, lemmatization, stemming, POS tagging, categorization, sentiment, NER, ...), with race conditions! Pipeline Enrichments SV EN DE POS NER ● Training data labelling is costly! x15 ● Contextual information is computationally expensive. ● Noise, missing data, variation (e.g. slang), data types, ...
  • 11. Knowledge Base Strategy Entities, entities, entities don - July 2015
  • 12. What are Named Entities (NE)? ● Non-linguistic definition ○ Referable entities ○ Usually proper names ○ Single or multi-word → I know this man. He might be Charles. → He lives in Stockholm. He is Swedish.
  • 13. What is Named Entity Recognition (NER)? 1. Extracting NEs from a text. 2. Categorizing NEs into a set of predefined categories. John lives in Stockholm. He works at Ericsson. Categories: {PER, LOC, ORG, MISC, PROD}
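The two NER steps on this slide (extract, then categorize) are often represented as one tag per token. The sketch below assumes the common BIO tagging convention, which the slides do not mention, and hand-written tags for the slide's example sentence; `bio_to_spans` is a hypothetical helper, not part of any Meltwater pipeline.

```python
def bio_to_spans(tokens, tags):
    """Convert BIO tags (B-TYPE / I-TYPE / O) into (entity text, category) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the current multi-word entity.
            current_tokens.append(token)
        else:
            # "O" tag (or an inconsistent I- tag): close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hand-labelled tags for the slide's example (an assumption, not model output).
tokens = ["John", "lives", "in", "Stockholm", ".", "He", "works", "at", "Ericsson", "."]
tags = ["B-PER", "O", "O", "B-LOC", "O", "O", "O", "O", "B-ORG", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

A real tagger would predict the `tags` list; the conversion step stays the same.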
  • 14. What is NER not? ● NER is not event recognition. ● NER recognises entities in text, and classifies them in some way, but it does not create templates, nor does it perform co-reference or entity linking. ● NER is not just matching text strings with pre-defined lists of names. It only recognises entities which are being used as entities in a given context. (i.e. not easy!)
  • 15. Why NER? ● Key part of an information extraction system ● Robust handling of proper names is essential for many applications ● Pre-processing for different classification levels ● Information filtering ● Information linking ● Entity level sentiment ● Knowledge graph
  • 17. Why NER? Pepsi spooks Coke with this Halloween-themed ad. Entity-specific sentiment analysis, a.k.a. ELS
  • 18. So what about Social?
  • 19. How to do NER? (state-of-the-art) Supervised Learning ❏ Hidden Markov Model (HMM): Freitag and McCallum, 1999; Leek, 1997. ❏ Conditional Markov Model (CMM): Borthwick, 1999; McCallum et al., 2000. ✓ Conditional Random Field (CRF): Lafferty et al., 2001; Ratinov and Roth, 2009.
  • 20. Training data ● Ground truth data collection for NER is very expensive ● Solutions: ○ Automatic NER annotation using Wikipedia data ○ Applying Latent Dirichlet Allocation (LDA) based NER detection using gazetteer data.
  • 22. Gazetteers help Extensive lists of names for a specific category ● PER ○ First names (male/female) and surnames, with their frequency ● LOC ○ Cities, countries ○ Population ● ORG ○ Names of companies from Yellow Pages. Disadvantages ○ Difficult to create and maintain (or expensive if commercial) ○ Usefulness varies depending on category ○ Ambiguity ○ Words occur in several lists of different types (PER, LOC, FAC, ...)
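The ambiguity disadvantage can be illustrated with a toy lookup. The sketch below assumes three tiny hand-made gazetteers (the entries are illustrative, not real gazetteer data) and shows that a plain list match returns more than one category for a name like "Washington", which is why gazetteer hits are normally used as features rather than as final decisions.

```python
# Toy gazetteers; the entries are illustrative assumptions.
GAZETTEERS = {
    "PER": {"washington", "charles", "john"},
    "LOC": {"washington", "stockholm", "london"},
    "ORG": {"ericsson"},
}


def gazetteer_labels(token):
    """Return every category whose list contains the token (case-insensitive)."""
    return sorted(label for label, names in GAZETTEERS.items() if token.lower() in names)


print(gazetteer_labels("Washington"))  # appears in more than one list
print(gazetteer_labels("Ericsson"))
print(gazetteer_labels("lives"))
```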
  • 23. Brown clustering - motivation Let’s say we want to estimate the likelihood of the bigram "to Shanghai" without having seen it in a training set. The system can obtain a good estimate if it can cluster "Shanghai" with other city names (like “London”, “Beijing”) and then base its estimate on the likelihood of phrases such as "to London", "to Beijing" and "to Denver".
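A minimal sketch of this motivation, assuming a toy corpus and a hand-assigned CITY class (Brown clustering would learn the classes from raw text). The class-based estimate used here is the standard factorisation P(w2 | w1) ≈ P(class(w2) | class(w1)) · P(w2 | class(w2)); all names are hypothetical.

```python
from collections import Counter

# Toy corpus: "to shanghai" never occurs, but "shanghai" is seen elsewhere.
corpus = "we fly to london . we fly to beijing . we drive to denver . shanghai is large .".split()

# Hand-assigned class (assumption); words without a class act as their own class.
word_class = {"london": "CITY", "beijing": "CITY", "denver": "CITY", "shanghai": "CITY"}


def cls(word):
    return word_class.get(word, word)


unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
class_unigrams = Counter(cls(w) for w in corpus)
class_bigrams = Counter((cls(a), cls(b)) for a, b in zip(corpus, corpus[1:]))


def p_word(w1, w2):
    """Plain maximum-likelihood bigram estimate P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]


def p_class(w1, w2):
    """Class-based estimate: P(class(w2) | class(w1)) * P(w2 | class(w2))."""
    c1, c2 = cls(w1), cls(w2)
    transition = class_bigrams[(c1, c2)] / class_unigrams[c1]
    emission = unigrams[w2] / class_unigrams[c2]
    return transition * emission


print(p_word("to", "shanghai"))   # unseen bigram, so the plain estimate is zero
print(p_class("to", "shanghai"))  # non-zero thanks to the other city names
```

The plain estimate assigns zero probability to the unseen bigram, while the class-based one borrows evidence from "to london", "to beijing" and "to denver".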
  • 24. Brown clustering (1) ● Proposed by Brown et al. (1992) (a.k.a. “IBM clustering”) ● Hierarchical class-based labeling method. ● Bottom-up ● Unsupervised learning ○ Doesn't need labeled data, only a large set of raw text. ● Greedy technique to maximize bigram mutual information (MI). ● Merges words by contextual similarity.
  • 25. Brown clustering (2) ● Large amount of data ○ Similar words appear in similar contexts. ○ More precisely: similar words have similar distributions of words to their immediate left and right. ● Example: “the” and “a” are both determiners. ○ Frequency of immediate words on their left and right
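The left/right context idea can be sketched by counting immediate neighbours. The toy sentence below is an assumption chosen so that "the" and "a" end up with the same right-hand neighbours; Brown clustering itself goes further and greedily merges classes to keep mutual information high.

```python
from collections import Counter, defaultdict


def context_counts(tokens):
    """Count, for every word, the words seen immediately to its left and right."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    # Each triple is (previous word, centre word, next word).
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        left[word][prev] += 1
        right[word][nxt] += 1
    return left, right


tokens = "i saw the dog and a dog saw the cat and a cat".split()
left, right = context_counts(tokens)

print(right["the"])  # nouns that follow "the"
print(right["a"])    # nouns that follow "a"
```

On real data the distributions are only similar, not identical, so a clustering method compares them statistically instead of testing for equality.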
  • 27. Hmm...easy? ● What are the challenges in real applications? ● What about moving to other languages? ● What about moving to social domain?
  • 28. Disambiguation What is the entity category of “Washington”?
  • 29. Different languages ● Tokenization ○ Chinese & Japanese: words not separated ● Part of speech ○ Nouns ■ English: only number inflection ■ German: number, gender and case inflection ○ Verbs ■ English: regular verbs have 4, irregular verbs up to 8 distinct forms ■ Finnish: more than 10,000 forms ● NER: shape feature ○ English: only proper nouns capitalized ○ German: all nouns capitalized
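A shape feature of the kind mentioned on this slide can be sketched as follows. The exact mapping (X for upper case, x for lower case, d for digits, with repeated symbols collapsed) is one common choice and an assumption here, not necessarily the feature used in the talk.

```python
def word_shape(token):
    """Map a token to a collapsed shape string, e.g. 'Stockholm' -> 'Xx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # keep punctuation as-is
        # Collapse runs of the same symbol.
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


print(word_shape("Stockholm"))  # English proper noun
print(word_shape("house"))      # English common noun
print(word_shape("Haus"))       # German common noun, capitalized like a name
```

In English the shape separates "Stockholm" from "house", but the German common noun "Haus" gets the same shape as a proper noun, so the feature carries less information there.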
  • 31. Different languages Studying the linguistic properties of a language is important!
  • 33. Challenges in Social NER ● The performance of “off-the-shelf” NER methods degrades severely when applied to Twitter data ● Tweets ○ are short: 140-character limit. ○ cover a wide range of topics. ○ are written in grammatically broken language. ○ are written fast and posted from anywhere: a lot of misspellings. → We need a solution which considers the social characteristics of text
  • 34. Challenges in Social NER Examples of noisy data ● Jaguar's gonna like this episode of #MadMen even less than last week's, I bet. ● Dane Bowers is in Asda I cant believe.it luckiest girl in the world omf i cant believe it omg ● A feel good story RT @DailyBreezeNews: Santa Claus arrives by helicopter at LAX to greet local school
  • 35. Solution (1) Adapting existing features to social properties (the POS tagger of editorial NER performs really poorly on social documents).
  • 36. Solution (2) Weight (importance) of each CRF feature
  • 37. Results ● Training data ○ ~76K tweets labeled by a human annotator. ● Inter-annotator agreement of two annotators. ● Test data ○ ~9.1K tweets labeled by a human annotator. ● Improvement compared with the state-of-the-art method: Ritter, A. et al. Named entity recognition in tweets: An experimental study. EMNLP ’11, pages 1524–1534.
  • 38. What about sentiment?
  • 39. Document Level Sentiment - how it works Inter-annotator agreement ~80%* * http://bit.ly/human-sentiment
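Inter-annotator agreement of the kind quoted here is usually computed as the share of documents on which two annotators give the same label. The sketch below uses made-up labels chosen to land on 80%; Cohen's kappa, which corrects for chance agreement, is added as a commonly reported companion measure and is not mentioned on the slide.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up sentiment labels from two hypothetical annotators.
ann1 = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu"]
ann2 = ["pos", "neg", "neg", "neu", "neg", "pos", "pos", "neg", "pos", "neu"]

print(percent_agreement(ann1, ann2))
print(round(cohens_kappa(ann1, ann2), 3))
```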
  • 40. Document Level Sentiment - how it works Machine Learning Magic Supervised learning Naive Bayes - BernoulliNB, GaussianNB, MultinomialNB Support Vector Machines - LinearSVM, RbfSVM Maximum Entropy Model - GIS, IIS, MEGAM, TADM MLP - RecurrentNN
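To make one of the listed classifiers concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one smoothing. The four training documents are invented for illustration, and the code is not the pipeline described in the talk.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (text, label). Returns label counts, per-label word counts, vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
train = [
    ("great product love it", "positive"),
    ("love the new design", "positive"),
    ("terrible service hate it", "negative"),
    ("hate the uninspired fare", "negative"),
]
model = train_nb(train)
print(predict_nb(model, "love the design"))
print(predict_nb(model, "hate the service"))
```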
  • 41. Document Level Sentiment - how it works Machine Learning Magic
  • 42. Document Level Sentiment - current status ~60-70% (depending on language) Not too terrible, considering that human performance is at best ~80%... ...but why is it so hard?
  • 43. Document Level Sentiment - how it’s used
  • 44. Document Level Sentiment - how it’s used
  • 45. Document Level Sentiment - the problem
  • 46. Document Level Sentiment - the problem Negative Neutral
  • 47. Document Level Sentiment - the problem “Those numbers underline a growing gap between McDonald's and today's fast-food customers. It will only get wider with another year's worth of the same uninspired fare that has made McDonald's customers easy pickings for Panera Bread, Chick-fil-A, Chipotle Mexican Grill and others.” Negative Positive Does not make sense for our industry!
  • 48. Entity Level Sentiment (ELS)
  • 49. Entity Level Sentiment - motivation ● Document level sentiment (DLS) is imprecise and wrong for our customers ● Entities are of main importance for our customers ● We already have NER (Named Entity Recognition) technology Idea: Identify the sentiment towards each particular entity in a text!
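The idea can be sketched with a deliberately naive baseline: score each sentence with a small word list and credit that score to every entity mentioned in the sentence. The lexicon, the example text and the sentence-level attribution rule are all assumptions for illustration; the talk does not describe its ELS model at this level, and real systems have to handle sentences that mention several entities with different sentiment.

```python
import re

# Tiny illustrative sentiment lexicon (assumption).
POSITIVE = {"great", "love", "reliable", "impressive"}
NEGATIVE = {"poor", "recall", "disappointing", "uninspired"}


def entity_sentiment(text, entities):
    """Assign Positive/Neutral/Negative to each entity from the sentences mentioning it."""
    scores = {entity: 0 for entity in entities}
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        polarity = len(words & POSITIVE) - len(words & NEGATIVE)
        for entity in entities:
            if entity.lower() in sentence.lower():
                scores[entity] += polarity

    def label(score):
        return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

    return {entity: label(score) for entity, score in scores.items()}


text = ("The new BMW is impressive. Mercedes showed its lineup on Tuesday. "
        "Toyota announced another recall.")
# In a real pipeline the entity list would come from the NER step.
result = entity_sentiment(text, ["BMW", "Mercedes", "Toyota"])
print(result)
```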
  • 50. Entity Level Sentiment - how it works NER BMW: Positive Mercedes: Neutral Toyota: Negative
  • 51. Entity Level Sentiment - how it works Entity1: Positive Entity2: Neutral Entity3: Negative
  • 52. Entity Level Sentiment - how it works Entity1: Positive Entity2: Neutral Entity3: Negative NER
  • 54. Entity Level Sentiment - current status ● ELS is considered a very tough problem in NLP/ML ● The accuracy of state-of-the-art ELS is currently very low (~45%)
  • 55. The holy grail: The Graph Knowledge Base
  • 56. Entities + Relationships As the types of entities and their relationships grow, so does the capacity to infer insights that depend on connectivity. Eventually one can answer questions that would not be possible with separate datasets alone!
  • 57. KB Architecture Unstructured Document Stream Pipeline Enrichments Graph Search Enriched Documents High Performance Indexes Processing Services API Layer Knowledge Base (Graph) I/O External Data Providers Updates/subscriptions Lookups APPS Backup Storage Raw Documents
  • 60. Data Acquisition trade-offs High volume High quality Cheap Manual data acquisition Special crawlers, Smart algorithms Acquisitions, partnerships low quality expensive low volume
  • 61. Composing the KB - Scalability
  • 62. Scalability Requirements - next steps Companies ~ 100 million worldwide People ~ 500 million (including media influencers) Products ~ 500 million ~1 billion entities, plus all the connections between them → billions of nodes, trillions of edges!
  • 63. Composing the KB - New features
  • 64. Improve entity search - company NED
  • 65. Improve entity search - person NED Robert Gates: 22nd Secretary of Defense. William Henry Gates III: former CEO & cofounder of Microsoft. “Who is Mr. Gates?”
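One simple way to answer “Who is Mr. Gates?” is to compare the words around the mention with a short profile of each candidate. The sketch below builds the profiles from the descriptions on the slide and picks the candidate with the largest word overlap; this overlap heuristic is an assumption for illustration, not the disambiguation method used in the talk.

```python
import re

# Candidate profiles built from the descriptions on the slide.
CANDIDATES = {
    "Robert Gates": {"secretary", "defense"},
    "William Henry Gates III": {"ceo", "cofounder", "microsoft"},
}


def disambiguate(context):
    """Pick the candidate whose profile shares the most words with the context."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    # On a tie, max() returns the first candidate; a real system would abstain or rank.
    return max(CANDIDATES, key=lambda name: len(CANDIDATES[name] & words))


print(disambiguate("Mr. Gates served as Secretary of Defense."))
print(disambiguate("Mr. Gates was the CEO of Microsoft for many years."))
```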
  • 67. Map influencer network influencer score ~ e.g. PageRank
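As a sketch of how a PageRank-style influencer score could be computed, the code below runs plain power iteration over a small, invented "who follows whom" graph. The damping factor of 0.85 is the conventional default; the graph, the names and the choice of follow edges as the link type are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. links maps a node to the nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {node: 1 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Invented follow graph: an edge A -> B means "A follows B".
follows = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol", "alice"],
}
scores = pagerank(follows)
top_influencer = max(scores, key=scores.get)
print(top_influencer, round(scores[top_influencer], 3))
```

Accounts followed by many others, or by other high-scoring accounts, end up with the highest score.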
  • 68. Suggested reading ● Ratinov 2009 (challenges in NER): paper. ● ArkCMU (social): paper, code. ● Ritter et al. (social): paper, code. ● Stanford NLP NER (editorial): paper, code. ● Brown clustering ○ Brown clustering: video ○ Word Representations and N-grams: video ● Transforming Wikipedia into Named Entity Training Data: paper.