1
Adaptive Parser-Centric
Text Normalization
Congle Zhang* Tyler Baldwin**
Howard Ho** Benny Kimelfeld** Yunyao Li**
* University of Washington **IBM Research - Almaden
Public Text
Web Text
Private Text
Text
Analytics
Marketing
Financial investment
Drug discovery
Law enforcement
…
Applications
Social media
News
SEC
Internal
Data
Subscription
Data
USPTO
Text analytics is the key to discovering
hidden value from text
DREAM
REALITY
Image from http://samasource.org
CAN YOU READ THIS ON THE
FIRST ATTEMPT?
ay woundent of see ’ em
CAN YOU READ THIS ON THE FIRST ATTEMPT?
I would not have seen them.
When a machine reads it
Results from Google translation
Chinese 唉看见他们 woundent
Spanish ay woundent de verlas
Japanese ローマ法王進呈の AY woundent
Portuguese ay woundent de vê-los
German ay woundent de voir 'em
Text Normalization
• Informal writing → standard written form
9
I would not have seen them .
normalize
ay woundent of see ’ em
Challenge: Grammar
10
ay woundent of see ’ em
word-to-word normalization: would not of see them
vs. I would not have seen them.
text normalization ≠ mapping out-of-vocabulary
non-standard tokens to their in-vocabulary
standard form
Challenge: Domain Adaptation
Tailor the same text normalization
solution toward the different writing
styles of different data sources
11
Challenge: Evaluation
• Previous metrics: word error rate & BLEU score
• However,
– Words are not equally important
– Non-word information (punctuation,
capitalization) can be important
– Word reordering is important
• How does normalization actually
impact the downstream applications?
12
Adaptive Parser-Centric Text
Normalization
Grammatical
Sentence
Domain
Transferable
Parsing
performance
Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion
14
Model: Replacement Generator
15
• Replacement <i,j,s>: replace tokens x_i … x_{j−1}
with s
• Domain customization
– Generic (cross-domain) replacements
– Domain-specific replacements
Ay₁ woudent₂ of₃ see₄ ‘em₅
<2,3,”would not”>  (edit)
<1,2,”Ay”>  (same)
<1,2,”I”>  (edit)
<1,2,ε>  (delete)
<6,6,”.”>  (insert)
…
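The triples above can be sketched as a tiny replacement generator. This is an illustrative Python sketch, not the authors' code; the `SLANG` dictionary and function name are invented for the example:

```python
# Minimal sketch of a replacement generator. A replacement <i, j, s> proposes
# substituting tokens x_i .. x_{j-1} (1-based, end-exclusive) with string s.
# Two toy generators: "leave intact" and a tiny slang dictionary (assumed).

SLANG = {"'em": "them", "woudent": "would not"}  # hypothetical dictionary

def generate_replacements(tokens):
    """Return a list of (i, j, s) replacement triples."""
    replacements = []
    for i, tok in enumerate(tokens, start=1):
        replacements.append((i, i + 1, tok))      # leave-intact generator
        if tok.lower() in SLANG:                  # slang-dictionary generator
            replacements.append((i, i + 1, SLANG[tok.lower()]))
    # insert-punctuation generator: a sentence-final period at position n+1
    replacements.append((len(tokens) + 1, len(tokens) + 1, "."))
    return replacements

reps = generate_replacements(["Ay", "woudent", "of", "see", "'em"])
```

Domain customization then amounts to choosing which generators to run: generic ones for every domain, plus domain-specific dictionaries like `SLANG` above.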
Model: Boolean Variables
• Associate a unique Boolean variable Xr with
each replacement r
– Xr=true: replacement r is used to produce the
output sentence
16
<2,3,”would not”> = true
… would not …
Model: Normalization Graph
17
• A graphical model
Ay woudent of see ‘em
<4,6,”see him”>
<1,2,”Ay”> <1,2,”I”>
<2,4,”would not have”> <2,3,”would”>
<4,5,”seen”>
<5,6,”them”>
*START*
*END*
<6,6,”.”>
<3,4,”of”>
Model: Legal Assignment
• Sound
– Any two true replacements do not overlap
– <1,2,”Ay”> and <1,2,”I”> cannot be both true
• Completeness
– Every input token is captured by at least one true
replacement
18
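The two legality conditions can be checked mechanically. An illustrative sketch (positions follow the <i,j,s> convention, 1-based and end-exclusive; not the paper's implementation):

```python
# Legality of an assignment (the set of replacements whose variables are true):
# sound = no two chosen replacements overlap; complete = every token covered.

def is_legal(chosen, n_tokens):
    # Pure insertions (i == j) occupy no tokens, so ignore them for both checks.
    spans = sorted((i, j) for (i, j, _) in chosen if i < j)
    # Soundness: in sorted order, each span must start at or after the previous end.
    for (i1, j1), (i2, j2) in zip(spans, spans[1:]):
        if i2 < j1:
            return False
    # Completeness: tokens 1..n_tokens must all be covered.
    covered = [False] * (n_tokens + 1)
    for i, j in spans:
        for k in range(i, j):
            covered[k] = True
    return all(covered[1:])

legal = is_legal([(1, 2, "I"), (2, 4, "would not have"), (4, 5, "seen"),
                  (5, 6, "them"), (6, 6, ".")], 5)
```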
Model: Legal = Path
• A legal assignment: a path from start to end
19
<4,6,”see him”>
<1,2,”Ay”> <1,2,”I”>
<2,4,”would not have”> <2,3,”would”>
<4,5,”seen”>
<5,6,”them”>
*START*
*END*
<6,6,”.”>
<3,4,”of”>
I would not have see him.
Output
Model: Assignment Probability
20
• Log-linear model; feature functions on edges
<4,6,”see him”>
<1,2,”Ay”> <1,2,”I”>
<2,4,”would not have”> <2,3,”would”>
<4,5,”seen”>
<5,6,”them”>
*START*
*END*
<6,6,”.”>
<3,4,”of”>
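In symbols, the slide's log-linear model can be written as follows. This is a sketch in standard notation; Z(x) denotes the partition function and f the edge feature vector, neither named explicitly on the slide:

```latex
P(\mathbf{X} = \mathrm{path} \mid x;\, \theta)
  \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_{e \,\in\, \mathrm{path}} \theta \cdot \mathbf{f}(e, x) \Big)
```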
Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion
21
Inference
• Select the assignment with the highest
probability
• Computationally hard on general graph
models …
• But, in our model it boils down to finding the
longest path in a weighted directed
acyclic graph
22
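The longest-path inference can be sketched in a few lines. Illustrative only; node names and edge weights are made up for the running example, and the weights stand in for the learned log-linear edge scores:

```python
# Inference = weighted longest path in a DAG (linear time via topological order).

def longest_path(edges, start, end):
    """edges: dict node -> list of (next_node, weight). Returns (score, path)."""
    order, seen = [], set()
    def dfs(u):                          # topological order via DFS
        seen.add(u)
        for v, _ in edges.get(u, []):
            if v not in seen:
                dfs(v)
        order.append(u)
    dfs(start)
    best = {start: (0.0, [start])}       # best score and path reaching each node
    for u in reversed(order):            # process nodes in topological order
        if u not in best:
            continue
        score_u, path_u = best[u]
        for v, w in edges.get(u, []):
            cand = score_u + w
            if v not in best or cand > best[v][0]:
                best[v] = (cand, path_u + [v])
    return best[end]

graph = {"START": [("I", 1.0), ("Ay", 0.2)],
         "I": [("would not have", 2.0)],
         "Ay": [("would not have", 2.0)],
         "would not have": [("seen", 1.5)],
         "seen": [("them", 1.0)],
         "them": [(".", 0.5)],
         ".": [("END", 0.0)]}
score, path = longest_path(graph, "START", "END")
```

Acyclicity is what makes this tractable: edges only go forward through token positions, so no node is revisited.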
Inference
23
• weighted longest path
<4,6,”see him”>
<1,2,”Ay”> <1,2,”I”>
<2,4,”would not have”> <2,3,”would”>
<4,5,”seen”>
<5,6,”them”>
*START*
*END*
<6,6,”.”>
<3,4,”of”>
I would not have see him.
Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion
24
Learning
• Perceptron-style algorithm
– Update weights by comparing (1) the most
probable output under the current weights with
(2) the gold sequence
25
Input: (1) Informal: Ay woudent of see ‘em
(2) Gold: I would not have seen them.
(3) Graph
Output: weights of features
Learning: Gold vs. Inferred
26
<4,6,”see him”>
<1,2,”Ay”> <1,2,”I”>
<2,4,”would not have”> <2,3,”would”>
<4,5,”seen”>
<5,6,”them”>
*START*
*END*
<6,6,”.”>
<3,4,”of”>
Gold sequence
Most probable
sequence with current θ
Learning: Update Weights on the
Differential Edges
27
<4,6,”see him”>
<1,2,”Ay”> <1,2,”I”>
<2,4,”would not have”> <2,3,”would”>
<4,5,”seen”>
<5,6,”them”>
*START*
*END*
<6,6,”.”>
<3,4,”of”>
Increase w_i so that the gold sequence becomes “longer”
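One perceptron iteration can be sketched as follows. Illustrative only; the feature names and the Counter-based representation are assumptions, not the paper's implementation:

```python
# Perceptron-style update: raise weights on features of edges that appear only
# on the gold path, lower them on edges only on the inferred path, so the gold
# path's total score ("length") grows relative to the inferred one.

from collections import Counter

def perceptron_update(theta, gold_features, inferred_features, lr=1.0):
    """theta and the feature arguments map feature name -> value/weight."""
    updated = Counter(theta)
    for f, v in gold_features.items():       # add mass from the gold path
        updated[f] += lr * v
    for f, v in inferred_features.items():   # subtract mass from the inferred path
        updated[f] -= lr * v
    return updated  # features on shared edges cancel; only differential edges change

theta = Counter({"ngram:I would": 0.5, "lineage:slang": 0.1})
gold = Counter({"ngram:I would": 1, "lineage:contraction": 1})
inferred = Counter({"ngram:Ay would": 1, "lineage:slang": 1})
theta = perceptron_update(theta, gold, inferred)
```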
Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion
28
Instantiation: Replacement Generators
29
Generator               From       To
leave intact            good       good
edit distance           bac        back
lowercase               NEED       need
capitalize              it         It
Google spell            dispaear   disappear
contraction             wouldn’t   would not
slang language          ima        I am going to
insert punctuation      ε          .
duplicated punctuation  !?         !
delete filler           lmao       ε
Instantiation: Features
• N-gram
– Frequency of the phrases induced by an edge
• Part-of-speech
– Encourage certain behavior, such as avoiding the
deletion of noun phrases
• Positional
– Capitalize words after stop punctuation
• Lineage
– Which generator spawned the replacement
30
Outline
• Model
• Inference
• Learning
• Instantiation
• Evaluation
• Conclusion
31
Evaluation Metrics: Compare Parses
32
Input sentence → Human Expert → Gold sentence → Parser → Gold parse
Input sentence → Normalizer → Normalized sentence → Parser → Normalized parse
Compare the Gold parse with the Normalized parse
Focus on subjects, verbs, and objects (SVO)
Evaluation Metrics: Example
Test: I kinda wanna get ipad NEW
Gold: I kind of want to get a new iPad.
Verbs — Test: verb(get); Gold: verb(want), verb(get)
precision_v = 1/1, recall_v = 1/2
Subjects/objects — Test: subj(get,I), subj(get,wanna), obj(get,NEW);
Gold: subj(want,I), subj(get,I), obj(get,iPad)
precision_so = 1/3, recall_so = 1/3
33
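The precision/recall computation on SVO triples can be sketched as follows (illustrative; set-based matching of dependency triples is assumed):

```python
# SVO evaluation metric: compare dependency triples from the normalized parse
# against triples from the gold parse.

def prf(test_set, gold_set):
    tp = len(test_set & gold_set)                         # matching triples
    precision = tp / len(test_set) if test_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    return precision, recall

# The slide's example: "I kinda wanna get ipad NEW" vs "I kind of want to get a new iPad."
test_verbs = {("verb", "get")}
gold_verbs = {("verb", "want"), ("verb", "get")}
pv, rv = prf(test_verbs, gold_verbs)

test_so = {("subj", "get", "I"), ("subj", "get", "wanna"), ("obj", "get", "NEW")}
gold_so = {("subj", "want", "I"), ("subj", "get", "I"), ("obj", "get", "iPad")}
pso, rso = prf(test_so, gold_so)
```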
Evaluation: Baselines
• w/oN: without normalization
• Google: Google spell checker
• w2wN: word-to-word normalization [Han and
Baldwin 2011]
• Gw2wN: gold-standard word-to-word
normalizations from previous work (whenever
available)
34
Evaluation: Domains
• Twitter [Han and Baldwin 2011]
– Gold: Grammatical sentences
• SMS [Choudhury et al 2007]
– Gold: Grammatical sentences
• Call-Center Log: proprietary
– Text-based responses about users’ experience with a
call center for a major company
– Gold: Grammatical sentences
35
Evaluation: Twitter
36
• Twitter-specific replacement generators
– Hashtags (#), ats (@), and retweets (RT)
– Generators that allowed for either the initial symbol or the
entire token to be deleted
Evaluation: Twitter
System           Verb                Subject-Object
                 Pre   Rec   F1      Pre   Rec   F1
w/oN             83.7  68.1  75.1    31.7  38.6  34.8
Google           88.9  78.8  83.5    36.1  46.3  40.6
w2wN             87.5  81.5  84.4    44.5  58.9  50.7
Gw2wN            89.8  83.8  86.7    46.9  61.0  53.0
generic          91.7  88.9  90.3    53.6  70.2  60.8
domain specific  95.3  88.7  91.9    72.5  76.3  74.4
37
Domain-specific generators yielded the best overall performance
Evaluation: Twitter
System           Verb                Subject-Object
                 Pre   Rec   F1      Pre   Rec   F1
w/oN             83.7  68.1  75.1    31.7  38.6  34.8
Google           88.9  78.8  83.5    36.1  46.3  40.6
w2wN             87.5  81.5  84.4    44.5  58.9  50.7
Gw2wN            89.8  83.8  86.7    46.9  61.0  53.0
generic          91.7  88.9  90.3    53.6  70.2  60.8
domain specific  95.3  88.7  91.9    72.5  76.3  74.4
38
Even without domain-specific generators, our system
outperformed the word-to-word normalization approaches
Evaluation: Twitter
System           Verb                Subject-Object
                 Pre   Rec   F1      Pre   Rec   F1
w/oN             83.7  68.1  75.1    31.7  38.6  34.8
Google           88.9  78.8  83.5    36.1  46.3  40.6
w2wN             87.5  81.5  84.4    44.5  58.9  50.7
Gw2wN            89.8  83.8  86.7    46.9  61.0  53.0
generic          91.7  88.9  90.3    53.6  70.2  60.8
domain specific  95.3  88.7  91.9    72.5  76.3  74.4
39
Even perfect word-to-word normalization is not
good enough!
Evaluation: SMS
40
SMS-specific replacement generator:
- Mapping dictionary of SMS abbreviations
Evaluation: SMS
41
System           Verb                Subject-Object
                 Pre   Rec   F1      Pre   Rec   F1
w/oN             76.4  48.1  59.0    19.5  21.5  20.4
Google           85.1  61.6  71.5    22.4  26.2  24.1
w2wN             78.5  61.5  68.9    29.9  36.0  32.6
Gw2wN            87.6  76.6  81.8    38.0  50.6  43.4
generic          86.5  77.4  81.7    35.5  47.7  40.7
domain specific  88.1  75.0  81.0    41.0  49.5  44.8
Evaluation: Call-Center
42
Call Center-specific generator:
- Mapping dictionary of call center abbreviations
(e.g. “rep.” → “representative”)
Evaluation: Call-Center
43
System           Verb                Subject-Object
                 Pre   Rec   F1      Pre   Rec   F1
w/oN             98.5  97.1  97.8    69.2  66.1  67.6
Google           99.2  97.9  98.5    70.5  67.3  68.8
generic          98.9  97.4  98.1    71.3  67.9  69.6
domain specific  99.2  97.4  98.3    87.9  83.1  85.4
Discussion
• Domain transfer with a small amount of effort is
possible
• Performing normalization is indeed beneficial to
dependency parsing
– Simple word-to-word normalization is not enough
44
Conclusion
• Normalization framework with an eye toward
domain adaptation
• Parser-centric view of normalization
• Our system outperformed competitive baselines
over three different domains
• Dataset to spur future research
– https://www.cs.washington.edu/node/9091/
45
Team
46
Speaker Notes
  1. Much of the big data in text form is bad data that is difficult to analyze, even for human beings.
  2. The average reading speed for English is 250 words per minute. With this short sentence of only 5 tokens, one should need no more than 2 seconds.
  3. None of the translations really makes much sense!
  4. While there is a body of previous work on text normalization, in this work we seek to address several new challenges.
  5. Why fully grammatical? Most NLP algorithms are trained over news articles, such as WSJ and NYT.
  6. A replacement generator is a function that takes a sequence of tokens as input and generates one or more replacements. Each replacement is in the form of a triplet. Domain customization is done through a combination of generic replacements and domain-specific replacements.
  7. By connecting replacements with each other based on their token positions, we can construct a directed acyclic graph.
  8. The output of normalization can only be produced by a legal assignment, where a legal assignment must be both sound and complete.
  9. Essentially, each legal assignment corresponds to a path from start to end.
  10. We appeal to the log-linear model formulation to define the probability of an assignment. The probability of an assignment depends on the input as well as the weight vector of the features.
  11. When performing inference, we wish to select the output sequence with the highest probability.
  12. The goal of learning is to compute the weights of our features. We use a perceptron-style algorithm. The idea is to update the weights over iterations to minimize the difference between the gold path and the inferred path.
  13. Here is a simple demo of one iteration of learning. From the gold standard, I know the black path is the true path. But with the current weights, inference tells me the blue path is the best path.
  14. My hope is that the inferred path moves toward the true path. So the natural thing to do is to decrease the weights in the blue boxes, because they appear only in the inferred path, and increase the weights in the purple boxes, because they appear only in the true path. The update then makes the true path longer in the model, so it will be picked by our algorithm.
  15. We use features from four major sources. N-gram features indicate the frequency of the phrases induced by an edge. POS information can be used to produce features that encourage certain behavior, such as avoiding the deletion of noun phrases. Information from positions is used primarily to handle capitalization and punctuation insertion. Finally, we include binary features that indicate which generator spawned the replacement.
  16. We propose a new evaluation metric that directly equates normalization performance with the performance of a common downstream application: dependency parsing.