Introduction
The Problem
First Attempt
Second Attempt
Conclusion
The Switchabalizer
Our journey from spell checker to homophone correcter
Oskar Singer
July 23, 2014
Oskar Singer The Switchabalizer
How I got here
I am a rising senior in the UMass Amherst CS program specializing
in machine learning and natural language processing.
Last summer, I interned at an Amherst/Boston-based text
analytics company called Lexalytics
I worked with Lexalytics’ head of software engineering on this
project
Lexalytics often uses CommonCrawl, and it was a great option for
a training data set
Motivation
Lexalytics provides sentiment analysis software
Sentiment analysis relies heavily on sentence parsing and
part-of-speech tagging
Misspellings and misusage can do serious damage to accuracy for
those two tasks
The Approach
The Weaknesses
Approach
We employed an open-source spell-checker called Hunspell
Hunspell gives an unranked list of correction suggestions
So we took the argmax of a home-baked scoring function that:
penalized string edit distance
penalized keyboard distance
rewarded high word frequencies, which were harvested from
CommonCrawl data
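The slides do not show the scoring function itself, so the following is only a minimal sketch of how the three signals could be combined. The weights, the QWERTY coordinate table, and all function names are assumptions for illustration; `suggestions` stands in for whatever candidate list Hunspell returns, and `freq` for word counts harvested from a corpus.

```python
import math

# Assumed QWERTY layout, used only to approximate how far apart two keys are.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def keyboard_distance(a, b):
    """Crude proxy: sum of key-to-key distances over aligned positions."""
    total = 0.0
    for ca, cb in zip(a, b):
        if ca != cb and ca in KEY_POS and cb in KEY_POS:
            (r1, c1), (r2, c2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
    return total

def score(word, candidate, freq, w_edit=1.0, w_key=0.5, w_freq=1.0):
    """Penalize both distances, reward log word frequency (weights are guesses)."""
    return (-w_edit * edit_distance(word, candidate)
            - w_key * keyboard_distance(word, candidate)
            + w_freq * math.log(freq.get(candidate, 0) + 1))

def best_correction(word, suggestions, freq):
    """Argmax of the score over the unranked suggestion list."""
    return max(suggestions, key=lambda c: score(word, c, freq))
```

With made-up counts such as `{"the": 1000, "thy": 5}`, `best_correction("thr", ["thy", "the"], ...)` prefers "the" because it is both closer on the keyboard and far more frequent.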
Failure
Hunspell had an error rate of 216%
What Happened?
How is this possible? Two reasons:
Hunspell missed all the mistakes
Hunspell made false corrections
What Happened?
Hunspell was a poor choice for a couple of reasons:
Hunspell’s vocabulary is not appropriate or flexible enough for
the Twitter domain
Hunspell can’t detect correctly spelled words that are out of
context
What Happened?
Twitter’s vocabulary of abbreviations and acronyms is constantly
growing
Hunspell’s internal dictionary is not prepared for this
What Happened?
Example: ur
What was Hunspell’s correction?
Ur (the ancient Sumerian city-state)
What Happened?
When the issue is misuse rather than misspelling, Hunspell
completely ignores the problem
Specifically, commonly misused homophones were a huge problem
in our data
Examples: two/too/2/to; their/there/they’re; your/you’re
Brainstorm
The Approach
Testing and Results
Addressing Misusage
How do we capture the idea of misuse?
Context
How can we capture context?
Rule set?
Probabilistic approach!
Probability Model
Bayes network
Conditioned on the preceding and succeeding words
Assumes these two words are independent
Does not use bag-of-words approach (considers position)
Probability Model
Conditional Probability of Preceding or Succeeding Words
\[ P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i\, w_j)}{\#(w_j)}, \]
where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\ast)$
is the number of occurrences of a sequence of words
\[ P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j\, w_i)}{\#(w_j)}, \]
where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word
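As a rough illustration of the two estimates above, the counts can be collected with `collections.Counter`. This is a sketch under assumptions: the function names are hypothetical, the corpus is a plain list of tokens, and an unseen $w_j$ is given probability 0 rather than being smoothed.

```python
from collections import Counter

def count_ngrams(tokens):
    """Return unigram counts #(w) and bigram counts #(w1 w2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_pre(w_i, w_j, unigrams, bigrams):
    """Estimate P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j)."""
    return bigrams[(w_i, w_j)] / unigrams[w_j] if unigrams[w_j] else 0.0

def p_suc(w_i, w_j, unigrams, bigrams):
    """Estimate P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j)."""
    return bigrams[(w_j, w_i)] / unigrams[w_j] if unigrams[w_j] else 0.0

tokens = "i want to go to the store".split()
unigrams, bigrams = count_ngrams(tokens)
print(p_pre("want", "to", unigrams, bigrams))  # "want to" once, "to" twice
```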
Probability Model
Conditional Probability of Both Words
\[ P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j) \]
\[ \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j) \]
The first equation holds because of our assumption of
independence between the preceding and succeeding words
There is a missing term in the scoring function that I will address
in the Future Work section
Switchable Sets
Only certain groups should be compared, e.g. "too" should not be
scored against "their"
Comparable switchables are grouped into switchable sets
Each switchable is mapped to its switchable set
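One way to sketch this mapping is a dictionary from each word to the set it belongs to. The three sets below come from the homophone examples earlier in the talk; the names `SWITCHABLE_SETS`, `SET_OF`, and `candidates` are assumptions, and the apostrophes are written in plain ASCII.

```python
# Switchable sets taken from the examples in the talk.
SWITCHABLE_SETS = [
    frozenset({"to", "too", "two", "2"}),
    frozenset({"their", "there", "they're"}),
    frozenset({"your", "you're"}),
]

# Map every switchable to the set that contains it.
SET_OF = {word: group for group in SWITCHABLE_SETS for word in group}

def candidates(word):
    """Return the switchable set for a word, or an empty set if it has none."""
    return SET_OF.get(word.lower(), frozenset())

print(sorted(candidates("too")))
print(candidates("banana"))
```

Because lookups go through `SET_OF`, "too" is only ever compared with members of its own set and never with "their".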
Picking the Word
The Final Equation
\[ S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) \]
\[ v^* = \arg\max_{v \in V_{w_j}} S(w_i, v, w_k) \]
where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i\, w_j\, w_k$,
$V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^*$ is the
ideal switchable
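The selection step can be sketched in a few lines of Python. This is not the original implementation: the function names are hypothetical, the toy corpus is invented, and the small `eps` added before taking logs is an assumed workaround for unseen bigrams that the slides do not discuss.

```python
import math
from collections import Counter

def train(tokens):
    """Collect unigram and bigram counts from a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def score(w_i, w_j, w_k, unigrams, bigrams, eps=1e-9):
    """S(w_i, w_j, w_k) = log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    n = unigrams[w_j]
    if n == 0:
        return float("-inf")  # candidate never seen in training data
    p_pre = bigrams[(w_i, w_j)] / n
    p_suc = bigrams[(w_j, w_k)] / n
    return math.log(p_pre + eps) + math.log(p_suc + eps)

def pick(w_i, w_k, switchable_set, unigrams, bigrams):
    """Return the member of the switchable set with the highest score."""
    return max(switchable_set,
               key=lambda v: score(w_i, v, w_k, unigrams, bigrams))

corpus = ("i went to the store . it was too late . "
          "i have two dogs . we went to the park").split()
unigrams, bigrams = train(corpus)
print(pick("went", "the", {"to", "too", "two"}, unigrams, bigrams))
```

Note that `pick` never looks at the word actually written between the neighbors; only the context and the candidate set matter.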
First Pass
What about common misspellings that intersect with switchables?
Example: "ur"
Should we put them in the switchable sets?
First Pass
My opinion: no!
Realistically, it's probably okay. I opted for a more elegant solution
Replace all common misspellings with something from the
appropriate switchable set
The model’s results are agnostic to the switchable that activates it
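A minimal sketch of such a first pass is a lookup table applied before scoring. The table below is an assumption for illustration: only "ur" comes from the talk, and the other entry is a made-up example. Since the model's choice does not depend on which member of the set triggers it, any member works as the replacement.

```python
# Hypothetical table: common misspelling -> any member of its switchable set.
COMMON_MISSPELLINGS = {
    "ur": "your",      # example from the talk
    "thier": "their",  # assumed extra entry
}

def first_pass(tokens):
    """Replace known misspellings so the switchable model gets activated."""
    return [COMMON_MISSPELLINGS.get(tok.lower(), tok) for tok in tokens]

print(first_pass(["ur", "late"]))
```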
Testing
Assume Wikipedia has correct usage of all switchables
Replace target words in Wikipedia articles with words from their
switchable set
Run the Switchabalizer on corrupted articles
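The test procedure could be sketched roughly as follows. The slides do not say how words were chosen for replacement or how the error rate was computed, so both are assumptions here: every switchable is replaced by a random member of its set (which may be the original word), and the error rate is the fraction of target positions where the corrected text differs from the original. The names and the two small sets are hypothetical.

```python
import random

# Assumed switchable sets, kept small for the example.
GROUPS = [frozenset({"to", "too", "two"}), frozenset({"their", "there"})]
SET_OF = {word: group for group in GROUPS for word in group}

def corrupt(tokens, set_of, rng):
    """Swap each switchable for a random member of its set.

    Returns the corrupted tokens and the positions that were targeted.
    """
    corrupted, positions = [], []
    for idx, tok in enumerate(tokens):
        group = set_of.get(tok)
        if group:
            corrupted.append(rng.choice(sorted(group)))
            positions.append(idx)
        else:
            corrupted.append(tok)
    return corrupted, positions

def error_rate(original, corrected, positions):
    """Fraction of targeted positions where the correction is wrong."""
    if not positions:
        return 0.0
    wrong = sum(original[i] != corrected[i] for i in positions)
    return wrong / len(positions)

tokens = ["i", "went", "to", "their", "house"]
corrupted, positions = corrupt(tokens, SET_OF, random.Random(0))
print(corrupted, positions)
```

A corrector would then be run on `corrupted`, and its output compared with `tokens` through `error_rate`.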
Results
How did we do?
20% error
Future Work
Call to Action
Future Work
Ideal Scoring Function
\[ S(w_i, w_j, w_k) = \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k)) = \log\big( P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j) \big) \]
I forgot the $P(w_j)$ term in the factorization of the joint distribution,
so the implemented score was the conditional distribution rather than the joint one, which fits the task slightly less well.
Remember this for reimplementation!
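For a reimplementation, the joint score might look like the sketch below. It assumes $P(w_j)$ is estimated as the unigram count divided by the total token count; that estimate, the `eps` guard against unseen bigrams, and the function name are assumptions not stated in the slides.

```python
import math
from collections import Counter

def ideal_score(w_i, w_j, w_k, unigrams, bigrams, total, eps=1e-9):
    """log P(w_j) + log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    n = unigrams[w_j]
    if n == 0:
        return float("-inf")  # candidate never seen in training data
    prior = n / total          # assumed estimate of P(w_j)
    p_pre = bigrams[(w_i, w_j)] / n
    p_suc = bigrams[(w_j, w_k)] / n
    return math.log(prior) + math.log(p_pre + eps) + math.log(p_suc + eps)

tokens = "a b c a b d".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(ideal_score("a", "b", "c", unigrams, bigrams, len(tokens)))
```

The prior favors candidates that are frequent overall, which matters when two members of a switchable set fit the context about equally well.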
Future Work
Testing conditions were not ideal because:
Test data is not target data
Mistakes are contrived
Somebody make a labeled test set, then tune the algorithm to it!
Future Work
Here are some ideas I had for future experiments:
Use a discriminative model like maximum entropy
Consider higher order neighbor words
Implement for other languages
Start Coding!
Anyone else can do this too!
Straightforward probability model
25-50 lines of Python
Freely accessible data from CommonCrawl!
Go learn about ML and NLP! Get your hands dirty and add your
own mods! Find new problems and try new solutions!
Thank You, CommonCrawl!
Thanks so much to Lisa, Stephen, Grace and the rest of the team
for providing such a fantastic resource and bringing me down to
San Francisco to present!
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 

Empfohlen (20)

Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 

The Switchabalizer - our journey from spell checker to homophone correcter

  • 1. Introduction The Problem First Attempt Second Attempt Conclusion The Switchabalizer Our journey from spell checker to homophone correcter Oskar Singer July 23, 2014 Oskar Singer The Switchabalizer
  • 5. How I got here: I am a rising senior in the UMass Amherst CS program specializing in machine learning and natural language processing. Last summer, I interned at an Amherst/Boston-based text analytics company called Lexalytics. I worked with Lexalytics’ head of software engineering on this project. Lexalytics often uses CommonCrawl, and it was a great option for a training data set.
  • 8. Motivation: Lexalytics provides sentiment analysis software. Sentiment analysis relies heavily on sentence parsing and part-of-speech tagging. Misspellings and misuse can do serious damage to accuracy on those two tasks.
  • 14. Approach: We employed an open-source spell checker called Hunspell. Hunspell gives an unranked list of correction suggestions, so we took the argmax of a home-baked scoring function that penalized string edit distance, penalized keyboard distance, and rewarded high word frequencies, which were harvested from CommonCrawl data.
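The ranking step described on this slide can be sketched roughly as follows. The slides do not give the actual scoring function, so the weights, the log-frequency reward, and the simple QWERTY grid model are all assumptions made for illustration; the candidate list stands in for whatever Hunspell returns.

```python
import math
from typing import Dict, List

# ASSUMPTION: a plain three-row QWERTY grid; real keyboards are staggered.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def keyboard_distance(a: str, b: str) -> float:
    """Sum of key-to-key distances at aligned positions that differ.

    Characters missing from the grid and any length difference cost 1 each.
    """
    total = 0.0
    for ca, cb in zip(a, b):
        if ca == cb:
            continue
        if ca in KEY_POS and cb in KEY_POS:
            (r1, c1), (r2, c2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
        else:
            total += 1.0
    return total + abs(len(a) - len(b))


def score(typed: str, candidate: str, freq: Dict[str, int],
          w_edit: float = 1.0, w_key: float = 0.5, w_freq: float = 1.0) -> float:
    """Higher is better. The weights are hypothetical, not the talk's values."""
    return (w_freq * math.log(freq.get(candidate, 0) + 1)
            - w_edit * edit_distance(typed, candidate)
            - w_key * keyboard_distance(typed, candidate))


def best_suggestion(typed: str, suggestions: List[str], freq: Dict[str, int]) -> str:
    """Argmax of the score over an unranked suggestion list."""
    return max(suggestions, key=lambda cand: score(typed, cand, freq))
```

With these particular weights, keyboard distance can outweigh frequency, so the weights would need tuning on real data before the ranking is useful.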
  • 16. Failure: Hunspell had an error rate of 216%.
  • 19. What Happened? How is this possible? Two reasons: Hunspell missed all the mistakes, and Hunspell made false corrections.
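The slides do not say how the error rate was computed. One reading that allows a value above 100%, assumed here, is the number of errors left after correction divided by the number of errors present before it. The counts below are hypothetical and serve only to show how the two reasons combine under that reading.

```python
def relative_error_rate(original_errors: int, missed: int, false_corrections: int) -> float:
    """Errors remaining after correction, relative to errors before it.

    ASSUMPTION: this definition is a guess; the talk does not state its metric.
    """
    return (missed + false_corrections) / original_errors


# Hypothetical counts: every real mistake is missed, and new ones are introduced.
rate = relative_error_rate(original_errors=100, missed=100, false_corrections=116)
print(f"{rate:.0%}")  # 216%
```

Under this reading, missing every mistake alone gives 100%, and each false correction pushes the rate higher.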
  • 22. What Happened? Hunspell was a poor choice for a couple of reasons: Hunspell’s vocabulary is not appropriate or flexible enough for the Twitter domain, and Hunspell can’t detect correctly spelled words that are used out of context.
  • 24. What Happened? Twitter’s vocabulary of abbreviations and acronyms is constantly growing. Hunspell’s internal dictionary is not prepared for this.
  • 27. What Happened? Example: "ur". What was Hunspell’s correction? Ur (the ancient Sumerian city-state).
  • 30. What Happened? When the issue is misuse rather than misspelling, Hunspell completely ignores the problem. Specifically, commonly misused homophones were a huge problem in our data. Examples: two/too/2/to; their/there/they’re; your/you’re.
  • 35. Addressing Misusage: How do we capture the idea of misuse? Context. How can we capture context? Rule set? Probabilistic approach!
  • 36. Probability Model: Bayes network, conditioned on the preceding and succeeding words. Assumes these two words are independent. Does not use a bag-of-words approach (considers position).
  • 38. Probability Model, Conditional Probability of Preceding or Succeeding Words: $P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i\, w_j)}{\#(w_j)}$, where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\ast)$ is the number of occurrences of a sequence of words. $P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j\, w_i)}{\#(w_j)}$, where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word.
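The two count ratios above can be estimated from a token stream roughly as follows. This is a minimal sketch: the whitespace tokenization and the toy sentence are assumptions, and the talk's actual counts came from CommonCrawl.

```python
from collections import Counter
from typing import List, Tuple


def count_ngrams(tokens: List[str]) -> Tuple[Counter, Counter]:
    """Return unigram counts #(w) and adjacent-pair counts #(w1 w2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_pre(w_i: str, w_j: str, unigrams: Counter, bigrams: Counter) -> float:
    """P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j)."""
    if unigrams[w_j] == 0:
        return 0.0
    return bigrams[(w_i, w_j)] / unigrams[w_j]


def p_suc(w_i: str, w_j: str, unigrams: Counter, bigrams: Counter) -> float:
    """P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j)."""
    if unigrams[w_j] == 0:
        return 0.0
    return bigrams[(w_j, w_i)] / unigrams[w_j]


# Toy data, purely illustrative.
tokens = "i want to go to the store too".split()
unigrams, bigrams = count_ngrams(tokens)
print(p_pre("want", "to", unigrams, bigrams))  # 0.5
print(p_suc("the", "to", unigrams, bigrams))   # 0.5
```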
  • 41. Probability Model, Conditional Probability of Both Words: $P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j)$, and therefore $\log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j)$. The first equation holds because of our assumption of independence between the preceding and succeeding words. There is a missing term in the scoring function that I will address in the Future Work section.
  • 44. Switchable Sets: Only certain groups should be compared, e.g. "too" should not be scored against "their". Comparable switchables are grouped in switchable sets. Each switchable is mapped to its switchable set.
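One way to hold this mapping is a dictionary from each switchable to a frozen set of its alternatives. The sketch below is an assumption about the data structure, not the talk's code; the three groups are the homophone examples listed earlier in the slides.

```python
from typing import Dict, FrozenSet, Iterable, List


def build_switchable_index(groups: Iterable[List[str]]) -> Dict[str, FrozenSet[str]]:
    """Map every switchable word to the set it belongs to.

    ASSUMPTION: a word may belong to only one switchable set.
    """
    index: Dict[str, FrozenSet[str]] = {}
    for group in groups:
        frozen = frozenset(group)
        for word in frozen:
            if word in index:
                raise ValueError(f"{word!r} appears in more than one switchable set")
            index[word] = frozen
    return index


SWITCHABLE_GROUPS = [
    ["two", "too", "2", "to"],
    ["their", "there", "they're"],
    ["your", "you're"],
]
SWITCHABLE_INDEX = build_switchable_index(SWITCHABLE_GROUPS)
```

A lookup such as `SWITCHABLE_INDEX["too"]` then yields only the to/too/two/2 group, so "too" is never compared with "their".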
  • 45. Picking the Word, The Final Equation: $S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j)$ and $v^* = \operatorname{argmax}_{v \in V_{w_j}} S(w_i, v, w_k)$, where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i\, w_j\, w_k$, $V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^*$ is the ideal switchable.
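Put together, the selection rule might look like the sketch below. Two things here are assumptions rather than anything stated in the slides: add-alpha smoothing is used so that unseen pairs do not produce log(0), and the toy corpus and two-word switchable set exist only to make the example runnable.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple


def train(tokens: List[str]) -> Tuple[Counter, Counter]:
    """Count single words and adjacent word pairs."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def log_cond(pair_count: int, center_count: int, vocab_size: int, alpha: float = 1.0) -> float:
    """Smoothed log of #(pair) / #(center). ASSUMPTION: add-alpha smoothing."""
    return math.log((pair_count + alpha) / (center_count + alpha * vocab_size))


def score(prev: str, center: str, nxt: str, unigrams: Counter, bigrams: Counter) -> float:
    """S(prev, center, nxt) = log P(pre(prev) | center) + log P(suc(nxt) | center)."""
    vocab_size = len(unigrams)
    return (log_cond(bigrams[(prev, center)], unigrams[center], vocab_size)
            + log_cond(bigrams[(center, nxt)], unigrams[center], vocab_size))


def pick(prev: str, center: str, nxt: str,
         switchable_index: Dict[str, FrozenSet[str]],
         unigrams: Counter, bigrams: Counter) -> str:
    """Return the best-scoring member of center's switchable set, or center itself."""
    candidates = switchable_index.get(center)
    if not candidates:
        return center
    # sorted() only makes tie-breaking deterministic.
    return max(sorted(candidates), key=lambda v: score(prev, v, nxt, unigrams, bigrams))


# Toy corpus and a hypothetical two-word switchable set.
corpus = "i want to go home it is too late to go there".split()
unigrams, bigrams = train(corpus)
group = frozenset({"to", "too"})
switchable_index = {"to": group, "too": group}

print(pick("want", "too", "go", switchable_index, unigrams, bigrams))  # to
print(pick("is", "to", "late", switchable_index, unigrams, bigrams))   # too
```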
  • 48. First Pass: What about common misspellings that intersect with switchables? Example: "ur". Should we put them in the switchable sets?
  • 52. First Pass: My opinion: no! Realistically, it's probably okay, but I opted for a more elegant solution: replace all common misspellings with something from the appropriate switchable set. The model’s results are agnostic to the switchable that activates it.
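A first pass of this kind can be as small as a lookup table applied before scoring. The table entries below are hypothetical examples, apart from "ur", which is the one case the slides mention.

```python
from typing import Dict, List

# ASSUMPTION: illustrative entries; each maps a common misspelling to any one
# member of the switchable set it belongs with.
FIRST_PASS: Dict[str, str] = {
    "ur": "your",
    "thier": "their",
}


def first_pass(tokens: List[str]) -> List[str]:
    """Replace known misspellings so the later scoring step sees a real switchable."""
    return [FIRST_PASS.get(token.lower(), token) for token in tokens]


print(first_pass(["is", "ur", "dog", "over", "thier"]))
# ['is', 'your', 'dog', 'over', 'their']
```

Which member of the set the misspelling maps to should not matter, because the scoring step then compares every member of that set against the context anyway.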
  • 55. Testing: Assume Wikipedia has correct usage of all switchables. Replace target words in Wikipedia articles with words from their switchable set. Run the Switchabalizer on the corrupted articles.
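The test procedure can be sketched as a corrupt-then-correct loop. The slides do not say whether every switchable was replaced or exactly how the error was computed, so both choices below are assumptions: every switchable is swapped for a different member of its set, and the error is the fraction of those positions not restored. `correct_fn` is a hypothetical stand-in for the corrector.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def corrupt(tokens: List[str], switchable_index: Dict[str, FrozenSet[str]],
            rng: random.Random) -> List[str]:
    """Swap every switchable for a different member of its set."""
    corrupted = []
    for token in tokens:
        group = switchable_index.get(token)
        alternatives = sorted(group - {token}) if group else []
        corrupted.append(rng.choice(alternatives) if alternatives else token)
    return corrupted


def error_rate(gold: List[str], corrupted: List[str],
               correct_fn: Callable[[str, str, str], str],
               switchable_index: Dict[str, FrozenSet[str]]) -> float:
    """Fraction of switchable positions where the corrector's output differs from gold."""
    positions = [i for i, token in enumerate(gold) if token in switchable_index]
    if not positions:
        return 0.0
    wrong = 0
    for i in positions:
        prev = corrupted[i - 1] if i > 0 else "<s>"
        nxt = corrupted[i + 1] if i + 1 < len(corrupted) else "</s>"
        if correct_fn(prev, corrupted[i], nxt) != gold[i]:
            wrong += 1
    return wrong / len(positions)


# Toy run with a hypothetical two-word set and a trivial corrector.
group = frozenset({"to", "too"})
index = {"to": group, "too": group}
gold = "i want to go too".split()
noisy = corrupt(gold, index, random.Random(0))
always_to = lambda prev, word, nxt: "to"
print(noisy)                                      # ['i', 'want', 'too', 'go', 'to']
print(error_rate(gold, noisy, always_to, index))  # 0.5
```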
  • 57. Results: How did we do? 20% error.
  • 59. Future Work, Ideal Scoring Function: $S(w_i, w_j, w_k) = \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k)) = \log\big(P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)\big)$. I forgot the $P(w_j)$ term in the factorization of the joint distribution, which resulted in a slightly unfitting conditional distribution. Remember this for reimplementation!
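To see what the extra term does in terms of counts, the ideal score can be expanded as below. This is a sketch under one assumption not stated in the slides: $P(w_j)$ is estimated as the relative unigram frequency $\#(w_j)/N$, where $N$ is the total number of tokens in the training data.

```latex
\begin{align*}
P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)
  &= \frac{\#(w_j)}{N} \cdot \frac{\#(w_i\, w_j)}{\#(w_j)} \cdot \frac{\#(w_j\, w_k)}{\#(w_j)} \\
  &= \frac{\#(w_i\, w_j)\, \#(w_j\, w_k)}{N\, \#(w_j)} \\
S(w_i, w_j, w_k)
  &= \log \#(w_i\, w_j) + \log \#(w_j\, w_k) - \log \#(w_j) - \log N
\end{align*}
```

Since $\log N$ is the same for every candidate, it drops out of the argmax. Compared with the score actually used, which subtracts $2 \log \#(w_j)$, the ideal score subtracts $\log \#(w_j)$ only once, so frequent candidates are penalized less.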
  • 63. Future Work: Testing conditions were not ideal because the test data is not the target data and the mistakes are contrived. Somebody make a labeled test set, then tune the algorithm to it!
  • 67. Future Work: Here are some ideas I had for future experiments: use a discriminative model like maximum entropy, consider higher-order neighbor words, and implement for other languages.
  • 72. Start Coding! Anyone else can do this too! Straightforward probability model. 25-50 lines of Python. Freely accessible data from CommonCrawl! Go learn about ML and NLP! Get your hands dirty and add your own mods! Find new problems and try new solutions!
  • 73. Thank You, CommonCrawl! Thanks so much to Lisa, Stephen, Grace and the rest of the team for providing such a fantastic resource and bringing me down to San Francisco to present!