Duplicate Detection via Topic Modeling
HomeAway Key Facts
● 1,300,000+ global vacation rental listings
● 200,000,000+ vacation days / year
● ~190 countries, 22 languages
● HQ in Austin, TX; part of Expedia, Inc
--> Capable competition and fraud vectors
Competitive Intelligence
Breckenridge, Colorado
HomeAway in blue
Breckenridge, zoomed in
Same Property
The Property Descriptions
Why Property Descriptions?
● Almost identical text
● Similar descriptions seemed probable
○ Consistent owner branding, easy to replicate
● Tech team wanted to use natural language processing techniques
● Didn’t know if this would work when we began
The Other Guys
There are truly inspiring views at High Point Retreat and
plenty of places to sit and enjoy them. Take a load off in one
of the many rooms with views of the ski mountain and
remember how lucky you are to live like this. Cozy up with
family in the sunken living room and chat for hours on end.
Sit in a circle of tree stumps around the outdoor fire pit and
roast marshmallows. After all that sitting, youll be more than
happy to walk 250 yards to the free shuttle to get the blood
pumping again. Then, have a seat and enjoy your free ride.
Best. Vacation. Ever. Vacation homes allow families to
stay...together. At InvitedHome, we think that's pretty
important, so we do everything in our power to make your
vacation totally epic. Not only do we choose the best homes
in the best destinations, but we make the experience
effortless so you can really enjoy yourself. Our team will
stock your fridge, babysit the kids, cater your party, plan your
day trip, make reservations, and do whatever we can to
make sure you have the Best. Vacation. Ever.
HomeAway
There are truly inspiring views at High Point Retreat and
plenty of places to sit and enjoy them. Take a load off in one
of the many rooms with views of the ski mountain and
remember how lucky you are to live like this. Cozy up with
family in the sunken living room and chat for hours on end.
Sit in a circle of tree stumps around the outdoor fire pit and
roast marshmallows. After all that sitting, you’ll be more
than happy to walk 250 yards to the free shuttle to get the
blood pumping again. Then, have a seat and enjoy your free
ride.
Best.Vacation.Ever. Vacation homes allow families to stay...
together. At InvitedHome, we think that's pretty important,
so we do everything in our power to make your vacation
totally epic. Not only do we choose the best homes in the
best destinations, but we make the experience effortless so
you can really enjoy yourself. Let us connect you with the
best options in town for babysitting, equipment rental,
transportation, catering, day trips, shopping, dining, and
even stocking your fridge with groceries! We’ll do everything
in our power to make sure you have the Best. Vacation.
Ever.
Initial Approach: TF-IDF and Cosine Distance
Worked great, but...
● “Large” vocabulary size: ~6300 tokens -> 6300 dimensions and millions of sparse vectors
● A little slow (took a week to process the US)
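The initial approach can be sketched in a few lines of plain Scala. This is a minimal illustration, not the production Spark job: the object name `TfIdfCosine`, the regex tokenizer, and the smoothed IDF formula are all assumptions made for the example. Each description becomes a sparse token-to-weight map, and near-duplicate descriptions end up with a cosine distance close to 0.

```scala
object TfIdfCosine {
  // Lowercase and split on anything that is not a letter or apostrophe.
  // A stand-in for a real tokenizer (assumption for this sketch).
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z']+").filter(_.nonEmpty).toSeq

  // Build one sparse TF-IDF vector (token -> weight) per document.
  def tfidf(docs: Seq[String]): Seq[Map[String, Double]] = {
    val tokenized = docs.map(tokenize)
    val n = docs.size.toDouble
    // Document frequency: how many documents contain each token.
    val df: Map[String, Int] =
      tokenized.flatMap(_.distinct).groupBy(identity).map { case (t, ts) => t -> ts.size }
    tokenized.map { tokens =>
      tokens.groupBy(identity).map { case (t, ts) =>
        // Smoothed IDF (one common variant, chosen here as an assumption).
        val idf = math.log((n + 1.0) / (df(t) + 1.0)) + 1.0
        t -> ts.size * idf
      }
    }
  }

  // Cosine distance = 1 - cosine similarity; 0 means same direction.
  def cosineDistance(a: Map[String, Double], b: Map[String, Double]): Double = {
    val dot = a.keySet.intersect(b.keySet).toSeq.map(t => a(t) * b(t)).sum
    val normA = math.sqrt(a.values.map(v => v * v).sum)
    val normB = math.sqrt(b.values.map(v => v * v).sum)
    if (normA == 0.0 || normB == 0.0) 1.0 else 1.0 - dot / (normA * normB)
  }
}
```

Every distinct token is its own dimension, which is why a ~6300-token vocabulary yields 6300-dimensional sparse vectors, and comparing millions of them pairwise is where the time goes.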
Spark Clusters?
Topic Modeling?
Other Distance Metrics?
Latent Dirichlet Allocation (Topic Modeling)
D. M. Blei, “Probabilistic Topic Models,” Communications of the ACM, Vol. 55 No. 4, Pages 77-84, doi:10.1145/2133806.2133826
Topic Modeling Motivations
● Smaller dimensional space
● Faster processing times
● At the end, we’d have Topic Models
Must be useful for duplicate detection
We used Spark’s ML APIs for this:
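import org.apache.spark.ml.clustering.LDA

// numTopics, params and featureCol are defined elsewhere in the job
val countLDA = new LDA()
  .setK(numTopics)                              // number of topics to learn
  .setMaxIter(params.maxIterations)             // cap on training iterations
  .setSeed(params.randomSeed)                   // fixed seed for reproducible runs
  .setFeaturesCol(featureCol)                   // input vectors (LDA expects term counts)
  .setTopicDistributionCol("topicDistribution") // output: per-document topic mixture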
Distances between Topic Distributions
Euclidean Manhattan Cosine
Jensen-Shannon Hellinger
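For reference, the five metrics named on this slide can be written for a pair of topic distributions as below. This is a sketch under stated assumptions: the object name `TopicDistances` is hypothetical, both inputs are assumed to be equal-length probability vectors (non-negative, summing to 1), natural logarithms are used, and Jensen-Shannon is returned as a distance (the square root of the divergence). The deck does not say which conventions were actually used.

```scala
object TopicDistances {
  type Dist = Array[Double] // topic proportions: non-negative, summing to 1

  def euclidean(p: Dist, q: Dist): Double =
    math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)

  def manhattan(p: Dist, q: Dist): Double =
    p.zip(q).map { case (a, b) => math.abs(a - b) }.sum

  def cosine(p: Dist, q: Dist): Double = {
    val dot = p.zip(q).map { case (a, b) => a * b }.sum
    val norms = math.sqrt(p.map(a => a * a).sum) * math.sqrt(q.map(b => b * b).sum)
    1.0 - dot / norms
  }

  // Kullback-Leibler divergence (natural log); terms with a == 0 contribute 0.
  private def kl(p: Dist, q: Dist): Double =
    p.zip(q).collect { case (a, b) if a > 0.0 => a * math.log(a / b) }.sum

  // Jensen-Shannon distance: sqrt of the average KL divergence to the midpoint m.
  // m is positive wherever p or q is, so kl never divides by zero here.
  def jensenShannon(p: Dist, q: Dist): Double = {
    val m = p.zip(q).map { case (a, b) => (a + b) / 2.0 }
    math.sqrt(math.max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))
  }

  // Hellinger distance: Euclidean distance between square-rooted vectors, scaled to [0, 1].
  def hellinger(p: Dist, q: Dist): Double = {
    val sumSq = p.zip(q).map { case (a, b) =>
      val d = math.sqrt(a) - math.sqrt(b)
      d * d
    }.sum
    math.sqrt(sumSq) / math.sqrt(2.0)
  }
}
```

Jensen-Shannon and Hellinger are defined specifically for probability distributions, which is what LDA's `topicDistribution` column holds; the first three treat the same vectors as ordinary points in space.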
How to make something useful?
This is a machine learning effort
Interquartile Ranges are more resilient to outliers than standard deviations
IQRs bring information about the entire set of possible duplicates
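A small sketch of the first claim: one extreme value leaves the IQR unchanged while inflating the standard deviation. The object name `Spread` and its helpers are hypothetical. `quantile` uses linear interpolation between order statistics (the same rule as R's default quantile type) and assumes a non-empty input. The deck does not define how its `iqrs` feature was computed, so `iqrsBelowQ1` is only one plausible reading: how many IQRs a candidate's distance sits below the first quartile of all candidate distances.

```scala
object Spread {
  // Quantile by linear interpolation between order statistics (xs must be non-empty).
  def quantile(xs: Seq[Double], p: Double): Double = {
    val sorted = xs.sorted
    val pos = p * (sorted.size - 1)
    val lo = math.floor(pos).toInt
    val hi = math.ceil(pos).toInt
    sorted(lo) + (pos - lo) * (sorted(hi) - sorted(lo))
  }

  // Interquartile range: spread of the middle 50% of the values.
  def iqr(xs: Seq[Double]): Double = quantile(xs, 0.75) - quantile(xs, 0.25)

  // Sample standard deviation (n - 1 denominator).
  def stdDev(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
  }

  // Hypothetical feature (assumption, not from the deck): number of IQRs by which
  // distance d falls below the first quartile of the candidate distances.
  def iqrsBelowQ1(d: Double, candidates: Seq[Double]): Double = {
    val spread = iqr(candidates)
    if (spread == 0.0) 0.0 else (quantile(candidates, 0.25) - d) / spread
  }
}
```

With distances `Seq(0.1, 0.2, 0.3, 0.4, 0.5)`, replacing `0.5` by `50.0` keeps the quartiles at 0.2 and 0.4, so the IQR is unaffected, whereas the standard deviation grows by more than a factor of 100.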
Random Forest Model (R):
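library(caret)         # createDataPartition
library(randomForest)  # randomForest

# Train on 90% of the labeled rows; the remaining 10% is held out
trainIdx <- createDataPartition(dupesFoundByTopic$match, p = 0.9, list = FALSE, times = 1)
train <- dupesFoundByTopic[trainIdx, ]
# Classify match / non-match from the topic distance and the IQR feature
fit <- randomForest(as.factor(match) ~ distance + iqrs, data = train)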
Combining Distance and IQR
Feature     Mean Decrease Gini
distance    498
IQR         57

            Reference
Pred.       FALSE   TRUE
FALSE       204     2
TRUE        4       32
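The usual summary metrics follow directly from the confusion matrix. The sketch below assumes that rows are predictions and columns are reference labels, and that TRUE (a duplicate) is the positive class; the object name `HoldoutMetrics` is hypothetical.

```scala
object HoldoutMetrics {
  // Counts from the confusion matrix (rows = predicted, columns = reference).
  val trueNeg = 204  // predicted FALSE, reference FALSE
  val falseNeg = 2   // predicted FALSE, reference TRUE
  val falsePos = 4   // predicted TRUE,  reference FALSE
  val truePos = 32   // predicted TRUE,  reference TRUE

  val total: Int = trueNeg + falseNeg + falsePos + truePos

  // Share of all pairs classified correctly: 236 / 242.
  val accuracy: Double = (truePos + trueNeg).toDouble / total
  // Share of predicted duplicates that are real duplicates: 32 / 36.
  val precision: Double = truePos.toDouble / (truePos + falsePos)
  // Share of real duplicates that the model found: 32 / 34.
  val recall: Double = truePos.toDouble / (truePos + falseNeg)
}
```

Under those assumptions the model is right on about 97.5% of the 242 pairs, with precision near 0.89 and recall near 0.94 on the duplicate class.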
Current Status
● Topic Models / Topic Distances seem useful
○ Esp. when part of a multi-signal model (e.g. images)
● Hybrid Spark and R approach
○ Moving to 100% Spark in future for speed
● Topic Models just sitting there, waiting for exploitation
○ “Programmatic” Marketing Efforts, &c.
Questions?
Brent Schneeman
Principal Data Scientist
HomeAway, Inc.
brent@homeaway.com
careers.homeaway.com
@schnee
← https://www.homeaway.com/vacation-rental/p3482065
