HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching
Arnab Nandi (Univ of Michigan), Phil Bernstein (Microsoft Research)
Scenario
  • Search over structured data: commerce, entertainment
  • Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse
  • [Diagram labels: users, query, results, search engine + data warehouse, several 3rd-party feeds, “Amazon.com”]
  • (Irrespective of recall)
  • Minimal human involvement
Schema Matching
3rd Party Movie Site (Foreign):
  <Movie>
    <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
    <Release Key="Yes">2008</Release>
    <Description>Ever…</Description>
    <RunTime>127</RunTime>
    <Categories>
      <Category>Action</Category>
      <Category>Comedy</Category>
    </Categories>
    <MPAA>PG-13</MPAA>
    <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
    <Persons>
      <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
    </Persons>
  </Movie>
Warehouse: Movies (Host):
  <MOVIE>
    <MOVIE_ID>57590</MOVIE_ID>
    <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
    <RUNTIME>02:00</RUNTIME>
    <GENRE1>Action/Adventure</GENRE1>
    <GENRE2/>
    <RATING>NR</RATING>
    <ADVISORY/>
    <URL>http://www.indianajones.com/</URL>
    <ACTOR1>Harrison Ford</ACTOR1>
    <ACTOR2>Karen Allen</ACTOR2>
  </MOVIE>
Taxonomy Matching
  • Same two records as on the Schema Matching slide: 3rd Party Movie Site (Foreign) and Warehouse: Movies (Host)
Various Problems
  • Badly normalized…
  • Unit conversion…
  • In-band signaling…
  • Arbitrary labels
  • Zero documentation
  • Not enough instances
  • Formatting choices…
  • Non-standard vocabulary / language
Unlike conventional matching…
  • We have web search click data, for both the warehouse and the 3rd-party website
  • The databases we are integrating (usually) have a presence on the web
  • Why not use click data as a feature for schema & taxonomy matching?
  • [Diagram labels: users, query, results, search engine + data warehouse, 3rd-party feed]
Outline
  • Scenario
  • Using Clicklogs
  • Core idea
  • Using Query Distributions
  • Example
  • System Architecture
  • Results
Core idea
  • “If two (sets of) products are searched for by similar queries, then they are similar”
  • [Diagram labels: Web Search, “small laptop”; Clicklog; Warehouse: hardware, Small Laptops, Pro. Laptops; Asus.com: eee; X, Y, Z; eee ::: small laptops]
Query Distributions
  • [Chart; axis label: click count]
Mapping to Taxonomy
  • Map URL to product, which belongs to the taxonomy
  • Example: http://www.amazon.com/dp/B001JTA59C maps to Shopping | Electronics | Netbooks
  • 3rd party DB (provided to us)
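The URL-to-taxonomy step can be illustrated with a minimal sketch. It assumes that the product id follows `/dp/` in the URL (as in the slide's example) and that the 3rd-party DB can be read as a lookup table; `PRODUCT_TAXONOMY` and `taxonomy_for_url` are hypothetical names, not part of the original system.

```python
import re
from typing import Optional, Tuple

# Hypothetical excerpt of the 3rd-party product DB: product id -> taxonomy path.
PRODUCT_TAXONOMY = {
    "B001JTA59C": ("Shopping", "Electronics", "Netbooks"),
}

# Assumed URL shape: the product id follows "/dp/", as in the slide's example.
_PRODUCT_ID = re.compile(r"/dp/([A-Z0-9]+)")


def taxonomy_for_url(url: str) -> Optional[Tuple[str, ...]]:
    """Return the taxonomy path for a product page URL, or None if unknown."""
    match = _PRODUCT_ID.search(url)
    if match is None:
        return None  # not a product page we know how to parse
    return PRODUCT_TAXONOMY.get(match.group(1))


path = taxonomy_for_url("http://www.amazon.com/dp/B001JTA59C")
print(" | ".join(path))
```

Pages whose URL cannot be parsed, or whose product is missing from the table, return `None` and can simply be skipped during aggregation.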
Aggregating Query Distributions
  • [Diagram labels: Warehouse: hardware, Small Laptops, Pro. Laptops; Asus.com: eee; eee ::: small laptops]
Aggregate URLs to categories
  • Aggregate queries for each URL to schema element / taxonomy term
  • Electronics | Electronics Features | Brands | Asus EEE: “netbook”, “laptop”, “cheap laptop”
  • Office Products | Office Machines | Netbooks: “netbook”
Generating Correspondences
  • Goal: to match two schema elements (or categories), determine whether they have the same distribution of queries searching for them
  • Process:
      • For each page (URL): identify its query distribution and its category / schema element
      • For each category / schema element C: aggregate over the pages in C to get C's query distribution
      • For each foreign category / schema element: find the host category / schema element with the most similar query distribution
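The process above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the clicklog is assumed to be a list of (query, URL, click count) records, the URLs are made up, and `overlap` is a simple stand-in similarity that any other metric can replace. The click counts are the ones from the deck's laptop example.

```python
from collections import Counter, defaultdict


def aggregate(clicks, url_to_category):
    """clicks: iterable of (query, url, click_count).
    Returns {category: {query: relative click frequency}}."""
    counts = defaultdict(Counter)
    for query, url, n in clicks:
        category = url_to_category.get(url)
        if category is not None:  # skip pages we cannot place in the taxonomy
            counts[category][query] += n
    return {
        cat: {q: n / sum(c.values()) for q, n in c.items()}
        for cat, c in counts.items()
    }


def overlap(dist_a, dist_b):
    """Stand-in similarity: frequency mass shared on identical queries."""
    return sum(min(p, dist_b[q]) for q, p in dist_a.items() if q in dist_b)


def match(foreign, host, similarity=overlap):
    """For each foreign category, pick the host category whose query
    distribution is most similar."""
    return {
        f_cat: max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        for f_cat, f_dist in foreign.items()
    }


# Hypothetical URLs; click counts taken from the laptop example.
host_clicks = [("laptop", "h/pro", 70), ("netbook", "h/pro", 5),
               ("laptop", "h/small", 25), ("netbook", "h/small", 20)]
host_map = {"h/pro": "Professional Laptops", "h/small": "Small Laptops"}
foreign_clicks = [("laptop", "f/eee", 5), ("netbook", "f/eee", 15),
                  ("cheap laptop", "f/eee", 5)]
foreign_map = {"f/eee": "eee"}

host = aggregate(host_clicks, host_map)
foreign = aggregate(foreign_clicks, foreign_map)
print(match(foreign, host))
```

Keeping the similarity function as a parameter makes it easy to compare metrics on the same aggregated distributions.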
Example: Taxonomy Matching
  • Warehouse: Professional Laptops: “laptop” 70/75, “netbook” 5/75
  • Warehouse: Small Laptops: “laptop” 25/45, “netbook” 20/45
  • eee: “laptop” 5/25, “netbook” 15/25, “cheap laptop” 5/25
Distribution Similarity Metric
  $\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$
Example: Taxonomy Matching
  • “small laptops” vs “eee”: laptop vs laptop, netbook vs netbook, laptop vs cheap laptop
  • MinFreq takes the smaller of the two frequencies, so laptop vs laptop contributes min(25/45, 5/25) = 5/25
  • 1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74
  • Warehouse: Professional Laptops (“laptop” 70/75, “netbook” 5/75): 0.31
  • Warehouse: Small Laptops (“laptop” 25/45, “netbook” 20/45): 0.74
  • eee: “laptop” 5/25, “netbook” 15/25, “cheap laptop” 5/25
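A minimal sketch of the metric, assuming that Jaccard is computed over the word sets of the two queries and that MinFreq is the smaller of the two relative click frequencies. Both readings are assumptions inferred from the worked numbers, and the function names are hypothetical. Under them, the Small Laptops score reproduces the 0.74 shown on the slide. The Professional Laptops score comes out near 0.37 rather than 0.31, so the slide may use slightly different details, but the ranking is the same.

```python
def jaccard(query_a: str, query_b: str) -> float:
    """Word-level Jaccard similarity between two queries (assumed tokenization)."""
    a, b = set(query_a.lower().split()), set(query_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity(host: dict, foreign: dict) -> float:
    """Sum of Jaccard x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        jaccard(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )


# Query distributions from the example slide.
small = {"laptop": 25 / 45, "netbook": 20 / 45}
professional = {"laptop": 70 / 75, "netbook": 5 / 75}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

print(round(similarity(small, eee), 2))         # 0.74
print(round(similarity(professional, eee), 2))  # 0.37
```

The partial credit for “laptop” vs “cheap laptop” (Jaccard 0.5) is what lets near-miss queries contribute to the score instead of counting only exact matches.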
Advantages of Clicklogs
  • Resilient to language
  • Resilient to new domains, data, and features
  • As long as people query & click, we have data to learn from
  • Generates mappings previous methods can’t:
      • Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
      • Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools
System Design
Experimenting with Click Logs
  • Commercial warehouse mapping: 258 products from a 70,000-term Amazon.com taxonomy (613 in gold) to a 6,000-term warehouse taxonomy (40 in gold)
  • Live.com (now Bing.com) search query log
  • Amazon-to-warehouse mapping task, consecutively halving the clicklog size used
  • 1.8 million clicks to Amazon.com product pages
  • Each product's query distribution averaged 13 unique (i.e., different) search queries (min 1, max 181, stdev 22)
Summary of Results
  • 90% precision / recall possible
  • Query distribution is a good similarity metric
  • Bigger clicklogs imply better recall
  • Technique isn't very sensitive to similarity metric
Precision / Recall
  • Commercial warehouse mapping: 258 products from a 70K-term Amazon.com taxonomy to a 6,000-term warehouse taxonomy (613 categories used)
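For reference, precision and recall of a set of proposed correspondences against a gold mapping could be computed as in the sketch below. It assumes one host category per foreign category. The tiny gold and predicted mappings are made up for illustration and are not the experiment's data.

```python
def precision_recall(predicted: dict, gold: dict):
    """predicted, gold: {foreign category: host category}.
    Precision = correct / proposed, recall = correct / gold size."""
    correct = sum(1 for f, h in predicted.items() if gold.get(f) == h)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and matcher output.
gold = {
    "Texas Instruments": "Calculators",
    "Visual Basic": "Developer Tools",
    "eee": "Small Laptops",
    "Netbooks": "Small Laptops",
}
predicted = {
    "Texas Instruments": "Calculators",
    "Visual Basic": "Developer Tools",
    "eee": "Professional Laptops",  # wrong match
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

A matcher that stays silent on categories with too few clicks keeps precision high at the cost of recall, which is the trade-off the scenario asks for.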
Match Quality
  • Query distributions (QDs) of similar aggregates are similar
  • QDs are unique to entities
  • QDs are unique to aggregate classes
Varying Clicklog Size
  • Successively decreased clicklog size by half
  • Recall decreases as clicklog size is decreased
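One way to set up such a halving experiment is sketched below. The deck does not say how the clicklog was reduced, so uniform random sampling with a fixed seed is an assumption here, and `halvings` is a hypothetical helper.

```python
import random


def halvings(clicks, rounds, seed=0):
    """Yield successively halved random samples of the clicklog."""
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    sample = list(clicks)
    for _ in range(rounds):
        sample = rng.sample(sample, len(sample) // 2)
        yield sample


# Stand-in clicklog of 16 records; real records would be (query, url, count).
sizes = [len(s) for s in halvings(range(16), rounds=3)]
print(sizes)  # [8, 4, 2]
```

Each sample is drawn from the previous one, so every smaller clicklog is a subset of the larger ones, which keeps the recall figures comparable across sizes.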
Comparing Query Distributions
  $\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$
  • Minimal difference due to size of most queries
Related + Future Work
  • “Mixed” methods
  • Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy
  • Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy
  • Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm
Conclusion
  • Unsupervised mapping is possible
  • Very high recall / precision when enough queries are present
  • Click logs are promising
  • Finds results that other methods cannot find
  • As clicklog size increases, it will produce more mappings
  • Combinable with existing methods
Existing Methods
  • A Survey of Approaches to Automatic Schema Matching (VLDBJ 2001). Erhard Rahm, Philip A. Bernstein
Name-based & Instance-based
  • Not ideal for our use case
  • Need high precision
  • “Task B”: commercial warehouse mapping, 258 products in a 70K-term taxonomy to a 6,000-term taxonomy