HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi (Univ of Michigan), Phil Bernstein (Microsoft Research)

Scenario
Search over structured data: commerce, entertainment
Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.

Scenario
[Diagram: several 3rd party feeds (e.g. “Amazon.com”) flow into the search engine + data warehouse; users send queries and receive results]

Minimal Human Involvement

Schema Matching
3rd Party Movie Site (Foreign) | Warehouse: Movies (Host)

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

Taxonomy Matching
3rd Party Movie Site (Foreign) | Warehouse: Movies (Host)
(Same pair of movie records as on the Schema Matching slide.)

Various Problems
Badly normalized…
Unit conversion…
In-band signaling…
Arbitrary labels
Zero documentation
Not enough instances
Formatting choices…
Non-standard vocabulary / language

Unlike conventional matching…
We have web search click data, for both the warehouse and the 3rd party website
The databases we are integrating (usually) have a presence on the web
Why not use click data as a feature for schema & taxonomy matching?
[Diagram: users send queries to the search engine + data warehouse, which is fed by a 3rd party feed, and receive results]

Outline
Scenario
Using Clicklogs
Core idea
Using Query Distributions
Example
System Architecture
Results

Core idea
“If two (sets of) products are searched for by similar queries, then they are similar”
[Diagram: web search for “small laptop”]

Core idea
[Diagram: the Warehouse taxonomy (Small Laptops, Pro. Laptops) and the Asus.com taxonomy (hardware, eee) are linked through “small laptop” queries in the clicklog; resulting correspondence: eee ::: small laptops]

Query Distributions
[Chart: click count per query]

Mapping to Taxonomy
Map URL to product, which belongs to taxonomy
Example: http://www.amazon.com/dp/B001JTA59C maps to Shopping | Electronics | Netbooks
3rd party DB (provided to us)

Aggregating Query Distributions
[Diagram: Warehouse taxonomy (Small Laptops, Pro. Laptops) and Asus.com taxonomy (hardware, eee); resulting correspondence: eee ::: small laptops]

Aggregate URLs to categories
Aggregate queries for each URL to schema element / taxonomy term
Electronics | Electronics Features | Brands | Asus EEE: “netbook”, “laptop”, “cheap laptop”
Office Products | Office Machines | Netbooks: “netbook”
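The aggregation step can be sketched in a few lines of Python. This is only an illustration under assumptions: the clicklog row layout `(query, url, click_count)`, the example.com URLs, the click counts, and the function name `aggregate_by_category` are all hypothetical and not taken from the slides.

```python
from collections import Counter, defaultdict

def aggregate_by_category(clicks, url_to_category):
    """Sum click counts per (category, query), then normalize to frequencies.

    clicks: iterable of (query, url, click_count) rows (assumed layout).
    url_to_category: dict mapping a URL to its taxonomy term.
    """
    counts = defaultdict(Counter)
    for query, url, n in clicks:
        category = url_to_category.get(url)
        if category is None:
            continue  # URL is not mapped to any taxonomy term
        counts[category][query] += n
    distributions = {}
    for category, query_counts in counts.items():
        total = sum(query_counts.values())
        distributions[category] = {q: c / total for q, c in query_counts.items()}
    return distributions

# Hypothetical clicklog rows and URL-to-category mapping (illustrative only).
ASUS = "Electronics | Electronics Features | Brands | Asus EEE"
NETBOOKS = "Office Products | Office Machines | Netbooks"
clicks = [
    ("netbook", "http://example.com/p/1", 15),
    ("laptop", "http://example.com/p/1", 5),
    ("cheap laptop", "http://example.com/p/2", 5),
    ("netbook", "http://example.com/p/3", 4),
]
url_to_category = {
    "http://example.com/p/1": ASUS,
    "http://example.com/p/2": ASUS,
    "http://example.com/p/3": NETBOOKS,
}
distributions = aggregate_by_category(clicks, url_to_category)
```

Normalizing to relative frequencies is a design choice of this sketch; it makes categories with very different click volumes comparable.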

Generating Correspondences
Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
Process
For each page (URL)
  Identify query distribution
  Identify category / schema element of that page
For each category / schema element C
  Aggregate over pages in C to get query distribution
For each foreign category / schema element
  Find host category / schema element with most similar query distribution
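The last step of the process, picking the most similar host category for each foreign category, might look like the sketch below. It assumes the per-category query distributions have already been aggregated. The `overlap` function is a deliberately simple stand-in similarity (shared frequency mass over identical queries), not the metric used in the talk, and `generate_correspondences` is a hypothetical name. The sample distributions are the ones from the taxonomy matching example in this deck.

```python
def overlap(p, q):
    """Stand-in similarity: frequency mass shared by identical queries."""
    return sum(min(p[k], q[k]) for k in p.keys() & q.keys())

def generate_correspondences(foreign, host, similarity=overlap):
    """Map each foreign category to (best host category, similarity score)."""
    matches = {}
    for f_cat, f_dist in foreign.items():
        # Choose the host category whose query distribution is most similar.
        best = max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        matches[f_cat] = (best, similarity(host[best], f_dist))
    return matches

host = {
    "Professional Laptops": {"laptop": 70 / 75, "netbook": 5 / 75},
    "Small Laptops": {"laptop": 25 / 45, "netbook": 20 / 45},
}
foreign = {
    "eee": {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25},
}
matches = generate_correspondences(foreign, host)
print(matches["eee"][0])
```

Passing the similarity as a parameter keeps the matching loop independent of the metric, so a different distribution similarity can be swapped in without touching the loop.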

Example: Taxonomy Matching
Which host category matches the foreign category eee?
Warehouse: Professional Laptops
Warehouse: Small Laptops

Example: Taxonomy Matching
Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25

Distribution Similarity Metric
Similarity(host, foreign) = Σ over all (q_host, q_foreign) combinations of Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign)

Example: Taxonomy Matching
“small laptops” vs “eee”:
  laptop vs laptop: 1 × (5/25)
  netbook vs netbook: 1 × (20/45)
  laptop vs cheap laptop: 0.5 × (5/25)
  Total: 0.74
Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75 (similarity to eee: 0.31)
Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45 (similarity to eee: 0.74)
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
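A minimal Python sketch of the metric follows, under two assumptions that the slides do not spell out: Jaccard is computed over the words of the two query strings (which gives 0.5 for “laptop” vs “cheap laptop”), and MinFreq is the smaller of the two relative frequencies. The function names are hypothetical. With these assumptions the sketch reproduces the 0.74 score for Small Laptops. For Professional Laptops it yields about 0.37 rather than the 0.31 shown on the slide, so the original computation may differ in a detail not visible here.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two query strings (assumed)."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def distribution_similarity(host, foreign):
    """Sum Jaccard x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        jaccard(q_host, q_foreign) * min(f_host, f_foreign)
        for q_host, f_host in host.items()
        for q_foreign, f_foreign in foreign.items()
    )

professional = {"laptop": 70 / 75, "netbook": 5 / 75}
small = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

score_small = distribution_similarity(small, eee)
score_professional = distribution_similarity(professional, eee)
print(round(score_small, 2))
```

Either way, Small Laptops scores higher than Professional Laptops, so eee is matched to Small Laptops.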

Advantages of Clicklogs
Resilient to language
Resilient to new domains, data, and features
  As long as people query & click, we have data to learn from
Generates mappings previous methods can’t
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools

System Design

Experimenting with Click Logs
Commercial warehouse mapping: 258 products from a 70,000 term Amazon.com taxonomy (613 in gold) to a 6,000 term warehouse taxonomy (40 in gold)
Live.com (now Bing.com) search querylog
  1.8 million clicks to Amazon.com product pages
  Each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22)
Amazon to warehouse mapping task, consecutively halving the clicklog size used

Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric

Precision / Recall
[Plot: precision vs. recall]
Commercial warehouse mapping, 258 products from a 70,000 term Amazon.com taxonomy to a 6,000 term warehouse taxonomy (613 categories used)

Match Quality
Query distributions (QDs) of similar aggregates are similar
QDs are unique to entities
QDs are unique to aggregate classes

Varying Clicklog Size
Successively decreased clicklog size by half
Recall decreases as clicklog size is decreased

Comparing Query Distributions
Similarity(host, foreign) = Σ over all (q_host, q_foreign) combinations of Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign)
Minimal difference between similarity metrics, due to the size of most queries
