We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause clickthroughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.
3. Scenario
  Arnab Nandi & Phil Bernstein
  - Search over structured data: commerce, entertainment
  - Data onboarding: merge an XML data feed from a 3rd party to the Microsoft data warehouse
7. Schema Matching
  3rd Party Movie Site (Foreign) vs. Warehouse: Movies (Host)

  <Movie>
    <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
    <Release Key="Yes">2008</Release>
    <Description>Ever…</Description>
    <RunTime>127</RunTime>
    <Categories>
      <Category>Action</Category>
      <Category>Comedy</Category>
    </Categories>
    <MPAA>PG-13</MPAA>
    <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
    <Persons>
      <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
    </Persons>
  </Movie>

  <MOVIE>
    <MOVIE_ID>57590</MOVIE_ID>
    <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
    <RUNTIME>02:00</RUNTIME>
    <GENRE1>Action/Adventure</GENRE1>
    <GENRE2/>
    <RATING>NR</RATING>
    <ADVISORY/>
    <URL>http://www.indianajones.com/</URL>
    <ACTOR1>Harrison Ford</ACTOR1>
    <ACTOR2>Karen Allen</ACTOR2>
  </MOVIE>
8. Taxonomy Matching
  3rd Party Movie Site (Foreign) vs. Warehouse: Movies (Host)
  (Same pair of movie records as on slide 7, Schema Matching.)
9. Various Problems
  - Badly normalized…
  - Unit conversion…
  - In-band signaling…
  - Arbitrary labels
  - Zero documentation
  - Not enough instances
  - Formatting choices…
  - Non-standard vocabulary / language
10. Unlike conventional matching…
  - We have web search click data, for both the warehouse and the 3rd-party website
  - The databases we are integrating (usually) have a presence on the web
  - Why not use click data as a feature for schema & taxonomy matching?
  [Diagram labels: Users; query; results; Search engine + data warehouse; 3rd Party Feed]
11. Outline
  - Scenario
  - Using Clicklogs
  - Core idea
  - Using Query Distributions
  - Example
  - System Architecture
  - Results
12. Core idea
  “If two (sets of) products are searched for by similar queries, then they are similar”
  [Diagram labels: Web Search; “Small laptop”]
13. Core idea
  [Diagram labels: Warehouse; Asus.com; Clicklog; hardware; Small Laptops; Pro. Laptops; eee; X; Y; Z; “Small laptop” (three times); eee ::: small laptops]
15. Mapping to Taxonomy
  - Map URL to product, which belongs to taxonomy
  - http://www.amazon.com/dp/B001JTA59C → Shopping | Electronics | Netbooks
  - 3rd party DB (provided to us)
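The URL-to-taxonomy step could be sketched in Python roughly as follows. This is an illustration, not the authors' implementation: the `/dp/<id>` URL pattern is inferred only from the single example URL on the slide, and `PRODUCT_TAXONOMY` and `url_to_taxonomy` are hypothetical names standing in for the lookup against the 3rd-party database.

```python
import re

# Hypothetical stand-in for the 3rd-party product database (assumption):
# product id -> taxonomy path. The single entry is the example from the slide.
PRODUCT_TAXONOMY = {
    "B001JTA59C": ("Shopping", "Electronics", "Netbooks"),
}

# Assumed URL shape, based only on the example URL: .../dp/<product id>
_DP_PATTERN = re.compile(r"/dp/([A-Z0-9]+)")


def url_to_taxonomy(url, product_taxonomy=PRODUCT_TAXONOMY):
    """Return the taxonomy path for a product page URL, or None if unknown."""
    match = _DP_PATTERN.search(url)
    if match is None:
        return None  # not a recognizable product page
    return product_taxonomy.get(match.group(1))


path = url_to_taxonomy("http://www.amazon.com/dp/B001JTA59C")
print(" | ".join(path))  # Shopping | Electronics | Netbooks
```

URLs that do not match the pattern, or products missing from the database, simply yield `None` and would contribute no clicks to any category.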
16. Aggregating Query Distributions
  [Diagram labels: Warehouse; Asus.com; hardware; Small Laptops; Pro. Laptops; eee; eee ::: small laptops]
17. Aggregate URLs to categories
  Aggregate queries for each URL to schema element / taxonomy term
  - Electronics | Electronics Features | Brands | Asus EEE: “netbook”, “laptop”, “cheap laptop”
  - Office Products | Office Machines | Netbooks: “netbook”
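A minimal sketch of this aggregation, assuming the clicklog is available as (query, URL) pairs and that each URL has already been mapped to a category. The function names, the placeholder URLs `url_a`, `url_b`, `url_c`, and the click counts are invented for illustration; only the category names and query strings come from the slide.

```python
from collections import Counter, defaultdict


def aggregate_by_category(clicks, url_category):
    """Roll per-URL query clicks up to the category of each URL.

    clicks: iterable of (query, url) pairs, one per clickthrough.
    url_category: dict mapping url -> category / schema element.
    Returns a dict mapping category -> Counter of query click counts.
    """
    per_category = defaultdict(Counter)
    for query, url in clicks:
        category = url_category.get(url)
        if category is None:
            continue  # URL not mapped to any category: skip the click
        per_category[category][query] += 1
    return dict(per_category)


def normalize(counts):
    """Turn raw click counts into a query distribution (frequencies sum to 1)."""
    total = sum(counts.values())
    return {query: n / total for query, n in counts.items()}


ASUS = "Electronics | Electronics Features | Brands | Asus EEE"
NETBOOKS = "Office Products | Office Machines | Netbooks"

# Made-up clicks and placeholder URLs, for illustration only.
clicks = [
    ("netbook", "url_a"),
    ("laptop", "url_a"),
    ("cheap laptop", "url_b"),
    ("netbook", "url_c"),
]
url_category = {"url_a": ASUS, "url_b": ASUS, "url_c": NETBOOKS}

aggregated = aggregate_by_category(clicks, url_category)
for category, counts in aggregated.items():
    print(category, normalize(counts))
```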
18. Generating Correspondences
  Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
  Process:
  - For each page (URL)
    - Identify query distribution
    - Identify category / schema element of that page
  - For each category / schema element C
    - Aggregate over pages in C to get query distribution
  - For each foreign category / schema element
    - Find host category / schema element with most similar query distribution
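The last step of the process above (pick, for each foreign element, the host element with the most similar query distribution) might look like the sketch below. The similarity function is pluggable; `histogram_overlap` is a deliberately simple stand-in chosen for this illustration, not the metric used in the talk, and all function names are hypothetical. The distributions are the ones from the deck's own taxonomy-matching example.

```python
def histogram_overlap(dist_a, dist_b):
    """Stand-in similarity: frequency mass shared by identical query strings."""
    return sum(min(freq, dist_b[q]) for q, freq in dist_a.items() if q in dist_b)


def generate_correspondences(foreign, host, similarity=histogram_overlap):
    """Match each foreign category to its most similar host category.

    foreign, host: dicts mapping category -> {query: frequency}.
    Returns a dict mapping foreign category -> (best host category, score).
    """
    matches = {}
    for f_cat, f_dist in foreign.items():
        best = max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        matches[f_cat] = (best, similarity(host[best], f_dist))
    return matches


host = {
    "Professional Laptops": {"laptop": 70 / 75, "netbook": 5 / 75},
    "Small Laptops": {"laptop": 25 / 45, "netbook": 20 / 45},
}
foreign = {
    "eee": {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25},
}

matches = generate_correspondences(foreign, host)
print(matches["eee"][0])  # Small Laptops
```

Keeping the similarity as a parameter makes it easy to swap in other metrics and compare them on the same distributions.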
20. Example: Taxonomy Matching
  [Diagram labels: Warehouse: Professional Laptops; Warehouse: Small Laptops; eee]
22. Distribution Similarity Metric
  \[ \sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign}) \]
  where the sum runs over all (q_host, q_foreign) combinations.
23. Example: Taxonomy Matching
  Query distributions:
  - Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75 (similarity to eee: 0.31)
  - Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45 (similarity to eee: 0.74)
  - eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
  “small laptops” vs “eee”:
  - laptop vs laptop: 1 × (5/25)
  - netbook vs netbook: 1 × (20/45)
  - laptop vs cheap laptop: 0.5 × (5/25)
  - Total: 1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) ≈ 0.74
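The arithmetic of this example can be checked with a short Python sketch. It rests on two assumptions that the deck does not spell out: Jaccard is computed over the sets of words in each query (which gives 0.5 for “laptop” vs “cheap laptop”), and MinFreq is the smaller of the two queries' frequencies. The function names are illustrative.

```python
def jaccard(query_a, query_b):
    """Word-level Jaccard similarity of two non-empty queries (assumption)."""
    words_a, words_b = set(query_a.split()), set(query_b.split())
    return len(words_a & words_b) / len(words_a | words_b)


def distribution_similarity(host_dist, foreign_dist):
    """Sum Jaccard x min-frequency over all (host query, foreign query) pairs."""
    return sum(
        jaccard(q_host, q_foreign) * min(f_host, f_foreign)
        for q_host, f_host in host_dist.items()
        for q_foreign, f_foreign in foreign_dist.items()
    )


small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
professional_laptops = {"laptop": 70 / 75, "netbook": 5 / 75}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

print(f"{distribution_similarity(small_laptops, eee):.2f}")         # 0.74
print(f"{distribution_similarity(professional_laptops, eee):.2f}")  # 0.37
```

Under these assumptions the sketch reproduces 0.74 for Small Laptops and ranks it above Professional Laptops, as on the slide. It gives about 0.37 for Professional Laptops where the slide reports 0.31, so the original computation presumably differs in some detail that is not visible in the deck.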
24. Advantages of Clicklogs
  - Resilient to language
  - Resilient to new domains, data, and features
  - As long as people query & click, we have data to learn from
  - Generates mappings previous methods can’t:
    - Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
    - Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools
27. Experimenting with Click Logs
  - Commercial warehouse mapping: 258 products from a 70,000-term Amazon.com taxonomy (613 in gold) to a 6,000-term warehouse taxonomy (40 in gold)
  - Live.com (now Bing.com) search query log
  - Amazon-to-warehouse mapping task, consecutively halving the clicklog size used
  - 1.8 million clicks to Amazon.com product pages
  - Each product typically had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22)
28. Summary of Results
  - 90% precision / recall possible
  - Query distribution is a good similarity metric
  - Bigger clicklogs imply better recall
  - Technique isn't very sensitive to similarity metric
29. Precision / Recall
  Commercial warehouse mapping: 258 products from a 70K-term Amazon.com taxonomy to a 6,000-term warehouse taxonomy (613 categories used)
32.
  - QDs (query distributions) of similar aggregates are similar
  - QDs are unique to entities
  - QDs are unique to aggregate classes
34. Varying Clicklog Size
  - Successively decreased clicklog size by half
  - Recall decreases as clicklog size is decreased
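One way to picture why recall drops is to track how many categories still have any click evidence as the log shrinks. The sketch below assumes the halving is done by uniform random sampling, which the deck does not state; the function names and the eight click records are made up for illustration.

```python
import random


def halve(clicks, rng):
    """Return a uniformly random half of the click records (assumed scheme)."""
    return rng.sample(clicks, len(clicks) // 2)


def coverage_under_halving(clicks, rounds, seed=0):
    """Report (log size, categories with at least one click) per halving round.

    clicks: list of (query, category) pairs.
    The first entry describes the full log; each later entry a halved log.
    """
    rng = random.Random(seed)
    current = list(clicks)
    report = []
    for _ in range(rounds + 1):
        covered = {category for _, category in current}
        report.append((len(current), len(covered)))
        current = halve(current, rng)
    return report


# Made-up click records, for illustration only.
clicks = [
    ("laptop", "Small Laptops"), ("netbook", "Small Laptops"),
    ("laptop", "Professional Laptops"), ("laptop", "Professional Laptops"),
    ("netbook", "Netbooks"), ("cheap laptop", "Netbooks"),
    ("calculator", "Calculators"), ("visual basic", "Developer Tools"),
]

report = coverage_under_halving(clicks, rounds=3)
for size, covered in report:
    print(size, covered)
```

A category that loses all of its clicks has no query distribution left and so cannot be matched at all, which is one plausible reading of the recall trend on this slide.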
37. Minimal difference due to size of most queries
  [Formula residue: Σ over all (q_host, q_foreign) combinations]
41. Related + Future Work
  “Mixed” methods:
  - Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy
  - Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy
  - Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm
42. Conclusion
  - Unsupervised mapping is possible: very high recall / precision when enough queries are present
  - Click logs are promising
  - Finds results that other methods cannot find
  - As clicklog size increases, it will produce more mappings
  - Combinable with existing methods
44. Existing Methods
  - A Survey of Approaches to Automatic Schema Matching (VLDBJ 2001). Erhard Rahm, Philip A. Bernstein
45. Name-based & Instance-based
  - Not ideal for our use case
  - Need high precision
  - “Task B”: Commercial warehouse mapping, 258 products in a 70K-term taxonomy to a 6,000-term taxonomy