Data Mining with Background Knowledge from the Web - Introducing the RapidMiner Linked Open Data Extension

  1. Data Mining with Background Knowledge from the Web: Introducing the RapidMiner Linked Open Data Extension
     Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, Christian Bizer
     08/20/14
  2. Motivation: An Example Data Mining Task
     • Analyzing book sales

       ISBN           City       Sold
       3-2347-3427-1  Darmstadt  124
       3-43784-324-2  Mannheim   493
       3-145-34587-0  Roßdorf    14

       ISBN           City       Population  ...  Genre   Publisher     ...  Sold
       3-2347-3427-1  Darmstadt  144402      ...  Crime   Bloody Books  ...  124
       3-43784-324-2  Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
       3-145-34587-0  Roßdorf    12019       ...  Travel  Up&Away       ...  14
       ...

     → Crime novels sell better in larger cities
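
The enrichment step on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how the extension works internally: the lookup dictionaries and the `enrich` helper are assumptions made for the example, and the values are the ones shown on the slide.

```python
# Sales records from the slide: (ISBN, city, copies sold).
sales = [
    ("3-2347-3427-1", "Darmstadt", 124),
    ("3-43784-324-2", "Mannheim", 493),
    ("3-145-34587-0", "Roßdorf", 14),
]

# Background knowledge keyed by entity (values taken from the slide).
# In practice this would be fetched from the web instead of typed in.
city_population = {"Darmstadt": 144402, "Mannheim": 291458, "Roßdorf": 12019}
book_genre = {
    "3-2347-3427-1": "Crime",
    "3-43784-324-2": "Crime",
    "3-145-34587-0": "Travel",
}


def enrich(rows):
    """Attach population (via city) and genre (via ISBN) to each sales row."""
    return [
        {
            "isbn": isbn,
            "city": city,
            "population": city_population[city],
            "genre": book_genre[isbn],
            "sold": sold,
        }
        for isbn, city, sold in rows
    ]


enriched = enrich(sales)

# Crime novels ordered by city size, to eyeball the pattern from the slide.
crime = sorted(
    (row for row in enriched if row["genre"] == "Crime"),
    key=lambda row: row["population"],
)
for row in crime:
    print(row["city"], row["population"], row["sold"])
```

Only after the join can a learner relate sales to city size or genre; the original table has no attribute that would support such a pattern.
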
  3. Motivation
     • Many data mining problems are solved better when you have more background knowledge (leaving scalability aside)
     • Problems:
       – Tedious work
       – Selection bias: what to include?
  4. Linked Open Data in a Nutshell
     • Started in 2007
     • A collection of ~1,000 open datasets
       – from various domains, e.g., general knowledge, government data, …
       – using semantic web standards (HTTP, RDF, SPARQL, …)
     • Machine processable
     • Free of charge
     • Sophisticated tool stacks
  5. Linked Open Data in a Nutshell: http://lod-cloud.net/
  6. Example: DBpedia
  7. The RapidMiner LOD Extension
  8. The RapidMiner LOD Extension
     • Automatic discovery of links to Linked Open Data
       – for local data objects
       – e.g., the database entry Boston is linked to http://dbpedia.org/resource/Boston
     • Automatic generation of attributes
       – e.g., add all numeric values found for Boston (and other cities)
     • Plus
       – Feature selection algorithms optimized for LOD
       – Automatic following of links to other datasets
       – Schema matching (coming soon)
     • No need to know Semantic Web technologies!
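
To make the two automatic steps more concrete, here is a minimal sketch of what linking and attribute generation could look like when done by hand. It assumes a naive pattern-based linking strategy (label appended to the DBpedia resource prefix) and a generic SPARQL query for numeric values; the function names are hypothetical, and the slides do not say that the extension works this way internally. The query is only built as a string, not sent to an endpoint.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def link_by_pattern(label: str) -> str:
    """Guess a DBpedia URI from a local label (assumed strategy).

    Spaces become underscores, as in DBpedia resource names; ambiguous
    labels would need a smarter linker than this.
    """
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def numeric_values_query(uri: str) -> str:
    """Build a SPARQL query for all numeric literals attached to an entity."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{uri}> ?property ?value . "
        "FILTER(isNumeric(?value)) }"
    )


boston = link_by_pattern("Boston")
query = numeric_values_query(boston)
print(boston)
print(query)
```

Each property returned by such a query would become one new numeric attribute in the local dataset, which is the part the extension automates.
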
  10. Example: the Auto MPG Dataset
     • A well-known UCI dataset
       – Goal: predict fuel consumption of cars
     • Hypothesis: background knowledge → more accurate predictions
     • Used background knowledge:
       – Entity types and categories from DBpedia (=Wikipedia)
     • Result: M5Rules down to almost half the prediction error
       – i.e., on average, we are wrong by 1.6 instead of 2.9 MPG
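
The "almost half" claim can be checked with a short calculation on the two error values from the slide. The variable names are chosen for this example only, and the slide does not state which error measure was used.

```python
# Average prediction errors reported on the slide, in MPG.
baseline_error = 2.9   # original attributes only
enriched_error = 1.6   # with DBpedia types and categories added

# Fraction of the original error that remains, and the relative reduction.
remaining = enriched_error / baseline_error
reduction = 1 - remaining

print(f"remaining error: {remaining:.0%}, reduction: {reduction:.0%}")
```

About 55% of the original error remains, so the error drops by roughly 45%, which is consistent with "almost half".
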
  11. Example: the Auto MPG Dataset
     • The original attributes are
       – cylinders, displacement, horsepower, weight, acceleration, model year, origin
       – plus name (unique string) and mpg (target)
     • Models built are, e.g.,
       – high horsepower/weight → high consumption
     • Additional attributes lead to further insights, e.g.
       – front-wheel drives have a lower consumption than rear-wheel drives
       – hatchbacks have a lower consumption than station wagons
       – rally cars generally have a low consumption
  12. Example: Analyzing Statistics
     • As shown, e.g., at ESWC 2012, SemStats 2013
     • Statistics found on the web often contain only a few attributes
       – extreme case: only entity + target
     • Examples:
       – Quality of living in cities (pictured on the slide)
       – Corruption by country
       – Fertility rate by country
       – Suicide rate by country
       – Box office revenue of films
       – ...
  13. Example: Analyzing Statistics
     • Process in RapidMiner:
       – load statistic
       – link entities (cities, countries, etc.) to LOD cloud
       – collect additional attributes
       – analyze for correlations with target attribute of statistic
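
The last step of this process, ranking collected attributes by their correlation with the target, can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the RapidMiner operator chain: `pearson` and `rank_attributes` are hypothetical helpers, and the only data used are the population and sales figures from the book example on slide 2, so the three data points serve to show the mechanics rather than to support any conclusion.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_attributes(attributes, target):
    """Sort attributes by absolute correlation with the target, strongest first."""
    scored = [(name, pearson(values, target)) for name, values in attributes.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Attribute collected from background knowledge (city population, slide 2)
# and the target of the statistic (copies sold, slide 2).
attributes = {"population": [144402, 291458, 12019]}
sold = [124, 493, 14]

ranking = rank_attributes(attributes, sold)
for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

With hundreds of generated attributes, the same ranking would surface the few that are worth a closer look, subject to the usual caution that correlation does not imply causation.
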
  14. Example: Analyzing Statistics
     • Quality of living in cities worldwide: indicators for low quality
       – too hot (highest temperature in June exceeds 27°C)
       – too cold (highest temperature in January below 16°C)
       – too big (total area exceeds 334km²)
       – poor cultural life (no music recordings made in this city)
       – or simply: wrong place on the map (latitude<24, longitude<47)
     • All of these attributes come from LOD!
  15. Example: Analyzing Statistics
     • Corruption Perception Index (CPI) by Transparency International
     • Indicators for low corruption:
       – high HDI (human development index)
       – large number of companies
       – large number of NGOs
       – small number of cargo airlines?!
     • Burnout rates in German DAX companies
       – Positive correlation between turnover and burnout rates
       – Car manufacturers are less prone to burnout
       – Local companies are less prone to burnout than international ones
         • Exception: Frankfurt
  16. Example: Analyzing Statistics
     • Sexual activity (based on Durex survey 2005-2009)
       – Higher in French speaking than in English speaking countries
       – High GDP per capita → low activity
       – High unemployment rate → high activity
       – High number of ISPs → low activity
     http://xkcd.com/552/
  17. Further Usage Examples
     • Classification of Twitter messages (SMILE, 2013)
       – given a target, e.g., messages related to car traffic
       – annotate message, extract abstract features for concepts
       – e.g. “I-90” → highway
     • Prediction of user location for Twitter (ICWSM, 2013)
       – useful, e.g., for market research
       – combination with sentiment analysis: public opinion maps
     • Identifying disputed topics in the news (LD4KD, 2014)
       – on a corpus of different online newspapers
       – identified, e.g., competing opinions on drug legislation and gay marriage
     • Debugging Linked Open Data as such
       – e.g., identifying wrong links and axioms
       – combination with outlier detection
  18. Conclusions
     • Many data mining tasks are better solved with more background knowledge
       – better predictive models
       – more insights from additional attributes
     • A lot of such knowledge exists as Linked Open Data
     • The Linked Open Data extension grants easy access to that data
       – from within RapidMiner
       – without the need to know anything about RDF, SPARQL, etc.
     • Try it out!
       – find “Linked Open Data” on the marketplace
       – Google Group: https://groups.google.com/forum/#!forum/rmlod