Data Mining with Background Knowledge from the Web - Introducing the RapidMiner Linked Open Data Extension

Heiko Paulheim, Petar Ristoski, Evgeny Mitichkin, Christian Bizer
08/20/14

Motivation: An Example Data Mining Task
• Analyzing book sales
ISBN           City       Sold
3-2347-3427-1  Darmstadt  124
3-43784-324-2  Mannheim   493
3-145-34587-0  Roßdorf    14

ISBN           City       Population  ...  Genre   Publisher     ...  Sold
3-2347-3427-1  Darmstadt  144402      ...  Crime   Bloody Books  ...  124
3-43784-324-2  Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
3-145-34587-0  Roßdorf    12019       ...  Travel  Up&Away       ...  14
→ Crime novels sell better in larger cities
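
The enrichment shown in the two tables can be pictured as two lookups, one by city and one by ISBN. The following is a minimal Python sketch using the rows from the slide; the dictionary names and the `enrich` helper are illustrative assumptions and not part of any tool discussed here.

```python
# Sales records as on the slide: (ISBN, city, copies sold).
sales = [
    ("3-2347-3427-1", "Darmstadt", 124),
    ("3-43784-324-2", "Mannheim", 493),
    ("3-145-34587-0", "Roßdorf", 14),
]

# Background knowledge keyed by city and by ISBN (values taken from the slide).
city_population = {"Darmstadt": 144402, "Mannheim": 291458, "Roßdorf": 12019}
book_genre = {
    "3-2347-3427-1": "Crime",
    "3-43784-324-2": "Crime",
    "3-145-34587-0": "Travel",
}


def enrich(rows):
    """Attach population and genre to each sales row (hypothetical helper)."""
    return [
        {
            "isbn": isbn,
            "city": city,
            "population": city_population.get(city),
            "genre": book_genre.get(isbn),
            "sold": sold,
        }
        for isbn, city, sold in rows
    ]


enriched = enrich(sales)

# Crime novels ordered by city size, to inspect how sales relate to population.
crime = sorted(
    (row for row in enriched if row["genre"] == "Crime"),
    key=lambda row: row["population"],
)
```

With only three rows this proves nothing statistically; it only shows which columns have to be joined before a pattern such as the one above can be found at all.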

Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems with adding it manually:
– Tedious work
– Selection bias: what to include?

Linked Open Data in a Nutshell
• Started in 2007
• A collection of ~1,000 open datasets
– from various domains, e.g., general knowledge, government data, …
– using semantic web standards (HTTP, RDF, SPARQL,…)
• Machine processable
• Free of charge
• Sophisticated tool stacks

Linked Open Data in a Nutshell
http://lod-cloud.net/

Example: DBpedia


The RapidMiner LOD Extension
• Automatic discovery of links to Linked Open Data
– for local data objects
– e.g., the database entry Boston is linked to
http://dbpedia.org/resource/Boston
• Automatic generation of attributes
– e.g., add all numeric values found for Boston (and other cities)
• Plus
– Feature selection algorithms optimized for LOD
– Automatic following of links to other datasets
– Schema matching (coming soon)
• No need to know Semantic Web technologies!
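
To give a rough idea of what linking and attribute generation involve under the hood, here is a minimal Python sketch. It is not the extension's implementation: the naive label-to-URI heuristic and the helper names `label_to_dbpedia_uri` and `numeric_values_query` are assumptions made for illustration. The query is only built as a string; it could then be sent to a SPARQL endpoint.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def label_to_dbpedia_uri(label):
    """Guess a DBpedia URI from a label (naive heuristic: spaces become underscores)."""
    return DBPEDIA_RESOURCE + quote(label.strip().replace(" ", "_"))


def numeric_values_query(uri):
    """Build a SPARQL query that selects all numeric values attached to the entity."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{uri}> ?property ?value . "
        "FILTER(isNumeric(?value)) }"
    )


uri = label_to_dbpedia_uri("Boston")
query = numeric_values_query(uri)
```

A pattern-based guess like this fails for ambiguous labels, which is one reason why automatic link discovery is worth having as a dedicated step.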

Example: the Auto MPG Dataset
• A well-known UCI dataset
– Goal: predict fuel consumption of cars
• Hypothesis: background knowledge → more accurate predictions
• Used background knowledge:
– Entity types and categories from DBpedia (=Wikipedia)

• Result: with M5Rules, the prediction error drops to almost half
– i.e., on average, we are wrong by 1.6 instead of 2.9 MPG

• The original attributes are
– cylinders, displacement, horsepower, weight, acceleration, model year, origin
– plus name (unique string) and mpg (target)
• Models built are, e.g.,
– high horsepower/weight → high consumption
• Additional attributes lead to further insights, e.g.
– front-wheel drives have a lower consumption than rear-wheel drives
– hatchbacks have a lower consumption than station wagons
– rally cars generally have a low consumption
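
Entity types and categories become usable by a learner once they are encoded as attributes. One common encoding, sketched below in Python, is one binary column per type. The car names and type labels are made up for illustration, and `to_binary_features` is a hypothetical helper, not an operator of the extension.

```python
# Hypothetical type/category sets per car; the labels are made up for illustration.
car_types = {
    "car_a": {"Hatchback", "Front-wheel-drive vehicle"},
    "car_b": {"Station wagon", "Rear-wheel-drive vehicle"},
    "car_c": {"Hatchback", "Rear-wheel-drive vehicle"},
}


def to_binary_features(entity_types):
    """Turn each distinct type into a 0/1 column, with one row per entity."""
    columns = sorted(set().union(*entity_types.values()))
    rows = {
        name: [1 if column in types else 0 for column in columns]
        for name, types in entity_types.items()
    }
    return columns, rows


columns, rows = to_binary_features(car_types)
```

A rule learner can then split on such columns, which is how statements like "hatchbacks have a lower consumption than station wagons" can surface in a model.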

Example: Analyzing Statistics
• As shown, e.g., at ESWC 2012, SemStats 2013
• Statistics found on the web often contain only a few attributes
– extreme case: only entity + target
• Examples:
– Quality of living in cities (right)
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...

• Process in RapidMiner:
– load statistic
– link entities (cities, countries, etc.) to LOD cloud
– collect additional attributes
– analyze for correlations with target attribute of statistic
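
The last step of this process can be sketched in plain Python as ranking the collected attributes by their Pearson correlation with the target. The slides do not say which correlation measure is used, so Pearson is an assumption here, and the attribute names and numbers below are toy values, not data from any of the statistics.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no signal; report 0.0 by convention.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_attributes(attributes, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scores = {name: pearson(values, target) for name, values in attributes.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy numbers for illustration only.
target = [1.0, 2.0, 3.0, 4.0]
attributes = {
    "attr_a": [2.0, 4.0, 6.0, 8.0],  # rises with the target
    "attr_b": [4.0, 3.0, 2.0, 1.0],  # falls as the target rises
    "attr_c": [1.0, 3.0, 2.0, 3.0],  # weaker relationship
}
ranking = rank_attributes(attributes, target)
```

Ranking by absolute value keeps strongly negative correlations near the top, since an attribute that falls as the target rises is as informative as one that rises with it.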

• Quality of living in cities worldwide: indicators for low quality
– too hot (highest temperature in June exceeds 27°C)
– too cold (highest temperature in January below 16°C)
– too big (total area exceeds 334km²)
– poor cultural life (no music recordings made in this city)
– or simply: wrong place on the map (latitude<24, longitude<47)
All those attributes come from LOD!

• Corruption Perception Index (CPI) by Transparency International
• Indicators for low corruption:
– high HDI (human development index)
– large number of companies
– large number of NGOs
– small number of cargo airlines?!
• Burnout rates in German DAX companies
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– Local companies are less prone to burnout than international ones
• Exception: Frankfurt

• Sexual activity (based on Durex survey 2005-2009)
– Higher in French speaking than in English speaking countries
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity
http://xkcd.com/552/

Further Usage Examples
• Classification of Twitter messages (SMILE, 2013)
– given a target, e.g., messages related to car traffic
– annotate message, extract abstract features for concepts
– e.g. “I-90” → highway
• Prediction of user location for Twitter (ICWSM, 2013)
– useful, e.g., for market research
– combination with sentiment analysis: public opinion maps
• Identifying disputed topics in the news (LD4KD, 2014)
– on a corpus of different online newspapers
– identified, e.g., conflicting opinions on drug legislation and gay marriage
• Debugging Linked Open Data as such
– e.g., identifying wrong links and axioms
– combination with outlier detection

Conclusions
• Many data mining tasks are better solved
with more background knowledge
– better predictive models
– more insights from additional attributes
• A lot of such knowledge exists as Linked Open Data
• The Linked Open Data extension grants easy access to that data
– from within RapidMiner
– without the need to know anything about RDF, SPARQL, etc.
• Try it out!
– find “Linked Open Data” on the marketplace
– Google Group: https://groups.google.com/forum/#!forum/rmlod
