3. 09/30/15 Heiko Paulheim 3
Motivating Example
⢠Understanding population changes in the Netherlands
4. 09/30/15 Heiko Paulheim 4
Motivating Example
⢠Understanding population changes in the Netherlands
⢠What we can see in the data
â population changes by municipality are very diverse
â ranging from -12% to +53% over the last 15 years
⢠What we cannot see from the data
â How do growing regions differ from
shrinking ones?
â Which factors drive people's movements?
⢠As very often, we need more knowledge...
5. 09/30/15 Heiko Paulheim 5
Motivating Example
⢠Proposed approach:
â link data at hand to LOD Cloud
â harvest additional information about regions
â look for interesting patterns
link combine cleanse transform analyze
6. 09/30/15 Heiko Paulheim 6
RapidMiner Linked Open Data Extension
Introducing RapidMiner:
â
An open source platform for data mining and predictive analytics
â
Processes are designed by wiring operators in a GUI
(no programming)
â
Operators for data loading, transformation, modeling, visualization, âŚ
â
Scalable, distributed, parallel processing in a cloud environment
â
200,000 active users
â
Developers can write their own extensions
7. 09/30/15 Heiko Paulheim 7
RapidMiner Linked Open Data Extension
⢠The extension adds operators for
â accessing local and remote (Linked and non Linked) data
â linking local to remote data
â combining data from various sources
â automatically following links to other datasets
⢠Data analysts can use it without knowing RDF, SPARQL, etc.
8. 09/30/15 Heiko Paulheim 8
Example Use Case
⢠Understanding population changes in the Netherlands
⢠RapidMiner workflow:
â Import original table
â Link municipalities to DBpedia
⢠alternative: link provinces to Eurostat
â Build enriched table
â Analyze the results
9. 09/30/15 Heiko Paulheim 9
Example Findings
⢠Growing regions: Flevoland, Utrecht, North/South Holland
⢠Shrinking regions: Limburg, Groningen, Friesland
⢠Provincal capitals are growing
⢠Growth in regions with high population
⢠Growth in regions with high income
â but also: growth in regions with high unemployment
10. 09/30/15 Heiko Paulheim 10
Example Findings
⢠Negative correlation between growth and elevation?!
11. 09/30/15 Heiko Paulheim 11
Behind the Scenes: RapidMiner LOD Extension
⢠RapidMiner uses a tabular data model
12. 09/30/15 Heiko Paulheim 12
Behind the Scenes: RapidMiner LOD Extension
⢠Linking local data to LOD Sources
â based on URI patterns
â based on text search
â using specialized services (e.g., DBpedia Lookup)
⢠Following links
â e.g., automatically follow all owl:sameAs links to other datasets
to a certain depth
⢠Harvesting attributes
â e.g., add all numeric attributes found
â built-in support for aggregations
13. 09/30/15 Heiko Paulheim 13
Behind the Scenes: RapidMiner LOD Extension
⢠Matching and fusion
â e.g., many sources contain âpopulationâ
as an attribute
â automatic identification of similar attributes
â automatic fusion using different policies
⢠Attribute set filtering
â exploiting schema information
â more effective in finding redundant attributes
15. 09/30/15 Heiko Paulheim 15
Other Examples
⢠Analyzing unemployment in France (SemStats'13)
â using background knowledge from DBpedia, Eurostat, Linked Geo Data
â exploiting links from DBpedia to GADM for visualization
16. 09/30/15 Heiko Paulheim 16
Other Examples
⢠Example correlations for unemployment in France:
â African islands, Islands in the Indian Ocean,
Outermost regions of the EU (positive)
â GDP (negative)
â Disposable income (negative)
â Hospital beds/inhabitants (negative)
â RnD spendings (negative)
â Energy consumption (negative)
â Population growth (positive)
â Casualties in traffic accidents (negative)
â Fast food restaurants (positive)
â Police stations (positive)
17. 09/30/15 Heiko Paulheim 17
Other Examples
⢠Data Set: Suicide rates by country
â http://www.washingtonpost.com/wp-srv/world/suiciderate.html
⢠Findings for suicide rates
â Democracies have lower suicide rates than other forms of government
â High HDI â low suicide rate
â High population density â high suicide rate
â By geography:
⢠At the sea â low
⢠In the mountains â high
â High Gini index â low suicide rate
⢠High Gini index â unequal distribution of wealth
â High usage of nuclear power â high suicide rates
18. 09/30/15 Heiko Paulheim 18
Other Examples
⢠Data set: Durex worldwise survey on sexual activity
â http://chartsbin.com/view/uya
⢠Findings:
â By geography:
⢠High in Europe, low in Asia
⢠Low in Island states
â By language:
⢠English speaking: low
⢠French speaking: high
â Low average age â high activity
â High GDP per capita â low activity
â High unemployment rate â high activity
â High number of ISP providers â low activity
20. 09/30/15 Heiko Paulheim 20
Other Use Cases
⢠Incident detection from Twitter
fire at #mannheim
#universityomg two cars on
fire #A5 #accident
fire at train station
still burning
my heart
is on fire!!!come on baby
light my fire
boss should fire
that stupid moron
21. 09/30/15 Heiko Paulheim 21
Other Use Cases
⢠Example set:
â âAgain crash on I90â
â âAccident on I90â
dbpedia:Interstate_90
dbpedia-owl:Road
rdf:type
dbpedia:Interstate_51
rdf:type
⢠Model:
â dbpedia-owl:Road â indicates traffic accident
⢠Applying the model:
â âTwo cars collided on I51â â indicates traffic accident
⢠Using LOD+RapidMiner
â automatically learns a model
â avoids overfitting
22. 09/30/15 Heiko Paulheim 22
Other Use Cases
⢠Building Semantic Recommeder Systems (ESWC'14)
⢠Combines two extensions:
â Linked Open Data extension
â Recommender system extension
⢠Use data about books
for content-based recommender
â best system (out of 24)
on two out of three tasks
â used data from DBpedia
and RDF Book Mashup
23. 09/30/15 Heiko Paulheim 23
What is Special about Hilversum?
⢠Compare Hilversum to other Cities in the Netherlands
â find distinctive features
⢠Finding the needles in the haystack
of statements about Hilversum
in DBpedia
24. 09/30/15 Heiko Paulheim 24
What is Special about Hilversum?
⢠Compare Hilversum to other Cities in the Netherlands
â find distinctive features
25. 09/30/15 Heiko Paulheim 25
What is Special about Hilversum?
⢠Compare Hilversum to other Cities in the Netherlands
â find distinctive features
⢠TopFacts application
â Demonstration at ISWC 2015
â Combines Linked Open Data with attribute-wise outlier detection
[see Paulheim/Meusel, Machine Learning 100(2-3), 2015]
26. 09/30/15 Heiko Paulheim 26
What is Special about Hilversum?
⢠Compare Hilversum to other Cities in the Netherlands
â find distinctive features
⢠Hilversum is
â a city where the modern Pentathlon olympics have been held
â the headquarter of many media companies
â a place where many music recordings have been made
27. 09/30/15 Heiko Paulheim 27
Other Use Cases
⢠Debugging Linked Open Data
â loading a set of statements
â augment with additional features
â run outlier detection
⢠again: a special
extension
⢠Example: identify wrong
dataset interlinks
(WoDOOM'14)
â AUC up to 85%
28. 09/30/15 Heiko Paulheim 28
Summary
⢠The RapidMiner LOD Extension
â brings data analysis to the web of data
â can be used by data analysts without learning SPARQL
⢠Availability
â on the RapidMiner
marketplace
â installable from
inside RapidMiner
â >9,000 installations
and counting
29. 09/30/15 Heiko Paulheim 29
Take Home Messages
⢠The Web is full of data
â ...and more and more becomes Linked Data
⢠Intelligent data processing
â helps unlocking the potential of that data
â enables intelligent applications
⢠A good fit
â Sophisticated analytics platforms
(e.g., RapidMiner), and
â Linked Open Data