10/08/13 Heiko Paulheim 1
Exploiting Linked Open Data
as Background Knowledge in Data Mining
Heiko Paulheim, University of Mannheim
Outline
• Motivation
• The original FeGeLOD framework
• Experiments
• Applications
• The RapidMiner Linked Open Data Extension
• Challenges and Future Work
Motivation: An Example Data Mining Task
• Analyzing book sales
ISBN           City       Sold
3-2347-3427-1  Darmstadt  124
3-43784-324-2  Mannheim   493
3-145-34587-0  Roßdorf    14
...

ISBN           City       Population  ...  Genre   Publisher     ...  Sold
3-2347-3427-1  Darmstadt  144402      ...  Crime   Bloody Books  ...  124
3-43784-324-2  Mannheim   291458      ...  Crime   Guns Ltd.     ...  493
3-145-34587-0  Roßdorf    12019       ...  Travel  Up&Away       ...  14
...
→ Crime novels sell better in larger cities
Motivation
• Many data mining problems are solved better
– when you have more background knowledge
(leaving scalability aside)
• Problems:
– Tedious work
– Selection bias: what to include?
Motivation
http://lod-cloud.net/
Motivation
• Idea:
– reuse background knowledge from Linked Open Data
– include it in the data mining process as needed
• Two main variants:
– develop mining/learning algorithms that run directly on Linked Data
– create relational features from Linked Data
Motivation
• Develop mining/learning algorithms
– e.g., DL Learner
– e.g., dedicated Kernel functions
• Advantages:
– can be quite efficient
– no reduction to “flat” table structure
– semantics can be respected directly
Motivation
• Create relational features
– e.g., LiDDM
– e.g., AutoSPARQL
– e.g., FeGeLOD / RapidMiner Linked Open Data Extension
• Advantages:
– Easy combination of knowledge from various sources
• including relational features in the original data
– Arbitrary mining algorithms/tools possible
FeGeLOD – Feature Generation from LOD
• Input: ISBN = 3-2347-3427-1, City = Darmstadt, # sold = 124
• Named Entity Recognition
– adds City_URI = http://dbpedia.org/resource/Darmstadt
• Feature Generation
– adds City_URI_dbpedia-owl:populationTotal = 141471
– adds City_URI_... = ...
• Feature Selection
– keeps ISBN, City, # sold, City_URI, and City_URI_dbpedia-owl:populationTotal
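The first two pipeline steps (entity recognition by guessing a URI, then feature generation) can be sketched in a few lines of Python. This is only an illustration, not the FeGeLOD implementation: `TOY_LOD` is a hypothetical in-memory stand-in for a SPARQL endpoint, and `guess_uri` and `enrich` are made-up names. Only the attribute naming scheme and the population value are taken from the slide.

```python
# Hypothetical stand-in for a Linked Open Data source:
# (subject URI, property) -> value. A real setup would query an endpoint.
TOY_LOD = {
    ("http://dbpedia.org/resource/Darmstadt", "dbpedia-owl:populationTotal"): 141471,
}


def guess_uri(label: str) -> str:
    """Simple entity recognition: guess a DBpedia URI from a label."""
    return "http://dbpedia.org/resource/" + label.strip().replace(" ", "_")


def enrich(record: dict, column: str) -> dict:
    """Add <column>_URI plus one attribute per property known for that URI."""
    enriched = dict(record)
    uri = guess_uri(record[column])
    enriched[f"{column}_URI"] = uri
    for (subject, prop), value in TOY_LOD.items():
        if subject == uri:
            enriched[f"{column}_URI_{prop}"] = value
    return enriched


row = enrich({"ISBN": "3-2347-3427-1", "City": "Darmstadt", "# sold": 124}, "City")
print(row)
```

Guessing URIs from labels only works when the label matches the resource name, which is one reason linking is listed as an open challenge later in the talk.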
FeGeLOD – Feature Generation from LOD
• Original prototype, based on Weka:
– Simple NER (guessing URIs)
– Seven generators:
• direct types
• data properties
• unqualified relations (boolean, numeric)
• qualified relations (boolean, numeric)
• individuals (dangerous!) - may be restricted to specific property
– Simple feature selection: filtering features
• that have only* different values (except numerical)
• that have only* identical values
• that are mostly missing*
*) 95% or 99%
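The three filtering heuristics could look roughly like the sketch below. How exactly the original prototype computes the ratios is not stated on the slide, so the definitions here (share of missing values, share of the most frequent value, share of distinct values) are assumptions, and `keep_feature` is an illustrative name.

```python
from collections import Counter


def is_numeric(value) -> bool:
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def keep_feature(values: list, threshold: float = 0.95) -> bool:
    """Return False if the feature should be filtered out.

    None marks a missing value. The threshold corresponds to the
    95% / 99% setting mentioned on the slide.
    """
    n = len(values)
    if n == 0:
        return False
    present = [v for v in values if v is not None]
    # Heuristic 1: the feature is mostly missing.
    if (n - len(present)) / n >= threshold:
        return False
    counts = Counter(present)
    # Heuristic 2: (almost) all values are identical.
    if counts.most_common(1)[0][1] / len(present) >= threshold:
        return False
    # Heuristic 3: (almost) all values differ; numeric features are exempt.
    all_numeric = all(is_numeric(v) for v in present)
    if not all_numeric and len(counts) / len(present) >= threshold:
        return False
    return True


print(keep_feature(["Crime", "Travel"] * 50))          # informative nominal feature
print(keep_feature([str(i) for i in range(100)]))      # ID-like feature
```

Numeric features are exempt from the third heuristic because a column such as population is expected to have a different value in nearly every row.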
Experiments
• Testing with two* standard machine learning data sets
– Zoo: classifying animals
– AAUP: predicting income of university employees
(regression task)
• Question: how much improvement do additional features bring?
*) standard ML datasets with speaking labels are scarce!
Experiments: Zoo Dataset
First Results: AAUP
Experiments: Early Insights
• Additional features often improve the results
• Zoo dataset:
– Ripper: 89.11 to 96.04
– SMO: 93.07 to 97.03
– No improvement for Naive Bayes
• AAUP dataset (compensation):
– M5: 59.88 to 51.28
– SMO: 74.12 to 61.97
– No improvement for linear regression
• ...but they may also cause problems
– extreme example: 6.54 to 189.90 for linear regression
– memory and timeouts due to large datasets
Experiments: Quality of Features
• Information gain of features on Zoo dataset
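For readers unfamiliar with the measure, information gain is the reduction in class entropy obtained by splitting on a feature. The sketch below computes it for nominal features; the toy labels and feature values are hypothetical and are not taken from the Zoo dataset.

```python
from collections import Counter
from math import log2


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels) -> float:
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: a boolean type feature that separates the classes,
# and a constant feature that carries no information.
labels = ["mammal", "mammal", "bird", "bird"]
has_type_bird = [False, False, True, True]
constant = [True, True, True, True]
print(information_gain(has_type_bird, labels))
print(information_gain(constant, labels))
```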
Experiments: Quality of Features
• Information gain of features on AAUP dataset (compensation)
Application: Classifying Events from Wikipedia
• Event Extraction from Wikipedia
• Joint work with Dennis Wegener and Daniel Hienert (GESIS)
• Task: event classification (e.g., Politics, Sports, ...)
http://www.vizgr.org/historical-events/timeline/
Application: Classifying Events from Wikipedia
• Source Material:
http://www.vizgr.org/historical-events/timeline/
Application: Classifying Events from Wikipedia
• Positive Examples for class politics:
– 2011, March 15 - German chancellor Angela Merkel shuts down the seven
oldest German nuclear power plants.
– 2010, June 3 – Christian Wulff is nominated for President of Germany by
Angela Merkel.
• Negative Examples for class politics:
– 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first
time, along with Netherlands make the 2010 FIFA World Cup Final.
– 2012, February 16 – Roman Lob is selected to represent Germany in the
Eurovision Song Contest.
• Possible learned model:
– "Angela Merkel" → Politics
• How can we do better?
• Background knowledge from Linked Open Data
– 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts
down the seven oldest German nuclear power plants.
– 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class:
Politician] is elected to continue as Minister-President, heading an SPD-
Green coalition.
• Model learned in that case:
– "[class: Politician]" → Politics
• Much more general
– Can also classify events with politicians
not contained in the training set
• Fewer training examples required
– A few events with politicians, athletes, singers, ... are enough
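A small sketch of why the type-based model generalizes: instead of the entity name, the classifier sees the entity's class. `ENTITY_TYPES` is a hypothetical lookup table standing in for a type query against DBpedia (the class assigned to Roman Lob is an assumption), and `classify` hard-codes the single rule from the slides rather than learning it.

```python
# Hypothetical entity -> class lookup; a real system would link entities
# to DBpedia and read their rdf:type statements.
ENTITY_TYPES = {
    "Angela Merkel": "Politician",
    "Hannelore Kraft": "Politician",
    "Peter Altmaier": "Politician",
    "Roman Lob": "MusicalArtist",  # assumed class, for illustration only
}


def type_features(text: str) -> set:
    """Class features for all known entities mentioned in the text."""
    return {f"[class: {cls}]" for entity, cls in ENTITY_TYPES.items() if entity in text}


def classify(text: str) -> str:
    """Apply the rule "[class: Politician]" -> Politics."""
    return "Politics" if "[class: Politician]" in type_features(text) else "Other"


print(classify("Peter Altmaier wird zum Bundesumweltminister ernannt."))
print(classify("Roman Lob is selected to represent Germany in the Eurovision Song Contest."))
```

Because the rule refers to a class rather than to words of the sentence, it applies equally to text in a language that never appeared in the training data.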
Application: Classifying Events from Wikipedia
• Experiments on Wikipedia data
– >10 categories
– 1,000 labeled examples as training set
– Classification accuracy: 80%
• Plus:
– We have trained a language-independent model!
• often, models are like "elect*" → Politics
– 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von
Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt.
– 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för
Vänsterpartiet efter Lars Ohly [class: Politician].
Application: Classifying Tweets
• Joint work with Axel Schulz and Petar Ristoski (SAP Research)
• Goal: using Twitter for emergency management
– “fire at #mannheim #university”
– “omg two cars on fire #A5 #accident”
– “fire at train station still burning”
– “my heart is on fire!!!”
– “come on baby light my fire”
– “boss should fire that stupid moron”
Application: Classifying Tweets
• Social media contains data on many incidents
– But keyword search is not enough
– Detecting small incidents is hard
– Manual inspection is too expensive (and slow)
• Machine learning could help
– Train a model to classify incident/non-incident tweets
– Apply the model for detecting incident-related tweets
• Training data:
– Traffic accidents
– ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.),
hand labeled (50% related to traffic incidents)
Application: Classifying Tweets
• Learning to classify tweets:
– Positive and negative examples
– Features:
• Stemming
• POS tagging
• Word n-grams
• …
• Accuracy ~90%
• But
– Accuracy drops to ~85% when applying the model to a different city
Application: Classifying Tweets
• Example set:
– “Again crash on I90”
– “Accident on I90”
• Model:
– “I90” → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → not related to traffic accident
Using LOD for Preventing Overfitting
• Example set:
– “Again crash on I90”
– “Accident on I90”
– dbpedia:Interstate_90 rdf:type dbpedia-owl:Road
– dbpedia:Interstate_51 rdf:type dbpedia-owl:Road
• Model:
– dbpedia-owl:Road → indicates traffic accident
• Applying the model:
– “Two cars crashed on I51” → indicates traffic accident
• Using DBpedia Spotlight + FeGeLOD
– Accuracy keeps up at 90%
– Overfitting is avoided
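The effect can be reproduced with a toy learner. The sketch below is not the classifier used in the study: `learn_indicators` is a deliberately naive rule learner, and `TOKEN_TYPES` is a hypothetical lookup standing in for DBpedia Spotlight plus an rdf:type query. It only illustrates how replacing a token by its type changes what the model can learn.

```python
# Hypothetical token -> type lookup (stand-in for entity linking + rdf:type).
TOKEN_TYPES = {"I90": "dbpedia-owl:Road", "I51": "dbpedia-owl:Road"}

POSITIVES = ["Again crash on I90", "Accident on I90"]
NEGATIVES = ["my heart is on fire", "boss should fire that stupid moron"]


def features(tweet: str, use_lod: bool) -> set:
    """Lower-cased tokens; with use_lod, linked tokens are replaced by their type."""
    tokens = tweet.split()
    if use_lod:
        return {TOKEN_TYPES.get(t, t.lower()) for t in tokens}
    return {t.lower() for t in tokens}


def learn_indicators(positives, negatives, use_lod: bool) -> set:
    """Features present in every positive example and in no negative example."""
    pos = [features(t, use_lod) for t in positives]
    neg = set().union(*(features(t, use_lod) for t in negatives))
    return set.intersection(*pos) - neg


def is_incident(tweet: str, indicators: set, use_lod: bool) -> bool:
    """A tweet is flagged if it contains at least one learned indicator."""
    return bool(indicators & features(tweet, use_lod))


raw_model = learn_indicators(POSITIVES, NEGATIVES, use_lod=False)
lod_model = learn_indicators(POSITIVES, NEGATIVES, use_lod=True)
print(raw_model, is_incident("Two cars crashed on I51", raw_model, use_lod=False))
print(lod_model, is_incident("Two cars crashed on I51", lod_model, use_lod=True))
```

On the raw tokens the learner latches onto the road name seen in training, so a tweet about a different road is missed; with types it learns the more general indicator.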
Explaining Statistics
• Statistics are very widespread
– Quality of living in cities
– Corruption by country
– Fertility rate by country
– Suicide rate by country
– Box office revenue of films
– ...
Explaining Statistics
• Questions we are often interested in
– Why does city X have a high/low quality of living?
– Why is the corruption higher in country A than in country B?
– Will a new film create a high/low box office revenue?
• i.e., we are looking for
– explanations
– forecasts (e.g., extrapolations)
Explaining Statistics
http://xkcd.com/605/
Explaining Statistics
• What statistics often look like
Explaining Statistics
• There are powerful tools for finding correlations etc.
– but many statistics cannot be interpreted directly
– background knowledge is missing
• Approach:
– use Linked Open Data for enriching statistical data (e.g., FeGeLOD)
– run analysis tools for finding explanations
Prototype Tool: Explain-a-LOD
• Loads a statistics file (e.g., CSV)
• Adds background knowledge
• Runs basic analysis (correlation, rule learning)
• Presents explanations
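The correlation part of such an analysis could be sketched as follows: after enrichment, each numeric feature column is correlated with the target statistic, and the strongest correlations are presented as candidate explanations. The table values below are invented for illustration, and `rank_features` is a hypothetical helper, not part of Explain-a-LOD.

```python
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(table: dict, target: str) -> list:
    """Rank feature columns by absolute correlation with the target column."""
    scores = {
        name: pearson(column, table[target])
        for name, column in table.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented toy table: one target statistic and two generated features.
table = {
    "quality": [100, 80, 60, 40],
    "junHighC": [20, 24, 28, 32],
    "noise": [1, 3, 2, 3],
}
ranked = rank_features(table, "quality")
print(ranked)
```

A strong correlation found this way is only a candidate explanation; as the later slides warn, it says nothing about causation.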
Statistical Data: Examples
• Data Set: Mercer Quality of Living
– Quality of living in 216 cities worldwide
– norm: NYC=100 (value range 23-109)
– As of 1999
– http://across.co.nz/qualityofliving.htm
• LOD data sets used in the examples:
– DBpedia
– CIA World Factbook for statistics by country
Statistical Data: Examples
• Examples for low quality cities
– big hot cities (junHighC >= 27 and areaTotalKm >= 334)
– cold cities where no music has ever been recorded
(recordedIn_in = false and janHighC <= 16)
– latitude <= 24 and longitude <= 47
• a very accurate rule
• but what's the interpretation?
[Image: road sign reading “Next Record Studio 2547 miles”]
Statistical Data: Examples
Statistical Data: Examples
• Data Set: Transparency International
– 177 Countries and a corruption perception indicator
(between 1 and 10)
– As of 2010
– http://www.transparency.org/cpi2010/results
Statistical Data: Examples
• Example rules for countries with low corruption
– HDI > 78%
• Human Development Index, calculated from
life expectancy, education level, economic performance
– OECD member states
– Foundation place of more than nine organizations
– More than ten mountains
– More than ten companies with their headquarters in that state,
but fewer than two cargo airlines
Statistical Data: Examples
• Data Set: Burnout rates
– 16 German DAX companies
– Absolute and relative numbers
– As of 2011
– http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
Evaluation of Feature Quality
• Quality of living dataset
[Figure: chart comparing the feature generators (data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), joint) on a scale from 1 to 5, with one series each for correlation and rule learning]
Evaluation of Feature Quality
• Corruption dataset
[Figure: chart comparing the feature generators (data values, type, unqualified relation (boolean), unqualified relation (numeric), qualified relation (boolean), qualified relation (numeric), joint) on a scale from 1 to 5, with one series each for correlation and rule learning]
Statistical Data: Examples
• Findings for burnout rates
– Positive correlation between turnover and burnout rates
– Car manufacturers are less prone to burnout
– German companies are less prone to burnout than international ones
• Exception: Frankfurt
Statistical Data: Examples
• Data Set: Antidepressant consumption
– In European countries
– Source: OECD
– http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
Statistical Data: Examples
• Findings for antidepressant consumption
– Larger countries have higher consumption
– Low HDI → high consumption
– By geography:
• Nordic countries, countries at the Atlantic: high
• Mediterranean: medium
• Alpine countries: low
– High average age → high consumption
– High birth rates → high consumption
Statistical Data: Examples
• Data Set: Suicide rates
– By country
– OECD states
– As of 2005
– http://www.washingtonpost.com/wp-srv/world/suiciderate.html
Statistical Data: Examples
• Findings for suicide rates
– Democracies have lower suicide rates than other forms of government
– High HDI → low suicide rate
– High population density → high suicide rate
– By geography:
• At the sea → low
• In the mountains → high
– High Gini index → low suicide rate
• High Gini index ↔ unequal distribution of wealth
– High usage of nuclear power → high suicide rates
Statistical Data: Examples
• Data set: sexual activity
– Percentage of people having sex weekly
– By country
– Survey by Durex 2005-2009
– http://chartsbin.com/view/uya
Statistical Data: Examples
• Findings on sexual activity
– By geography:
• High in Europe, low in Asia
• Low in Island states
– By language:
• English speaking: low
• French speaking: high
– Low average age → high activity
– High GDP per capita → low activity
– High unemployment rate → high activity
– High number of ISPs → low activity
Try it... but be careful!
• Download from
http://www.ke.tu-darmstadt.de/resources/explain-a-lod
• including a demo video, papers, etc.
http://xkcd.com/552/
RapidMiner Linked Open Data Extension
• August 16th, 2013: FeGeLOD celebrates its 2nd birthday
• Problems
– still no nice UI
– special configurations are tricky
– difficult to enhance
• Decision
– Reimplementation on RapidMiner platform
– September 13th, 2013: Release of RapidMiner Linked Open Data Extension
– Available from RapidMiner marketplace
• http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
RapidMiner Linked Open Data Extension
• Simple wiring of operators
– linkers
– generators
• Combination with powerful RapidMiner operators
RapidMiner Linked Open Data Extension
• Easy SPARQL endpoint definitions
• Support of custom SPARQL statements
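To give an idea of what such statements look like, the sketch below builds two query strings in Python: one that reads a single data property, and one that counts the values of a relation (the basis of a numeric relation feature). The function names are hypothetical and this is not the extension's API; the queries themselves are plain SPARQL, with `COUNT` requiring an endpoint that supports aggregates.

```python
def data_property_query(entity_uri: str, property_uri: str) -> str:
    """SPARQL query returning one value of a data property for an entity."""
    return (
        "SELECT ?value WHERE { "
        f"<{entity_uri}> <{property_uri}> ?value "
        "} LIMIT 1"
    )


def relation_count_query(entity_uri: str, property_uri: str) -> str:
    """SPARQL query counting the objects an entity is related to via a property."""
    return (
        "SELECT (COUNT(?o) AS ?n) WHERE { "
        f"<{entity_uri}> <{property_uri}> ?o "
        "}"
    )


query = data_property_query(
    "http://dbpedia.org/resource/Darmstadt",
    "http://dbpedia.org/ontology/populationTotal",
)
print(query)
```

The resulting strings can be sent to any SPARQL endpoint with an HTTP client; executing them is left out here so that the example runs without network access.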
Challenges and Future Work
• SPARQL variants
– Some endpoints support special/non-standard SPARQL constructs
– COUNT(...)
– transitive closure
– exploit where applicable
• Implementations without SPARQL
– Freebase
– OpenCyc
Challenges and Future Work
• Linking is still challenging
– URI patterns are not flexible
– Search by label is time consuming
– Services like DBpedia Lookup are scarce
• Limitations of completely unsupervised linking
– e.g., Hurricanes
– how to use headlines/attribute names?
Challenges and Future Work
• Linking as optimization problem
– find candidates for all entities, e.g., by DBpedia lookup
– find a selection of candidates that are most similar to each other
• e.g., all of them are U.S. cities
– some experiments with types and categories
• problem: not complete
– some problems cannot be addressed (e.g.: Hurricanes)
• Alternatives:
– semi supervised linking – user provides some example links
– active learning
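The optimization view of linking can be illustrated with a brute-force sketch: every label has several candidate URIs, and the selection whose candidates share the most types wins. The candidate URIs and type sets in `CANDIDATES` are hypothetical, Jaccard similarity over types is an assumed similarity measure, and exhaustive search is used only because the example is tiny.

```python
from itertools import product

# Hypothetical lookup results: label -> {candidate URI: set of types}.
CANDIDATES = {
    "Phoenix": {
        "dbpedia:Phoenix_(mythology)": {"MythologicalFigure"},
        "dbpedia:Phoenix,_Arizona": {"City", "Place"},
    },
    "Buffalo": {
        "dbpedia:Bison": {"Animal"},
        "dbpedia:Buffalo,_New_York": {"City", "Place"},
    },
}


def jaccard(a: set, b: set) -> float:
    """Overlap of two type sets, between 0 and 1."""
    return len(a & b) / len(a | b) if a | b else 0.0


def best_linking(candidates: dict) -> dict:
    """Pick one candidate per label so that pairwise type similarity is maximal."""
    labels = list(candidates)
    best, best_score = None, -1.0
    for combo in product(*(candidates[label].items() for label in labels)):
        score = sum(
            jaccard(types_a, types_b)
            for i, (_, types_a) in enumerate(combo)
            for _, types_b in combo[i + 1:]
        )
        if score > best_score:
            best, best_score = combo, score
    return {label: uri for label, (uri, _) in zip(labels, best)}


linking = best_linking(CANDIDATES)
print(linking)
```

The number of combinations grows exponentially with the number of entities, so a realistic implementation would need a greedy or otherwise approximate search.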
Challenges and Future Work
• Exploiting semantics for feature selection
• Given two features:
– f1: type(RoadsInAlaska)
– f2: type(Road)
• and the schema definition RoadsInAlaska rdfs:subClassOf Road
• Exploit that information for feature selection
– e.g., gain(f1) ≈ gain(f2), f1<f2 → remove f1
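A possible reading of this heuristic, assuming RoadsInAlaska is the subclass of Road: when a specific type feature and one of its superclass features have roughly the same information gain, the specific one is redundant and can be dropped. The `SUPERCLASS` map, the tolerance value, and the gain numbers below are assumptions made for illustration.

```python
# Assumed schema: each class mapped to its direct superclass.
SUPERCLASS = {"RoadsInAlaska": "Road"}


def is_subclass_of(cls: str, ancestor: str) -> bool:
    """Walk up the (acyclic) hierarchy to check for a transitive subclass relation."""
    while cls in SUPERCLASS:
        cls = SUPERCLASS[cls]
        if cls == ancestor:
            return True
    return False


def prune(gains: dict, tolerance: float = 0.01) -> set:
    """Keep all type features except those matched in gain by a superclass feature."""
    removed = set()
    for specific in gains:
        for general in gains:
            similar = abs(gains[specific] - gains[general]) <= tolerance
            if is_subclass_of(specific, general) and similar:
                removed.add(specific)
    return set(gains) - removed


print(prune({"RoadsInAlaska": 0.300, "Road": 0.305}))
print(prune({"RoadsInAlaska": 0.600, "Road": 0.300}))
```

In the second call both features are kept, since the specific type carries information that the general one does not.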
Challenges and Future Work
• Incompleteness of LOD
– e.g., type information in DBpedia
– may lead to findings such as
• if a city is of type Place, the quality of living is high
– possible remedy: autocomplete on the dataset
(e.g., Paulheim/Bizer 2013)
• Biases in LOD
– e.g., DBpedia has a bias towards western culture
– may lead to findings such as
• if many records have been made in a city, the quality of living is high
Challenges and Future Work
• Features not used for scalability reasons:
– features for single entities
• e.g., “Roman Polanski directorOf X”
– features more than one hop away
• e.g., “Cities with a university which has a computer science department”
– some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990”
• but subject to YAGO's selection bias
• Approaches are required to use such features
– which respect scalability
– “generate first, filter later” is not the best solution
• e.g., “Cities with at least one of ArtSchoolsInParis”
– on-the-fly filtering may be more suitable
• e.g., sampling
Challenges and Future Work
• Automatically exploit data sources with non-simple structures
EU18931 a Funding .
EU18931 has-grant-value [
    has-amount 1300000 ;
    has-unit-of-measure EUR
] .
(real example from the CORDIS dataset)
• Support geo/temporal features
– e.g., Data Cubes
– e.g., Linked Geo Data
• Construct complex features (in a scalable way!)
– e.g., cinemas per inhabitant
Wrap-up
• Linked Data is useful as background knowledge
– especially on problems which have little knowledge in themselves
• Unsupervised methods
– avoid biases and work without knowledge about LOD
– but: scalability and generality problems
• RapidMiner LOD extension
– a constantly growing toolkit
Credits & Thanks
• Past contributors of FeGeLOD:
– Johannes Fürnkranz
– Raad Bahmani
– Alexander Gabriel
– Simon Holthausen
• Current team of RapidMiner Linked Open Data Extension:
– Chris Bizer
– Petar Ristoski
– Evgeny Mitichkin

Weitere ähnliche Inhalte

Andere mochten auch

Proposal for a quality framework for the evaluation of administrative and sur...
Proposal for a quality framework for the evaluation of administrative and sur...Proposal for a quality framework for the evaluation of administrative and sur...
Proposal for a quality framework for the evaluation of administrative and sur...Piet J.H. Daas
 
Quality management system procedures
Quality management system proceduresQuality management system procedures
Quality management system proceduresselinasimpson2101
 
Quality framework
Quality frameworkQuality framework
Quality frameworksaurabhshri
 
WebeX Presentation - Quality Consortium
WebeX Presentation - Quality ConsortiumWebeX Presentation - Quality Consortium
WebeX Presentation - Quality ConsortiumThe Avoca Group
 
Sharepoint quality management system
Sharepoint quality management systemSharepoint quality management system
Sharepoint quality management systemselinasimpson2101
 
Process asset library as process improvement and knowledge sharing tool
Process asset library as process improvement and knowledge sharing toolProcess asset library as process improvement and knowledge sharing tool
Process asset library as process improvement and knowledge sharing toolKobi Vider
 
2004 E2M - The ShopView Story Information Package.PDF
2004 E2M - The ShopView Story Information Package.PDF2004 E2M - The ShopView Story Information Package.PDF
2004 E2M - The ShopView Story Information Package.PDFMelissa Jones
 
QMS SharePoint Structure Definition Document
QMS SharePoint Structure Definition DocumentQMS SharePoint Structure Definition Document
QMS SharePoint Structure Definition DocumentMelissa Jones
 
Part 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Part 3 - SharePoint QMS Anyone Can Make - Data DictionaryPart 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Part 3 - SharePoint QMS Anyone Can Make - Data DictionaryMelissa Jones
 
QMS SharePoint Wireframe - download and edit for you use
QMS SharePoint Wireframe - download and edit for you useQMS SharePoint Wireframe - download and edit for you use
QMS SharePoint Wireframe - download and edit for you useMelissa Jones
 
Quality framework 1
Quality framework 1Quality framework 1
Quality framework 1Shwetha Bhat
 
Metadata Quality Assurance Framework at QQML2016 conference - full version
Metadata Quality Assurance Framework at QQML2016 conference - full versionMetadata Quality Assurance Framework at QQML2016 conference - full version
Metadata Quality Assurance Framework at QQML2016 conference - full versionPéter Király
 
Quality measurement - How to measure the quality of any object?
Quality measurement - How to measure the quality of any object?Quality measurement - How to measure the quality of any object?
Quality measurement - How to measure the quality of any object?Grzegorz Grela
 
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...Barry Peters
 
QMS Calibration Powerpoint
QMS Calibration PowerpointQMS Calibration Powerpoint
QMS Calibration PowerpointDennis J Morgan
 
PAS: The Planning Quality Framework
PAS: The Planning Quality FrameworkPAS: The Planning Quality Framework
PAS: The Planning Quality FrameworkPAS_Team
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewKevin Watters
 

Andere mochten auch (20)

Proposal for a quality framework for the evaluation of administrative and sur...
Proposal for a quality framework for the evaluation of administrative and sur...Proposal for a quality framework for the evaluation of administrative and sur...
Proposal for a quality framework for the evaluation of administrative and sur...
 
Quality management system procedures
Quality management system proceduresQuality management system procedures
Quality management system procedures
 
Quality framework
Quality frameworkQuality framework
Quality framework
 
Bpo risk management 2013
Bpo risk management 2013Bpo risk management 2013
Bpo risk management 2013
 
WebeX Presentation - Quality Consortium
WebeX Presentation - Quality ConsortiumWebeX Presentation - Quality Consortium
WebeX Presentation - Quality Consortium
 
Sharepoint quality management system
Sharepoint quality management systemSharepoint quality management system
Sharepoint quality management system
 
Mixed Methods Research
Mixed Methods ResearchMixed Methods Research
Mixed Methods Research
 
Process asset library as process improvement and knowledge sharing tool
Process asset library as process improvement and knowledge sharing toolProcess asset library as process improvement and knowledge sharing tool
Process asset library as process improvement and knowledge sharing tool
 
2004 E2M - The ShopView Story Information Package.PDF
2004 E2M - The ShopView Story Information Package.PDF2004 E2M - The ShopView Story Information Package.PDF
2004 E2M - The ShopView Story Information Package.PDF
 
QMS SharePoint Structure Definition Document
QMS SharePoint Structure Definition DocumentQMS SharePoint Structure Definition Document
QMS SharePoint Structure Definition Document
 
Part 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Part 3 - SharePoint QMS Anyone Can Make - Data DictionaryPart 3 - SharePoint QMS Anyone Can Make - Data Dictionary
Part 3 - SharePoint QMS Anyone Can Make - Data Dictionary
 
QMS SharePoint Wireframe - download and edit for you use
QMS SharePoint Wireframe - download and edit for you useQMS SharePoint Wireframe - download and edit for you use
QMS SharePoint Wireframe - download and edit for you use
 
Quality framework 1
Quality framework 1Quality framework 1
Quality framework 1
 
Metadata Quality Assurance Framework at QQML2016 conference - full version
Metadata Quality Assurance Framework at QQML2016 conference - full versionMetadata Quality Assurance Framework at QQML2016 conference - full version
Metadata Quality Assurance Framework at QQML2016 conference - full version
 
Quality measurement - How to measure the quality of any object?
Quality measurement - How to measure the quality of any object?Quality measurement - How to measure the quality of any object?
Quality measurement - How to measure the quality of any object?
 
Audit Quality Framework & Proportionate Application of ISAs
Audit Quality Framework & Proportionate Application of ISAsAudit Quality Framework & Proportionate Application of ISAs
Audit Quality Framework & Proportionate Application of ISAs
 
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
15 Months to Certification: Using SharePoint as the Platform for an ISO 9001 ...
 
QMS Calibration Powerpoint
QMS Calibration PowerpointQMS Calibration Powerpoint
QMS Calibration Powerpoint
 
PAS: The Planning Quality Framework
PAS: The Planning Quality FrameworkPAS: The Planning Quality Framework
PAS: The Planning Quality Framework
 
Solr 6.0 Graph Query Overview
Solr 6.0 Graph Query OverviewSolr 6.0 Graph Query Overview
Solr 6.0 Graph Query Overview
 

Ähnlich wie Exploiting Linked Open Data as Background Knowledge in Data Mining

Understanding the Cuban Blogosphere: Retrospective and Perspectives based on ...
Understanding the Cuban Blogosphere: Retrospective and Perspectives based on ...Understanding the Cuban Blogosphere: Retrospective and Perspectives based on ...
Understanding the Cuban Blogosphere: Retrospective and Perspectives based on ...Dagmar Monett
 
Creation of custom KOS-based recommendation systems
Creation of custom KOS-based recommendation systemsCreation of custom KOS-based recommendation systems
Creation of custom KOS-based recommendation systemsGESIS
 
Linked Open Data enhanced Knowledge Discovery
Linked Open Data enhanced  Knowledge DiscoveryLinked Open Data enhanced  Knowledge Discovery
Linked Open Data enhanced Knowledge DiscoveryHeiko Paulheim
 
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...
Using Knowledge Graphs in Data Science - From Symbolic to Latent Representati...Heiko Paulheim
 
Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Make Embeddings Semantic Again!
Make Embeddings Semantic Again!Heiko Paulheim
 
Open Learning Analytics LSAC2018
Open Learning Analytics LSAC2018Open Learning Analytics LSAC2018
Open Learning Analytics LSAC2018Ian Dolphin
 
Introduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UKIntroduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UKslejay
 
Machine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsMachine Learning with and for Semantic Web Knowledge Graphs
Machine Learning with and for Semantic Web Knowledge GraphsHeiko Paulheim
 
Open Data and Data Journalism
Open Data and Data JournalismOpen Data and Data Journalism
Open Data and Data JournalismIrina Radchenko
 
Research Data management - Importance, Good Practices, Guidance
Research Data management - Importance, Good Practices, GuidanceResearch Data management - Importance, Good Practices, Guidance
Research Data management - Importance, Good Practices, GuidanceFrank Uiterwaal
 
Strata Big data presentation
Strata Big data presentationStrata Big data presentation
Strata Big data presentationPiet J.H. Daas
 
Weakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterWeakly Supervised Learning for Fake News Detection on Twitter
Weakly Supervised Learning for Fake News Detection on TwitterHeiko Paulheim
 
Big data as a source for official statistics
Big data as a source for official statisticsBig data as a source for official statistics
Big data as a source for official statisticsEdwin de Jonge
 
How to read a million books?
How to read a million books?How to read a million books?
How to read a million books?cneudecker
 
Professional Information Research
Professional Information ResearchProfessional Information Research
Professional Information ResearchEric Kokke
 
Data Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsData Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsMohd Izhar Firdaus Ismail
 
ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)Alex Clark
 

Exploiting Linked Open Data as Background Knowledge in Data Mining

  • 1. 10/08/13 Heiko Paulheim 1 Exploiting Linked Open Data as Background Knowledge in Data Mining Heiko Paulheim, University of Mannheim
  • 2. 10/08/13 Heiko Paulheim 2 Outline • Motivation • The original FeGeLOD framework • Experiments • Applications • The RapidMiner Linked Open Data Extension • Challenges and Future Work
  • 3. 10/08/13 Heiko Paulheim 3 Motivation: An Example Data Mining Task • Analyzing book sales
    Original data:
      ISBN            City        Sold
      3-2347-3427-1   Darmstadt   124
      3-43784-324-2   Mannheim    493
      3-145-34587-0   Roßdorf     14
      ...
    With background knowledge:
      ISBN            City        Population   ...   Genre    Publisher      ...   Sold
      3-2347-3427-1   Darmstadt   144402       ...   Crime    Bloody Books   ...   124
      3-43784-324-2   Mannheim    291458       ...   Crime    Guns Ltd.      ...   493
      3-145-34587-0   Roßdorf     12019        ...   Travel   Up&Away        ...   14
      ...
    → Crime novels sell better in larger cities
  • 4. 10/08/13 Heiko Paulheim 4 Motivation • Many data mining problems are solved better – when you have more background knowledge (leaving scalability aside) • Problems: – Tedious work – Selection bias: what to include?
  • 5. 10/08/13 Heiko Paulheim 5 Motivation http://lod-cloud.net/
  • 6. 10/08/13 Heiko Paulheim 6 Motivation • Idea: – reuse background knowledge from Linked Open Data – include it in the data mining process as needed • Two main variants: – develop mining/learning algorithms that run directly on Linked Data – create relational features from Linked Data
  • 7. 10/08/13 Heiko Paulheim 7 Motivation • Develop mining/learning algorithms – e.g., DL Learner – e.g., dedicated Kernel functions • Advantages: – can be quite efficient – no reduction to “flat” table structure – semantics can be respected directly
  • 8. 10/08/13 Heiko Paulheim 8 Motivation • Create relational features – e.g., LiDDM – e.g., AutoSPARQL – e.g., FeGeLOD / RapidMiner Linked Open Data Extension • Advantages: – Easy combination of knowledge from various sources • including relational features in the original data – Arbitrary mining algorithms/tools possible
  • 9. 10/08/13 Heiko Paulheim 9 FeGeLOD – Feature Generation from LOD • Input record: ISBN 3-2347-3427-1, City Darmstadt, # sold 124 • Named Entity Recognition: adds City_URI http://dbpedia.org/resource/Darmstadt • Feature Generation: adds City_URI_dbpedia-owl:populationTotal 141471, City_URI_... ... • Feature Selection: keeps ISBN, City, # sold, City_URI, and City_URI_dbpedia-owl:populationTotal 141471
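The entity recognition step in this pipeline can be illustrated with a minimal Python sketch. It assumes the simplest possible "URI guessing": the label is turned into a DBpedia resource name by replacing spaces with underscores. The function name `guess_dbpedia_uri` and the set of characters left unescaped are illustrative assumptions, not part of FeGeLOD, and a guessed URI is not guaranteed to exist in DBpedia.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def guess_dbpedia_uri(label: str) -> str:
    """Guess a DBpedia resource URI from a plain-text label (hypothetical helper)."""
    # Assumption: resource names mirror the label, with spaces as underscores.
    name = "_".join(label.strip().split())
    # Percent-encode characters that are not safe in a URI path segment.
    return DBPEDIA_RESOURCE + quote(name, safe="_(),'")


# The input record from the slide, extended with the guessed URI column.
record = {"ISBN": "3-2347-3427-1", "City": "Darmstadt", "# sold": 124}
record["City_URI"] = guess_dbpedia_uri(record["City"])
```

A real pipeline would still have to check that the guessed URI resolves, and disambiguate labels that match more than one resource.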
  • 10. 10/08/13 Heiko Paulheim 10 FeGeLOD – Feature Generation from LOD • Original prototype, based on Weka: – Simple NER (guessing URIs) – Seven generators: • direct types • data properties • unqualified relations (boolean, numeric) • qualified relations (boolean, numeric) • individuals (dangerous!) - may be restricted to specific property – Simple feature selection: filtering features • that have only* different values (except numerical) • that have only* identical values • that are mostly missing* *) 95% or 99%
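The simple feature selection described on this slide could be sketched as follows in Python. This is an interpretation, not the Weka-based prototype: it assumes "only different values" means the share of distinct values among the non-missing ones reaches the threshold, and "only identical values" means the most frequent value reaches it. The function name `select_features` and the dictionary-per-row layout are hypothetical.

```python
from collections import Counter


def select_features(rows, threshold=0.95):
    """Return the column names that survive the three simple filters."""
    if not rows:
        return []
    kept = []
    n = len(rows)
    # Assumption: every row has the same keys as the first one.
    for col in rows[0].keys():
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        # Filter: mostly missing values.
        if (n - len(present)) / n >= threshold:
            continue
        counts = Counter(present)
        # Filter: (almost) all values identical.
        if counts.most_common(1)[0][1] / len(present) >= threshold:
            continue
        # Filter: (almost) all values different; numeric columns are exempt.
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if not numeric and len(counts) / len(present) >= threshold:
            continue
        kept.append(col)
    return kept
```

With `threshold=0.99` the filters become more permissive, which corresponds to the second setting named on the slide.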
  • 11. 10/08/13 Heiko Paulheim 11 Experiments • Testing with two* standard machine learning data sets – Zoo: classifying animals – AAUP: predicting income of university employees (regression task) • Question: how much improvement do additional features bring? *) standard ML datasets with speaking labels are scarce!
  • 12. 10/08/13 Heiko Paulheim 12 Experiments: Zoo Dataset
  • 13. 10/08/13 Heiko Paulheim 13 First Results: AAUP
  • 14. 10/08/13 Heiko Paulheim 14 Experiments: Early Insights • Additional features often improve the results • Zoo dataset: – Ripper: 89.11 to 96.04 – SMO: 93.07 to 97.03 – No improvement for Naive Bayes • AAUP dataset (compensation): – M5: 59.88 to 51.28 – SMO: 74.12 to 61.97 – No improvement for linear regression • ...but they may also cause problems – extreme example: 6.54 to 189.90 for linear regression – memory and timeouts due to large datasets
  • 15. 10/08/13 Heiko Paulheim 15 Experiments: Quality of Features • Information gain of features on Zoo dataset
  • 16. 10/08/13 Heiko Paulheim 16 Experiments: Quality of Features • Information gain of features on AAUP dataset (compensation)
  • 17. 10/08/13 Heiko Paulheim 17 Application: Classifying Events from Wikipedia • Event Extraction from Wikipedia • Joint work with Dennis Wegener and Daniel Hienert (GESIS) • Task: event classification (e.g., Politics, Sports, ...) http://www.vizgr.org/historical-events/timeline/
  • 18. 10/08/13 Heiko Paulheim 18 Application: Classifying Events from Wikipedia • Source Material: http://www.vizgr.org/historical-events/timeline/
  • 19. 10/08/13 Heiko Paulheim 19 Application: Classifying Events from Wikipedia • Positive Examples for class politics: – 2011, March 15 - German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants. – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel. • Negative Examples for class politics: – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final. – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest.
  • 20. 10/08/13 Heiko Paulheim 20 Application: Classifying Events from Wikipedia • Positive Examples for class politics: – 2011, March 15 - German chancellor Angela Merkel shuts down the seven oldest German nuclear power plants. – 2010, June 3 – Christian Wulff is nominated for President of Germany by Angela Merkel. • Negative Examples for class politics: – 2010, July 7 – Spain defeats Germany 1-0 to win its semi-final and for its first time, along with Netherlands make the 2010 FIFA World Cup Final. – 2012, February 16 – Roman Lob is selected to represent Germany in the Eurovision Song Contest. • Possible learned model: – "Angela Merkel" → Politics
  • 21. 10/08/13 Heiko Paulheim 21 Application: Classifying Events from Wikipedia • Possibly Learned Model: – "Angela Merkel" → Politics • How can we do better? • Background knowledge from Linked Open Data – 2011, March 15 - German chancellor Angela Merkel [class: Politician] shuts down the seven oldest German nuclear power plants. – 2012, May 13, Elections in North Rhine-Westphalia – Hannelore Kraft [class: Politician] is elected to continue as Minister-President, heading an SPD-Green coalition. • Model learned in that case: – "[class: Politician]" → Politics
  • 22. 10/08/13 Heiko Paulheim 22 Application: Classifying Events from Wikipedia • Model learned in that case: – "[class: Politician]" → Politics • Much more general – Can also classify events with politicians not contained in the training set • Fewer training examples required – A few events with politicians, athletes, singers, ... are enough
  • 23. 10/08/13 Heiko Paulheim 23 Application: Classifying Events from Wikipedia • Experiments on Wikipedia data – >10 categories – 1,000 labeled examples as training set – Classification accuracy: 80% • Plus: – We have trained a language-independent model! • often, models are like "elect*" → Politics – 22. Mai 2012: Peter Altmaier [class: Politician] wird als Nachfolger von Norbert Röttgen [class: Politician] zum Bundesumweltminister ernannt. – 6 januari 2012: Jonas Sjöstedt [class: Politician] väljs till ny partiledare för Vänsterpartiet efter Lars Ohly [class: Politician].
  • 24. 10/08/13 Heiko Paulheim 24 Application: Classifying Tweets • Joint work with Axel Schulz and Petar Ristoski (SAP Research) • Goal: using Twitter for emergency management • Example tweets: – "fire at #mannheim #university" – "omg two cars on fire #A5 #accident" – "fire at train station still burning" – "my heart is on fire!!!" – "come on baby light my fire" – "boss should fire that stupid moron"
  • 25. 10/08/13 Heiko Paulheim 25 Application: Classifying Tweets • Social media contains data on many incidents – But keyword search is not enough – Detecting small incidents is hard – Manual inspection is too expensive (and slow) • Machine learning could help – Train a model to classify incident/non-incident tweets – Apply the model to detect incident-related tweets • Training data: – Traffic accidents – ~2,000 tweets containing relevant keywords (“car”, “crash”, etc.), hand-labeled (50% related to traffic incidents)
  • 26. 10/08/13 Heiko Paulheim 26 Application: Classifying Tweets • Learning to classify tweets: – Positive and negative examples – Features: • Stemming • POS tagging • Word n-grams • … • Accuracy ~90% • But – Accuracy drops to ~85% when applying the model to a different city
  • 27. 10/08/13 Heiko Paulheim 27 Application: Classifying Tweets • Example set: – “Again crash on I90” – “Accident on I90” • Model: – “I90” → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → not related to traffic accident
  • 28. 10/08/13 Heiko Paulheim 28 Using LOD for Preventing Overfitting • Example set: – “Again crash on I90” – “Accident on I90” • Background knowledge: – dbpedia:Interstate_90 rdf:type dbpedia-owl:Road – dbpedia:Interstate_51 rdf:type dbpedia-owl:Road • Model: – dbpedia-owl:Road → indicates traffic accident • Applying the model: – “Two cars crashed on I51” → indicates traffic accident • Using DBpedia Spotlight + FeGeLOD – Accuracy stays at 90% – Overfitting is avoided
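The generalization effect on this slide can be illustrated with a small Python sketch. The lookup table `ENTITY_TYPES` is a hypothetical stand-in for entity linking plus a type lookup (the slides use DBpedia Spotlight and FeGeLOD for that step), and `tokens_with_types` is an illustrative helper, not an API of either tool.

```python
import re

# Hypothetical stand-in for entity linking and rdf:type lookup.
ENTITY_TYPES = {
    "I90": "dbpedia-owl:Road",
    "I51": "dbpedia-owl:Road",
}


def tokens_with_types(text):
    """Turn a tweet into a feature set, replacing known entities by their type."""
    features = set()
    for token in re.findall(r"\w+", text):
        # Use the type instead of the surface token when one is known.
        features.add(ENTITY_TYPES.get(token, token.lower()))
    return features


train = ["Again crash on I90", "Accident on I90"]
# Features that occur in every positive training example.
shared = set.intersection(*(tokens_with_types(t) for t in train))
unseen = tokens_with_types("Two cars crashed on I51")
```

Because the road name is replaced by its type, the feature shared by the training tweets also appears in the tweet about a road that never occurred in the training data.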
  • 29. 10/08/13 Heiko Paulheim 29 Explaining Statistics • Statistics are very widespread – Quality of living in cities – Corruption by country – Fertility rate by country – Suicide rate by country – Box office revenue of films – ...
  • 30. 10/08/13 Heiko Paulheim 30 Explaining Statistics • Questions we are often interested in – Why does city X have a high/low quality of living? – Why is the corruption higher in country A than in country B? – Will a new film create a high/low box office revenue? • i.e., we are looking for – explanations – forecasts (e.g., extrapolations)
  • 31. 10/08/13 Heiko Paulheim 31 Explaining Statistics http://xkcd.com/605/
  • 32. 10/08/13 Heiko Paulheim 32 Explaining Statistics • What statistics often look like
  • 33. 10/08/13 Heiko Paulheim 33 Explaining Statistics • There are powerful tools for finding correlations etc. – but many statistics cannot be interpreted directly – background knowledge is missing • Approach: – use Linked Open Data for enriching statistical data (e.g., FeGeLOD) – run analysis tools for finding explanations
  • 34. 10/08/13 Heiko Paulheim 34 Prototype Tool: Explain-a-LOD • Loads a statistics file (e.g., CSV) • Adds background knowledge • Runs basic analysis (correlation, rule learning) • Presents explanations
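The correlation step can be sketched in plain Python. The example reuses the three book-sales rows from the motivating example, with population as the feature added from Linked Open Data; the function `pearson` is a textbook Pearson correlation and is not taken from Explain-a-LOD. Three rows are far too few for a meaningful statistic, so this only shows the mechanics.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (city, population added as background knowledge, copies sold)
rows = [
    ("Darmstadt", 144402, 124),
    ("Mannheim", 291458, 493),
    ("Roßdorf", 12019, 14),
]
r = pearson([pop for _, pop, _ in rows], [sold for _, _, sold in rows])
```

A strongly positive `r` would be reported as a candidate explanation of the form "larger cities sell more copies".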
  • 35. 10/08/13 Heiko Paulheim 35 Statistical Data: Examples • Data Set: Mercer Quality of Living – Quality of living in 216 cities worldwide – norm: NYC=100 (value range 23-109) – As of 1999 – http://across.co.nz/qualityofliving.htm • LOD data sets used in the examples: – DBpedia – CIA World Factbook for statistics by country
  • 36. 10/08/13 Heiko Paulheim 36 Statistical Data: Examples • Examples for low quality cities – big hot cities (junHighC >= 27 and areaTotalKm >= 334) – cold cities where no music has ever been recorded (recordedIn_in = false and janHighC <= 16) – latitude <= 24 and longitude <= 47 • a very accurate rule • but what's the interpretation? (Image text: "Next Record Studio 2547 miles")
  • 37. 10/08/13 Heiko Paulheim 37 Statistical Data: Examples
  • 38. 10/08/13 Heiko Paulheim 38 Statistical Data: Examples • Data Set: Transparency International – 177 Countries and a corruption perception indicator (between 1 and 10) – As of 2010 – http://www.transparency.org/cpi2010/results
  • 39. Statistical Data: Examples • Example rules for countries with low corruption – HDI > 78% • Human Development Index, calculated from life expectancy, education level, economic performance – OECD member states – Foundation place of more than nine organizations – More than ten mountains – More than ten companies with their headquarters in that state, but fewer than two cargo airlines
  • 40. Statistical Data: Examples • Data Set: Burnout rates – 16 German DAX companies – Absolute and relative numbers – As of 2011 – http://de.statista.com/statistik/daten/studie/226959/umfrage/burn-out-erkrankungen-unter-mitarbeitern-ausgewaehlter-dax-unternehmen/
  • 41. Evaluation of Feature Quality • Quality of living dataset [Figure: chart with an axis from 1 to 5 comparing the feature types Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), and Joint; series: Correlation, Rule Learning]
  • 42. Evaluation of Feature Quality • Corruption dataset [Figure: chart with an axis from 1 to 5 comparing the feature types Data values, Type, Unqualified relation (boolean), Unqualified relation (numeric), Qualified relation (boolean), Qualified relation (numeric), and Joint; series: Correlation, Rule Learning]
  • 43. Statistical Data: Examples • Findings for burnout rates – Positive correlation between turnover and burnout rates – Car manufacturers are less prone to burnout – German companies are less prone to burnout than international ones • Exception: Frankfurt
  • 44. Statistical Data: Examples • Data Set: Antidepressant consumption – In European countries – Source: OECD – http://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance-2011/pharmaceutical-consumption_health_glance-2011-39-en
  • 45. Statistical Data: Examples • Findings for antidepressant consumption – Larger countries have higher consumption – Low HDI → high consumption – By geography: • Nordic countries, countries on the Atlantic: high • Mediterranean: medium • Alpine countries: low – High average age → high consumption – High birth rates → high consumption
  • 46. Statistical Data: Examples • Data Set: Suicide rates – By country – OECD states – As of 2005 – http://www.washingtonpost.com/wp-srv/world/suiciderate.html
  • 47. Statistical Data: Examples • Findings for suicide rates – Democracies have lower suicide rates than other forms of government – High HDI → low suicide rate – High population density → high suicide rate – By geography: • At the sea → low • In the mountains → high – High Gini index → low suicide rate • High Gini index ↔ unequal distribution of wealth – High usage of nuclear power → high suicide rates
  • 48. Statistical Data: Examples • Data set: sexual activity – Percentage of people having sex weekly – By country – Survey by Durex 2005-2009 – http://chartsbin.com/view/uya
  • 49. Statistical Data: Examples • Findings on sexual activity – By geography: • High in Europe, low in Asia • Low in island states – By language: • English speaking: low • French speaking: high – Low average age → high activity – High GDP per capita → low activity – High unemployment rate → high activity – High number of Internet service providers → low activity
  • 50. Try it... but be careful! • Download from http://www.ke.tu-darmstadt.de/resources/explain-a-lod • including a demo video, papers, etc. http://xkcd.com/552/
  • 51. RapidMiner Linked Open Data Extension • August 16th, 2013: FeGeLOD celebrates its 2nd birthday • Problems – still no nice UI – special configurations are tricky – difficult to enhance • Decision – Reimplementation on RapidMiner platform – September 13th, 2013: Release of RapidMiner Linked Open Data Extension – Available from RapidMiner marketplace • http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
  • 52. RapidMiner Linked Open Data Extension • Simple wiring of operators – linkers – generators • Combination with powerful RapidMiner operators
  • 53. RapidMiner Linked Open Data Extension • Easy SPARQL endpoint definitions • Support of custom SPARQL statements
  • 54. Challenges and Future Work • SPARQL variants – Some endpoints support special/non-standard SPARQL constructs – COUNT(...) – transitive closure – exploit where applicable • Implementations without SPARQL – Freebase – OpenCyc
  • 55. Challenges and Future Work • Linking is still challenging – URI patterns are not flexible – Search by label is time-consuming – Services like DBpedia Lookup are scarce • Limitations of completely unsupervised linking – e.g., Hurricanes – how to use headlines/attribute names?
  • 56. Challenges and Future Work • Linking as optimization problem – find candidates for all entities, e.g., by DBpedia lookup – find a selection of candidates that are most similar to each other • e.g., all of them are U.S. cities – some experiments with types and categories • problem: not complete – some problems cannot be addressed (e.g.: Hurricanes) • Alternatives: – semi-supervised linking – user provides some example links – active learning
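The idea of picking mutually similar candidates can be sketched with a simple greedy heuristic: find the type shared by candidates of the most entities, then prefer candidates of that type. This is one possible reading of the slide, not the method used in the experiments. The candidate lists, identifiers, and type names below are hand-made stand-ins for lookup results, and the sketch assumes every entity has at least one candidate.

```python
from collections import Counter

# Hypothetical lookup results: entity label -> ranked list of (candidate id, types).
CANDIDATES = {
    "Paris": [("Paris", {"FrenchCity"}), ("Paris_Texas", {"USCity"})],
    "Dallas": [("Dallas", {"USCity"})],
    "Houston": [("Houston", {"USCity"}), ("Houston_Renfrewshire", {"ScottishVillage"})],
}


def link_jointly(candidates):
    """Choose one candidate per entity so that the choices share a common type.

    Greedy heuristic: the "best" type is the one that at least one candidate
    of the largest number of entities carries.
    """
    coverage = Counter()
    for entity_candidates in candidates.values():
        types_for_entity = set()
        for _, types in entity_candidates:
            types_for_entity |= types
        # Count each type once per entity, not once per candidate.
        coverage.update(types_for_entity)
    best_type, _ = coverage.most_common(1)[0]

    links = {}
    for entity, entity_candidates in candidates.items():
        matching = [cid for cid, types in entity_candidates if best_type in types]
        # Fall back to the top-ranked candidate if none carries the best type.
        links[entity] = matching[0] if matching else entity_candidates[0][0]
    return best_type, links


best_type, links = link_jointly(CANDIDATES)
print(best_type, links)
```

As the slide notes, type information is incomplete, so a heuristic like this fails whenever the correct candidate lacks the shared type.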
  • 57. Challenges and Future Work • Exploiting semantics for feature selection • Given two features: – f1: type(RoadsInAlaska) – f2: type(Road) • and the schema definition RoadsInAlaska rdfs:subClassOf Road • Exploit that information for feature selection – e.g., gain(f1) ≈ gain(f2), f1 < f2 (f1 is more specific) → remove f1
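A minimal sketch of this pruning rule, assuming RoadsInAlaska is the subclass and Road the superclass: compute the information gain of each boolean type feature and drop the more specific one when both gains are approximately equal. The toy feature columns, the labels, the extra `type(Place)` feature, and the `tolerance` threshold are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(feature, labels):
    """Information gain of a discrete feature column with respect to the labels."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [label for f, label in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def prune_by_hierarchy(features, labels, subclass_of, tolerance=0.05):
    """Remove the more specific feature when its gain matches its superclass's.

    subclass_of is a list of (sub_feature, super_feature) name pairs.
    """
    gains = {name: info_gain(column, labels) for name, column in features.items()}
    removed = set()
    for sub, sup in subclass_of:
        if sub in gains and sup in gains and abs(gains[sub] - gains[sup]) <= tolerance:
            removed.add(sub)
    return {name: col for name, col in features.items() if name not in removed}


# Toy data: four instances, two classes.
LABELS = ["high", "high", "low", "low"]
FEATURES = {
    "type(RoadsInAlaska)": [True, True, False, False],
    "type(Road)": [True, True, False, False],
    "type(Place)": [True, True, True, True],
}
HIERARCHY = [("type(RoadsInAlaska)", "type(Road)"), ("type(Road)", "type(Place)")]

kept = prune_by_hierarchy(FEATURES, LABELS, HIERARCHY)
print(sorted(kept))
```

Here `type(RoadsInAlaska)` is dropped because it adds nothing beyond `type(Road)`, while `type(Road)` is kept because its gain differs clearly from that of `type(Place)`.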
  • 58. Challenges and Future Work • Incompleteness of LOD – e.g., type information in DBpedia – may lead to findings such as • if a city is of type Place, the quality of living is high – possible remedy: autocomplete on the dataset (e.g., Paulheim/Bizer 2013) • Biases in LOD – e.g., DBpedia has a bias towards western culture – may lead to findings such as • if many records have been made in a city, the quality of living is high
  • 59. Challenges and Future Work • Features not used for scalability reasons: – features for single entities • e.g., “Roman Polanski directorOf X” – features more than one hop away • e.g., “Cities with a university which has a computer science department” – some are covered by YAGO types, e.g., “AustralianBandsFoundedIn1990” • but subject to YAGO's selection bias • Approaches are required to use such features – which respect scalability – “generate first, filter later” is not the best solution • e.g., “Cities with at least one of ArtSchoolsInParis” – on-the-fly filtering may be more suitable • e.g., sampling
  • 60. Challenges and Future Work • Automatically exploit data sources with non-simple structures – real example from CORDIS dataset: EU18931 a Funding . EU18931 has-grant-value [ has-amount 1300000 ; has-unit-of-measure EUR ] . • Support geo/temporal features – e.g., Data Cubes – e.g., Linked Geo Data • Construct complex features (in a scalable way!) – e.g., cinemas per inhabitant
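One way to turn such nested structures into a flat feature table is to follow each nested node and join the property names into a path. The sketch below assumes the grant description has already been parsed into nested dictionaries; the dotted path naming is an arbitrary choice and not part of any existing tool.

```python
def flatten(node, prefix=""):
    """Flatten nested property/value structures into path-named features."""
    flat = {}
    for prop, value in node.items():
        path = f"{prefix}.{prop}" if prefix else prop
        if isinstance(value, dict):
            # A nested dict plays the role of an intermediate (blank) node.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# The funding example from the slide, written as nested dictionaries.
EU18931 = {
    "a": "Funding",
    "has-grant-value": {"has-amount": 1300000, "has-unit-of-measure": "EUR"},
}

features = flatten(EU18931)
print(features)
```

After flattening, `has-grant-value.has-amount` behaves like any other numeric feature, although the unit still has to be checked before values from different entities are compared.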
  • 61. Wrap-up • Linked Data is useful as background knowledge – especially on problems which have little knowledge in themselves • Unsupervised methods – avoid biases and work without knowledge about LOD – but: scalability and generality problems • RapidMiner LOD extension – a constantly growing toolkit
  • 62. Credits & Thanks • Past contributors of FeGeLOD: – Johannes Fürnkranz – Raad Bahmani – Alexander Gabriel – Simon Holthausen • Current team of RapidMiner Linked Open Data Extension: – Chris Bizer – Petar Ristoski – Evgeny Mitichkin
  • 63. Exploiting Linked Open Data as Background Knowledge in Data Mining Heiko Paulheim, University of Mannheim