06/02/14 Dominik Wienand, Heiko Paulheim 1
Detecting Incorrect Numerical Data in DBpedia
Dominik Wienand, Heiko Paulheim
Motivation
• Challenge
– Wikipedia is made for humans, not machines
– Input format in Wikipedia is not constrained
• The following are all valid representations of the same height value
(and perfectly understandable by humans)
– 6 ft 6 in, 6ft 6in, 6'6'', 6'6”, 6´6´´, …
– 1.98m, 1,98m, 1m 98, 1m 98cm, 198cm, 198 cm, …
– 6 ft 6 in (198 cm), 6ft 6in (1.98m), 6'6'' (1.98 m), …
– 6 ft 6 in[1], 6 ft 6 in [citation needed], …
– ...
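To illustrate why these variants are hard to cover, here is a minimal Python sketch of a parser that handles only the spellings listed on this slide. The function name `parse_height_m` and all patterns are illustrative assumptions; this is not the DBpedia extraction framework code, and any spelling not listed here will simply fail to parse.

```python
import re

FT_TO_M = 0.3048   # metres per foot
IN_TO_M = 0.0254   # metres per inch

# Assumed patterns, covering only the variants shown on the slide.
# Feet and inches: "6 ft 6 in", "6ft 6in", "6'6''", "6'6”", "6´6´´"
_IMPERIAL = re.compile(r"^\s*(\d+)\s*(?:ft|'|´)\s*(\d+)\s*(?:in|''|”|´´)")
# Decimal metres with dot or comma: "1.98m", "1,98m", "1.98 m"
_METRES = re.compile(r"^\s*(\d+)[.,](\d+)\s*m\b")
# Metres plus centimetres: "1m 98", "1m 98cm"
_METRES_CM = re.compile(r"^\s*(\d+)\s*m\s*(\d+)\s*(?:cm)?")
# Centimetres only: "198cm", "198 cm"
_CM = re.compile(r"^\s*(\d+)\s*cm\b")


def parse_height_m(text):
    """Return the height in metres, or None if no known pattern matches.

    Only the leading value is read, so trailing material such as
    "(198 cm)", "[1]" or "[citation needed]" is ignored.
    """
    m = _IMPERIAL.match(text)
    if m:
        return int(m.group(1)) * FT_TO_M + int(m.group(2)) * IN_TO_M
    m = _METRES.match(text)
    if m:
        return float(m.group(1) + "." + m.group(2))
    m = _METRES_CM.match(text)
    if m:
        return int(m.group(1)) + int(m.group(2)) / 100
    m = _CM.match(text)
    if m:
        return int(m.group(1)) / 100
    return None


variants = ["6 ft 6 in", "6'6''", "1,98m", "1m 98cm", "198 cm",
            "6 ft 6 in (198 cm)", "6 ft 6 in [citation needed]"]
for v in variants:
    print(v, "->", parse_height_m(v))
```

Each new spelling needs another pattern, which is the maintenance problem the slide describes.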
Motivation
• Challenge
– We're (hopefully) slowly stepping out of the research labs
• e.g., applications in Emergency Management, Finance, …
– so we need reliable information
• i.e., DBpedia has to be able to deal with all of those variants
• But
– it is hard to cover each and every case
– if the case is rare, we may not even know it
• Idea
– A posteriori plausibility checking
– Find values that are likely to be wrongly extracted
Idea
• Use outlier detection to find unlikely values
– e.g., extremely large or small values
• Outlier Detection
– “An outlying observation, or outlier, is one that appears to deviate
markedly from other members of the sample in which it occurs.”
(Grubbs, 1969)
– Outliers are not necessarily wrong!
Approach
• Basic approach
– use all values of a numerical property (e.g., height) as a population
– find outliers in that population
• Outlier detection approaches used
– Median Absolute Deviation (Dispersion)
– Interquartile Range
– Kernel Density Estimation
– Kernel Density Estimation iterative
• i.e., remove found outliers and repeat
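The iterative KDE variant ("remove found outliers and repeat") could look roughly like the following standard-library sketch. The flagging rule (density below a fraction of the peak density), the rule-of-thumb bandwidth, and the names `kde_outliers` and `iterative_kde_outliers` are assumptions made for illustration; the slides do not state which criterion or parameters were used.

```python
import math
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at x."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def kde_outliers(sample, rel_threshold=0.15):
    """One pass: flag values whose density is below rel_threshold * peak."""
    sd = statistics.stdev(sample)
    if sd == 0:
        return []
    # Assumed rule-of-thumb bandwidth (Silverman-style).
    bandwidth = 1.06 * sd * len(sample) ** (-1 / 5)
    density = gaussian_kde(sample, bandwidth)
    dens = [density(x) for x in sample]
    cutoff = rel_threshold * max(dens)
    return [x for x, d in zip(sample, dens) if d < cutoff]


def iterative_kde_outliers(values, rel_threshold=0.15, max_rounds=10):
    """Repeat the KDE pass on the remaining values until nothing is flagged."""
    remaining = list(values)
    found = []
    for _ in range(max_rounds):
        if len(remaining) < 3:
            break
        flagged = kde_outliers(remaining, rel_threshold)
        if not flagged:
            break
        found.extend(flagged)
        drop = set(flagged)
        remaining = [v for v in remaining if v not in drop]
    return found, remaining


heights = [1.70, 1.75, 1.80, 1.82, 1.78, 1.76, 1.85, 1.79, 1.81, 1.77,
           3.5, 19.8]
print(kde_outliers(heights))            # single pass
print(iterative_kde_outliers(heights))  # repeated passes
```

In this toy sample the extreme value 19.8 inflates the bandwidth so much that 3.5 is masked in the first pass; it is only flagged once 19.8 has been removed, which is the motivation for iterating.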
Median Absolute Deviation (MAD)
• MAD is the median of the absolute deviations from the sample median, i.e.
$\mathrm{MAD} := \operatorname{median}_i \left| X_i - \operatorname{median}_j(X_j) \right|$
• MAD can be used for outlier detection
– all values that are more than k*MAD away from the median
are considered to be outliers
– e.g., k=3
[Image: Carl Friedrich Gauss]
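The MAD rule on this slide can be sketched in a few lines of Python. The function names and the sample heights are made up for illustration; k=3 is the example value from the slide.

```python
import statistics


def mad(values):
    """Median of the absolute deviations from the sample median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])


def mad_outliers(values, k=3.0):
    """Values that lie more than k * MAD away from the median."""
    med = statistics.median(values)
    spread = mad(values)
    return [x for x in values if abs(x - med) > k * spread]


# Invented sample: plausible heights in metres plus one value that is
# off by a factor of ten.
heights = [1.70, 1.75, 1.78, 1.80, 1.82, 1.85, 19.8]
print(mad_outliers(heights))  # [19.8]
```

Because both the centre and the spread are medians, the single extreme value does not distort the threshold, unlike a mean and standard deviation.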
Approach
• Observation
– some properties are used on a variety of different things
• Example: dbpedia-owl:height
– persons, vehicles
• Example: dbpedia-owl:population
– villages, cities, countries, continents
• Finding outliers in those mixed sets might be hard
– refined approach: preprocess data
– divide into subpopulations
Approach
• Preprocessing A: single type
– split by single type
– one data population per type (in the DBpedia ontology)
– only the most specific type is used
• Preprocessing B: cluster by type vectors
– each instance represented by vector of types
– cluster instances with similar type vectors
– EM algorithm (Weka)
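Preprocessing A (one population per most specific type) might be sketched as follows. The small class hierarchy, the depth-based notion of "most specific", and the function names are illustrative assumptions; a real run would read the hierarchy from the DBpedia ontology. Preprocessing B (EM clustering in Weka) is not shown.

```python
from collections import defaultdict

# Hypothetical excerpt of a class hierarchy, given as child -> parent.
PARENT = {
    "Person": "Thing",
    "Athlete": "Person",
    "BasketballPlayer": "Athlete",
    "MeanOfTransportation": "Thing",
    "Automobile": "MeanOfTransportation",
}


def depth(cls):
    """Number of steps from cls up to the root of the hierarchy."""
    steps = 0
    while cls in PARENT:
        cls = PARENT[cls]
        steps += 1
    return steps


def most_specific(types):
    """Assumption: the deepest class in the hierarchy is the most specific."""
    return max(types, key=depth)


def split_by_type(instances):
    """Group values by most specific type.

    instances: iterable of (list_of_types, value) pairs.
    Returns a dict mapping type name -> list of values.
    """
    groups = defaultdict(list)
    for types, value in instances:
        groups[most_specific(types)].append(value)
    return dict(groups)


instances = [
    (["Person", "Athlete", "BasketballPlayer"], 1.98),
    (["Person"], 1.75),
    (["MeanOfTransportation", "Automobile"], 1.45),
]
print(split_by_type(instances))
```

Outlier detection is then run separately on each list of values, so that, for example, vehicle heights are not judged against person heights.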
Evaluation
• Two-fold evaluation
– Pre-study on three attributes (height, population, elevation)
– Most promising approaches tested on random sample from DBpedia
• Evaluation strategy
– a posteriori evaluation
– each identified outlier is checked
– bulk checking possible due to obvious clusters/patterns in outliers
Evaluation: Pre-Study
• Sample sizes:
– height: 52,522
– population: 237,700
– elevation: 206,977
• Different distributions
– e.g., height: approximately normal distribution
– e.g., population: power law distribution
Evaluation: Pre-Study
• Grouping and clustering improve precision
• IQR and iterative KDE perform best
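For reference, interquartile-range detection can be sketched with the standard library as below. The fence multiplier k=1.5 is the conventional choice, not a value taken from the slides, and the sample heights are invented.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Invented sample: heights in metres with one implausible value.
heights = [1.70, 1.75, 1.78, 1.80, 1.82, 1.85, 19.8]
print(iqr_outliers(heights))  # [19.8]
```

The rule needs only two quantiles per population, which fits the later observation that IQR offers a good trade-off between runtime and quality.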
Evaluation: Random Sample
• Findings from Pre-Study:
– IQR and KDE iterative work well
– clustering is too slow (>24h)
– KDE FFT has poor precision
• Building a random sample
– select 50 random resources
– get all their datatype properties
– retrieve all triples that use those properties
– remove those properties for which fewer than 50% of the objects are numbers, or which have fewer than 100 numeric objects
• Resulting sample:
– 12,054,727 triples
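The property filter in the last sampling step could be sketched like this, assuming triples are available in memory as (subject, property, object) tuples of strings. The function names and the reading of the two thresholds (share of numeric objects, count of numeric objects) are assumptions.

```python
from collections import defaultdict


def is_number(obj):
    """True if the object literal parses as a number."""
    try:
        float(obj)
        return True
    except (TypeError, ValueError):
        return False


def numeric_properties(triples, min_share=0.5, min_count=100):
    """Keep properties with enough numeric objects.

    A property is kept if at least min_count of its objects are numbers
    and numbers make up at least min_share of all its objects.
    """
    total = defaultdict(int)
    numeric = defaultdict(int)
    for _subject, prop, obj in triples:
        total[prop] += 1
        if is_number(obj):
            numeric[prop] += 1
    return {
        p for p in total
        if numeric[p] >= min_count and numeric[p] / total[p] >= min_share
    }
```

For example, a property with 150 numeric and 10 textual objects is kept, while one with only 60 numeric objects is dropped regardless of its share.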
Evaluation: Random Sample
• Tried the best-performing configuration from the pre-study
– further explored parameter settings from there
– IQR provides best trade-off between runtime and quality
• Best configuration for IQR:
– 1,703 values marked as outliers
– manually checked
• shortcuts, e.g., all three-digit ZIP codes for places in the US
– Precision of outlier detection: 81%
• i.e., we identify roughly 1 wrong value in 1,000
Systematic Errors Found
• The footprint of an error in the extraction framework code
– here: imperial measures after 5' are truncated
→ observation: suspiciously many height values of 1.524m (=5')
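One way to surface such a footprint is to count how many height values coincide exactly with a whole number of feet. The sketch below is only an illustration: the function name, the tolerance, and the sample data are assumptions, not part of the original study.

```python
FT_TO_M = 0.3048  # metres per foot, so 5 ft = 1.524 m


def whole_foot_counts(heights_m, max_feet=8, tol=0.0005):
    """Count values that equal a whole number of feet (within tol metres).

    Returns a dict mapping feet -> number of matching values.
    """
    counts = {}
    for feet in range(1, max_feet + 1):
        target = feet * FT_TO_M
        hits = sum(1 for h in heights_m if abs(h - target) <= tol)
        if hits:
            counts[feet] = hits
    return counts


# Invented sample: many values stuck at exactly 5 ft, a few ordinary ones.
heights = [1.524] * 40 + [1.70, 1.75, 1.78, 1.80, 1.8288, 1.85, 1.91]
print(whole_foot_counts(heights))  # {5: 40, 6: 1}
```

A count for one whole-foot value that dwarfs its neighbours suggests that the inches were lost during extraction rather than that many people are exactly that tall.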
Systematic Errors Found
• Imperial conversion is the most severe problem
– causes almost 90% of all outliers in our sample
Systematic Errors Found
• Additional numbers cause problems
• Example: village of Semaphore
– population: 28,322,006
(all of Australia: 23,379,555!)
– a clear outlier among villages
Limitations
• Telling natural outliers from errors
– hard without additional evidence
• e.g., an adult person 58 cm tall
• e.g., a vehicle 7.4 m high
Beyond DBpedia
• So far, this work has been carried out only on DBpedia
• But it can be transferred to any LOD dataset
• Particularly useful for crowdsourced/heuristic approaches
– Information extraction from text (e.g., NELL, ReVerb)
– Automatic fact completion
– Datasets heuristically integrated from diverse sources
Ongoing Work
• Handling natural outliers
– cross checking with other sources
– for DBpedia: other language editions
• Preprocessing techniques
– e.g., dynamically building a tree of meaningful subpopulations
• Pinpointing errors
– text pattern induction on outliers found
– e.g., [0-9.,]* ([0-9]{4}) (years in parentheses cause problems)
– could also help identify natural outliers
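The "year in parentheses" pattern can be tried out with Python's `re` module. Note that in regex syntax the parentheses must be escaped to match literally, and this sketch uses `+` instead of `*` so that the value part is non-empty; both changes, the function name, and the input strings are assumptions made for illustration.

```python
import re

# A number, a space, then a four-digit year in literal parentheses.
YEAR_IN_PARENS = re.compile(r"([0-9.,]+) \(([0-9]{4})\)")


def split_value_and_year(raw):
    """Return (value, year) if raw contains 'number (YYYY)', else None."""
    m = YEAR_IN_PARENS.search(raw)
    if m is None:
        return None
    return m.group(1), m.group(2)


# Invented infobox-style strings.
print(split_value_and_year("28,322 (2006)"))  # ('28,322', '2006')
print(split_value_and_year("1.98 m"))         # None
```

Separating the trailing year from the value keeps the two numbers from being merged into one implausibly large figure.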
Questions?
http://xkcd.com/539/
