Moon Landing or Safari?
A Study of Systematic Errors and their Causes in Geographic Linked Data
Krzysztof Janowicz¹, Yingjie Hu¹, Grant McKenzie², Song Gao¹, Blake Regalia¹,
Gengchen Mai¹, Rui Zhu¹, Benjamin Adams³, and Kerry Taylor⁴
2016/10/01
¹ STKO Lab, University of California, Santa Barbara, USA
² Department of Geographical Sciences, University of Maryland, USA
³ Centre for eResearch, The University of Auckland, New Zealand
⁴ Australian National University, Australia
blake.regalia@gmail.com
Linked Data
Linked Data represents data as collections of intra- and inter-linked graphs.
The nodes and edges of the graphs are identified by Internationalized Resource
Identifiers (IRIs). It is built upon the Resource Description Framework (RDF),
which enables Web documents and services to share structured data about anything.
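As a rough illustration of the graph idea above, the sketch below stores statements as (subject, predicate, object) tuples and follows an edge. The `http://example.org/` IRIs and the `objects` helper are made up for this example and are not part of any vocabulary mentioned in the slides.

```python
# Hypothetical IRIs for illustration only; none of these come from the slides.
EX = "http://example.org/"

# Each statement is a (subject, predicate, object) triple.
triples = {
    (EX + "SantaBarbara", EX + "locatedIn", EX + "California"),
    (EX + "California", EX + "locatedIn", EX + "USA"),
    (EX + "SantaBarbara", EX + "label", "Santa Barbara"),  # plain literal object
}

def objects(graph, subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}
```

Because both nodes and edges are global identifiers, triples published by different sources can be merged into one set and queried together.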
Linked Data Significance
Linked Data is already in very wide use; it powers many ‘smart’ query services.
It is revolutionizing data publishing and retrieval.
Linked Data Significance
The Linked Data cloud grows every year, but it suffers from: data quality
issues, limited availability, and lack of data persistence. Data quality and
maintenance are known to be the most difficult issues facing data publishers.
Geographic Linked Data
Geographic data is one of the primary nexuses for structured data on the
world-wide web.
Data Scientists
As Geographic Information Scientists, it is our responsibility to:
• assess the quality of structured geo-data on the web
• discover systematic errors
• identify their root causes
• and publish our recommendations for best practices
Our motivation is only to improve data quality, not to criticize others for falling
victim to these errors.
Most of these errors are common. They tend to arise from easily overlooked
qualities of geographic information.
Errors
We have broken down systematic errors into the following categories:
1. Triplification and Extraction
2. Improper use of ontologies / Limited understanding of domain
3. Designing new ontologies / Oversimplified conceptual models
4. Data accuracy / Lack of ‘uncertainty’ framework
Triplification
(1) “Triplification” typically refers to the transformation of flat data into RDF.
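A minimal sketch of triplification, assuming a CSV table as the flat source. The namespace, the column names, and the `triplify` function are hypothetical and only meant to show the row-to-triples step.

```python
import csv
import io

# Hypothetical namespace and column names, chosen only for illustration.
EX = "http://example.org/"
FLAT_DATA = "id,name,lat,long\n1,Santa Barbara,34.4258,-119.7142\n"

def triplify(csv_text):
    """Turn each CSV row into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = EX + "place/" + row["id"]
        for column, value in row.items():
            if column != "id":
                triples.append((subject, EX + column, value))
    return triples
```

Note that every value is copied through as a string here; any parsing of coordinates happens in a later step, which is where the errors discussed next tend to creep in.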
Natural Language Processing
(2) Extraction of semantically rich data from semi-structured or unstructured
sources using natural language processing and machine learning; e.g., DBpedia, FRED¹.
Example sentence:
Anakin Skywalker was a male human born on Tatooine who became a Jedi
Knight, and later served the Galactic Empire as Darth Vader.
¹ http://wit.istc.cnr.it/stlab-tools/fred/demo
Triplification Errors
(1) and (2) are both liable to the same types of errors, which can occur during
the extraction and conversion of the source data from its original format.
Time to look for errors! How does one begin searching for systematic
errors in worldwide geographic data? By using a map!
World Map Image
(Regarding the previous slide):
There is no base map in this image, yet it is clearly recognizable as a map of
the world. The Linked Data cloud has high spatial coverage!
The large “X” in the center of the map can be blamed on a parsing error. This
can happen when a single coordinate value is used for both the latitude and
the longitude, which places the point at (X, X) or (Y, Y).
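One way to surface candidates for this error is to flag points whose latitude and longitude are identical. This is only a sketch: the function name, the tolerance, and the sample points are assumptions, and it is not the method used in the study.

```python
def duplicated_axis(points, tolerance=1e-9):
    """Return points whose latitude and longitude are (almost) identical.

    A few real places do sit on the lat == long diagonal, so hits are
    candidates for review, not proven errors.
    """
    return [(lat, lon) for lat, lon in points if abs(lat - lon) <= tolerance]

# Made-up sample coordinates (lat, long); not taken from the slides.
sample = [(34.4258, -119.7142), (48.2082, 48.2082), (-36.8485, 174.7633)]
suspects = duplicated_axis(sample)
```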
World Map Image
(Regarding the previous slide):
Notice the grid-like structure in Russia, where points snap to regularly spaced
positions. This is the result of decimal truncation, a process that forces a
floating-point value into an integer.
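The snapping effect can be reproduced, and detected, with a few lines. The helper names and the sample point below are made up for illustration; a real check would also need to allow for places that genuinely sit on whole-degree coordinates.

```python
def looks_truncated(lat, lon):
    """True when both coordinates are whole numbers of degrees."""
    return float(lat).is_integer() and float(lon).is_integer()

def truncated_share(points):
    """Fraction of points that sit exactly on the one-degree grid."""
    if not points:
        return 0.0
    return sum(looks_truncated(lat, lon) for lat, lon in points) / len(points)

# Made-up sample: int() drops the fractional part, as truncation does.
original = (61.5240, 105.3188)
snapped = (int(original[0]), int(original[1]))
```

A dataset in which an unusually large share of points lands on the grid is a hint that truncation happened somewhere in the pipeline.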
Lastly, see those ghostly images of land masses where there shouldn’t be any?
These are reflections of New Zealand and Australia mirrored about the
Equator; we also found evidence of horizontal mirroring. There are two
explanations: (1) negative signs that are dropped or wrongly added;
(2) improper parsing of quadrant identifiers; e.g., “Oeste” (Spanish for West)
starts with an ‘O’, but the parser throws this out and the longitude gets
flipped onto the other side of the globe.
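A hedged sketch of a parser that handles hemisphere letters explicitly instead of discarding them. The `SIGN` table and `parse_degrees` are assumptions for this example; in particular, mapping ‘O’ to west is only right for sources written in languages such as Spanish.

```python
import re

# Assumed letter-to-sign table. "O" is treated as west (Spanish "Oeste");
# in German "O" abbreviates "Ost" (east), so the source language matters.
SIGN = {"N": 1, "S": -1, "E": 1, "W": -1, "O": -1}

def parse_degrees(text):
    """Parse strings such as '70.5 O' or '-36.85' into signed decimal degrees."""
    match = re.fullmatch(r"\s*([+-]?\d+(?:\.\d+)?)\s*°?\s*([A-Za-z]?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised coordinate: {text!r}")
    value, letter = float(match.group(1)), match.group(2).upper()
    if not letter:
        return value
    if letter not in SIGN:
        # Refuse to guess: silently dropping the letter is what mirrors points.
        raise ValueError(f"unknown hemisphere letter: {letter!r}")
    return abs(value) * SIGN[letter]
```

Raising on an unknown letter is a deliberate choice in this sketch: a loud failure during triplification is easier to fix than a mirrored continent discovered later.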
Problem Essentials
Lessons Learned:
If triplification software does not account for the full range of input
variations, unexpected geometries may occur.
Coordinate discrepancy rectangularization²
² http://dbpedia.org/page/Solar_Star
Ontology Use & Domain Errors
Ontology Fertility
Apparently, the Moon Landing took place in Algeria. So
what’s the deal? Was it a Moon Landing or a Safari?
dbr:Tranquility_Base geo:lat 0.713889 ; geo:long 23.7078 .
The W3C Basic Geo spec declares WGS84 as the coordinate reference system, but this
is not enforced through axiomatization, so nothing prevents geo:lat and geo:long
from being used to represent locations on any celestial body, not just Earth:
the Moon, Mars, Tatooine, etc.
The oversimplification of vocabularies or schemas (for making publication easier)
can lead to the incorrect usage of an ontology.
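One possible guard, sketched under the assumption that the publisher records which body a coordinate pair refers to. The `Position` class and its `body` field are hypothetical; the W3C Basic Geo vocabulary itself has no such property.

```python
from dataclasses import dataclass

@dataclass
class Position:
    """A coordinate pair that names the body it refers to (hypothetical model)."""
    lat: float
    lon: float
    body: str = "Earth"  # assumption: the publisher records this explicitly

def as_wgs84(position):
    """Return (lat, long) only when treating the pair as WGS84 makes sense."""
    if position.body != "Earth":
        raise ValueError(f"{position.body} coordinates are not WGS84")
    if not (-90 <= position.lat <= 90 and -180 <= position.lon <= 180):
        raise ValueError("coordinates out of range")
    return (position.lat, position.lon)

tranquility_base = Position(lat=0.713889, lon=23.7078, body="Moon")
```

With a check like this, the lunar coordinates are rejected before they can be plotted on a map of Earth.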
Domain Error
Let’s perform a simple, typical, spatial query using Linked Data:
How many people live around the Gulf of Guinea?
Population = 7.6 billion
According to our query results, the Gulf of Guinea has the highest population density
in the world... How can this be? Well, because we didn’t expect planet Earth to be
located in its own reference system! Earth has a population value, so it gets counted
in our results as if it were just another populated place.
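The effect can be imitated with a toy bounding-box query. Everything below is made up for illustration: the records, the type names, the bounding box, and the position assigned to Earth are assumptions, not values from DBpedia.

```python
# Made-up records; names, types, coordinates and populations are illustrative.
records = [
    {"name": "Earth", "type": "Planet", "lat": 0.0, "lon": 0.0,
     "population": 7_600_000_000},
    {"name": "Town A", "type": "PopulatedPlace", "lat": 4.0, "lon": 7.0,
     "population": 1_000_000},
    {"name": "Town B", "type": "PopulatedPlace", "lat": 52.5, "lon": 13.4,
     "population": 3_500_000},
]

# Assumed bounding box (south, west, north, east) standing in for the query region.
REGION = (-5.0, -5.0, 7.0, 10.0)

def population_in(region, rows, allowed_types=None):
    """Sum the population of rows inside the box, optionally filtered by type."""
    south, west, north, east = region
    total = 0
    for row in rows:
        inside = south <= row["lat"] <= north and west <= row["lon"] <= east
        if inside and (allowed_types is None or row["type"] in allowed_types):
            total += row["population"]
    return total
```

Calling `population_in(REGION, records)` counts the planet along with the town, while passing `allowed_types={"PopulatedPlace"}` restricts the sum to the kind of feature the question was actually about.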
Data Quality via Ontology Tradeoff
Lessons Learned:
It is critical for data publishers to fully understand an ontology’s intended uses
when selecting one to construct their Linked Data.
Lifting data is not trivial; it needs to involve both domain experts and
experienced Linked Data developers.
All spatial data should have a CRS, but this imposes another hurdle-to-entry
for data publishers. Too little restriction threatens data quality; too much
deters data publishers.
Discrepancies among data sources and a lack of provenance information are toxic
to researchers, who cannot ascertain the data’s reliability.
Modeling Errors
Modeling Errors
DBpedia shows 1.8k 0-degree persons, 371k 1-degree persons, and 31k
2-degree persons. A 0-degree person has spatial coordinates attached directly
to the person resource. Higher-degree persons may result from a lack of
information about their birth or death place, or may be fictitious characters
typed as Person. 0-degree persons indicate modeling errors.
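A rough sketch of how 0-degree persons could be picked out, assuming that “degree” counts the hops from a person to the nearest geometry. That reading, the record layout, and the field names are assumptions for this example, not DBpedia properties.

```python
# Made-up records; the field names are hypothetical, not DBpedia properties.
people = [
    {"name": "Person A", "geometry": (53.0, -119.2), "birth_place": None},
    {"name": "Person B", "geometry": None,
     "birth_place": {"name": "Place X", "geometry": (49.9, -97.1)}},
    {"name": "Person C", "geometry": None, "birth_place": None},
]

def degree(person):
    """Number of hops from a person to the nearest geometry, or None if absent."""
    if person["geometry"] is not None:
        return 0  # coordinates attached directly to the person
    place = person["birth_place"]
    if place is not None and place["geometry"] is not None:
        return 1  # coordinates reached through a place
    return None

zero_degree = [p["name"] for p in people if degree(p) == 0]
```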
Terry Fox
“Terry Fox” is one of those 0-degree persons: his resource includes spatial
coordinates. It looks like the person Terry Fox was accidentally matched to
the statue of Terry Fox.
Terry Fox
Plotting the coordinates on a map reveals a place called “Mt. Terry Fox
Provincial Park”. This clearly demonstrates the consequences of a modeling
error.
Data Accuracy and an Uncertainty Framework
Accuracy
There are 136,964 combinations of geometries³ among places with cardinal
direction relations on DBpedia. According to our analysis, using 8 equal
divisions (π/4 each) of the compass rose, nearly 1/3 of these relations are
inaccurate.
³ Formatted in Well-Known Text: geographic coordinates
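A minimal sketch of such a check, assuming each place is reduced to a single point and the direction is taken from the initial great-circle bearing. The function names and the choice of bearing formula are assumptions; the slides do not say how the analysis computed directions.

```python
import math

SECTORS = ["north", "northeast", "east", "southeast",
           "south", "southwest", "west", "northwest"]

def initial_bearing(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360.0

def sector(bearing):
    """Map a bearing to one of 8 equal 45-degree (pi/4) sectors centred on the compass points."""
    return SECTORS[int(((bearing + 22.5) % 360.0) // 45.0)]

def relation_holds(claimed, origin, target):
    """Check a claimed direction of `target` as seen from `origin`."""
    return sector(initial_bearing(*origin, *target)) == claimed
```

For example, `relation_holds("north", (0.0, 0.0), (1.0, 0.0))` is true, while a claim of "south" for the same pair would be counted as inaccurate.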
Accuracy
Part of the blame for inaccurate cardinal direction relations can be placed on
using point geometries for regions, which makes the relation true in only a
portion of the cases.
Uncertainty
Decimal and coordinate values can be misleading; their precision implies
accuracy down to the least significant digit; e.g., the centroid of Santa
Barbara is given as if it were accurate to about 1.1 microns:
POINT(-119.71416473389 34.425834655762)
Also, its area is given as 108.69662101458125 km², whose 14 decimal places imply
a resolution of about 10⁻⁸ m² (a hundredth of a square millimeter).
Clearly, there is a need for an uncertainty framework when it comes to providing
measurement data.
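The micron figure can be checked with a short calculation. The sketch below assumes roughly 111.32 km per degree, which holds for latitude; a degree of longitude is shorter away from the Equator, so this is an upper bound for the longitude value. The function names are made up for this example.

```python
from decimal import Decimal

METERS_PER_DEGREE = 111_320  # approximate length of one degree of latitude

def decimal_places(text):
    """Count the digits after the decimal point in a numeric string."""
    return max(0, -Decimal(text).as_tuple().exponent)

def implied_precision_m(coordinate_text):
    """Ground distance, in meters, represented by the last stated digit."""
    return METERS_PER_DEGREE * 10.0 ** -decimal_places(coordinate_text)

precision = implied_precision_m("-119.71416473389")
```

The longitude above has 11 decimal places, so `precision` comes out at roughly 1.1e-6 m, which matches the 1.1 microns quoted on the slide.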
Conclusion
Conclusions:
Geographic Information plays a key role in interlinking structured data on the
Web. Improving geo-data quality is pivotal to improving the functionality and
reliability of Linked Data for science, research, applications, etc.
We identified systematic errors in geographic Linked Data, discussed their
causes, and suggested ways to improve its quality and reliability.
Striking the balance between (a) keeping models simple and easy to use, so that
they enable streamlined data publishing processes, and (b) avoiding hazardous
oversimplification remains a major challenge to be addressed in future work.
Questions?
Thank you!