Incomplete and missing data in geoscience databases: towards the OWA relational model?
Stephen Henley, Resources Computing International Ltd
Presented at the eSI workshop "The Closed World of Databases Meets the Open World of the Semantic Web", Edinburgh, 12-13 Oct 2006
Not just geoscience
Geoscience data
Typical imprecise data
  Sample  SiO2 %  Cu ppm
  #101    53.5    128
  #102    49.2    185
  #103    66.3    163
What is “49.2% SiO2”?
Each value has its own error distribution
  Sample  SiO2 %  Cu ppm
  #101    ~53.5   ~128
  #102    ~49.2   ~185
  #103    ~66.3   ~163
What about queries?
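As a minimal sketch of such a query: assuming each reported value is the mean of a Gaussian error distribution (the slides do not say which distribution applies, and the standard deviation of 0.5 used here is a hypothetical figure, not a laboratory value), a test such as "SiO2 > 50" returns a probability rather than TRUE or FALSE. The function names are illustrative only.

```python
import math


def prob_greater(mean: float, sd: float, threshold: float) -> float:
    """P(X > threshold) for X ~ Normal(mean, sd). Gaussian errors are an assumption."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def prob_a_greater_b(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """P(A > B) for independent normal A and B.

    The difference A - B is normal with mean (mean_a - mean_b)
    and standard deviation sqrt(sd_a**2 + sd_b**2).
    """
    return prob_greater(mean_a - mean_b, math.hypot(sd_a, sd_b), 0.0)


# Sample #102: reported SiO2 of 49.2 %, hypothetical standard deviation of 0.5 %.
p_over_50 = prob_greater(49.2, 0.5, 50.0)
print(f"p(SiO2 of #102 > 50.0) = {p_over_50:.3f}")

# Comparing two uncertain values: is #101 richer in SiO2 than #102?
p_101_over_102 = prob_a_greater_b(53.5, 0.5, 49.2, 0.5)
print(f"p(SiO2 of #101 > SiO2 of #102) = {p_101_over_102:.3f}")
```

A threshold equal to the reported value gives a probability of exactly 0.5 under this model, which shows why a crisp true/false answer is not available for such data.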
Incomplete data
Incomplete data
  Hole_ID  Total_Depth  D_green
  #301     320.0        250.0
  #302     300.0        270.0
  #303     200.0        Unknown ?
Incomplete data
  Hole_ID  Total_Depth  D_green
  #301     320.0        250.0
  #302     300.0        270.0
  #303     200.0        > 200.0
Incomplete data
Querying a tuple with D_green value “>200”
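The behaviour described in the speaker's notes, where a partial value sometimes yields a definite TRUE or FALSE and sometimes UNKNOWN, can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the `Partial` class, its field names and the `greater` function are all hypothetical, and a partial value is modelled simply as a range of possible depths.

```python
import math
from dataclasses import dataclass
from enum import Enum


class TV(Enum):
    T = "TRUE"
    F = "FALSE"
    U = "UNKNOWN"


@dataclass(frozen=True)
class Partial:
    """A value known only to lie between low and high (hypothetical representation)."""
    low: float
    high: float
    low_open: bool = False  # True means the value is strictly greater than low


def exact(v: float) -> Partial:
    return Partial(v, v)


def more_than(v: float) -> Partial:
    # ">200.0": deeper than 200.0, with no known upper limit
    return Partial(v, math.inf, low_open=True)


def greater(value: Partial, threshold: float) -> TV:
    """Truth value of the test: value > threshold."""
    if value.low > threshold or (value.low == threshold and value.low_open):
        return TV.T  # every possible value exceeds the threshold
    if value.high <= threshold:
        return TV.F  # no possible value exceeds the threshold
    return TV.U      # some possible values do, some do not


d_green = {"#301": exact(250.0), "#302": exact(270.0), "#303": more_than(200.0)}
for depth in (150.0, 260.0):
    print(depth, {hole: greater(v, depth).value for hole, v in d_green.items()})
```

For hole #303 the query "D_green > 150" is definitely true, while "D_green > 260" cannot be decided from the recorded information.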
Missing data
Missing data item
  Sample  SiO2 %  Cu ppm
  #101    53.5    128
  #102    -       185
  #103    66.3    163
Missing data item
CWA: Avoiding NULL
The default-value ‘solution’ as proposed by Date
Proposals by Darwen and Pascal
Decompose into ‘null-free’ relations
  Sample  SiO2 %  Cu ppm
  #101    53.5    128
  #102    -       185
  #103    66.3    163
decomposes into
  Sample  SiO2 %      Sample  Cu ppm
  #101    53.5        #101    128
  #103    66.3        #102    185
                      #103    163
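The decomposition can be sketched in a few lines. This is an illustrative sketch only: the `decompose` function and the column names are hypothetical, and Python's `None` stands in for the missing-data placeholder shown as "-" on the slide.

```python
from typing import Dict, List, Tuple

# The original relation, with None as the missing-data placeholder.
samples = [
    {"Sample": "#101", "SiO2_pct": 53.5, "Cu_ppm": 128},
    {"Sample": "#102", "SiO2_pct": None, "Cu_ppm": 185},
    {"Sample": "#103", "SiO2_pct": 66.3, "Cu_ppm": 163},
]


def decompose(rows: List[dict], key: str, attributes: List[str]) -> Dict[str, List[Tuple]]:
    """Split one relation into a binary (key, attribute) relation per attribute.

    A row contributes a tuple to an attribute's relation only when that
    attribute has a value, so no placeholder survives the decomposition.
    """
    relations: Dict[str, List[Tuple]] = {attr: [] for attr in attributes}
    for row in rows:
        for attr in attributes:
            if row[attr] is not None:
                relations[attr].append((row[key], row[attr]))
    return relations


relations = decompose(samples, "Sample", ["SiO2_pct", "Cu_ppm"])
for name, tuples in relations.items():
    print(name, tuples)
```

The placeholder for sample #102 is not stored anywhere after the split; it is represented only by the absence of a tuple in the SiO2 relation.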
In this way …
The CWA states that …
… so
No tuple for sample #102 in the SiO2 relation
  Sample  SiO2 %      Sample  Cu ppm
  #101    53.5        #101    128
  #103    66.3        #102    185
                      #103    163
Under the CWA …
  Sample  SiO2 %          Sample  SiO2 %
  #102    51.2      or    #102    45.5
This implies that …
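The bullet text of these slides did not survive extraction, so the following is only a sketch of the argument as the speaker's notes describe it, reading the "corresponding proposition" as "Sample N contains X% SiO2". Under the CWA, every proposition without a tuple is false, so any candidate value for sample #102 is rejected. Under an open-world reading, the same absence yields UNKNOWN. The function names are hypothetical, and the open-world function additionally assumes that each sample has at most one SiO2 value.

```python
from enum import Enum


class TV(Enum):
    T = "TRUE"
    F = "FALSE"
    U = "UNKNOWN"


# The null-free SiO2 relation: there is no tuple for sample #102.
sio2 = {("#101", 53.5), ("#103", 66.3)}


def holds_cwa(relation: set, sample: str, value: float) -> TV:
    """Closed world: a proposition with no matching tuple is false."""
    return TV.T if (sample, value) in relation else TV.F


def holds_owa(relation: set, sample: str, value: float) -> TV:
    """Open world: absence of a tuple is not evidence of falsity."""
    if (sample, value) in relation:
        return TV.T
    # Assumption: one SiO2 value per sample, so a different recorded
    # value for the same sample contradicts the proposition.
    if any(s == sample for s, _ in relation):
        return TV.F
    return TV.U


for candidate in (51.2, 45.5):
    print(candidate,
          "CWA:", holds_cwa(sio2, "#102", candidate).value,
          "OWA:", holds_owa(sio2, "#102", candidate).value)
```

Under the closed-world function every possible SiO2 percentage for sample #102 is declared false, which is the uncomfortable consequence the notes point to when the database is read as a record of what is, rather than of what is known.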
Codd was right
So let’s take a look at the truth tables
CWA - 2VL: T represents TRUE; F represents FALSE
2VL with probabilities: T represents p=1; F represents p=0. p(A ∧ B) and p(A ∨ B) in general need statistical computation
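The reason p(A ∧ B) and p(A ∨ B) need statistical computation is that they depend on how A and B are related, not just on p(A) and p(B). The sketch below is not from the slides: it shows the special case where A and B are assumed independent, alongside the standard Fréchet bounds that hold whatever the dependence. Function names are illustrative.

```python
from typing import Tuple


def and_independent(pa: float, pb: float) -> float:
    """p(A and B), valid only if A and B are independent (an assumption)."""
    return pa * pb


def or_independent(pa: float, pb: float) -> float:
    """p(A or B), valid only if A and B are independent (an assumption)."""
    return pa + pb - pa * pb


def and_bounds(pa: float, pb: float) -> Tuple[float, float]:
    """Frechet bounds on p(A and B) with no assumption about dependence."""
    return max(0.0, pa + pb - 1.0), min(pa, pb)


def or_bounds(pa: float, pb: float) -> Tuple[float, float]:
    """Frechet bounds on p(A or B) with no assumption about dependence."""
    return max(pa, pb), min(1.0, pa + pb)


pa, pb = 0.7, 0.6
print("AND, if independent:", and_independent(pa, pb), "bounds:", and_bounds(pa, pb))
print("OR,  if independent:", or_independent(pa, pb), "bounds:", or_bounds(pa, pb))

# With p restricted to 0 and 1 the bounds collapse to the ordinary 2VL tables.
print("T AND F:", and_bounds(1.0, 0.0), " T OR F:", or_bounds(1.0, 0.0))
```

When the probabilities are restricted to 0 and 1 the lower and upper bounds coincide, which recovers the ordinary two-valued truth tables as a special case.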
OWA - 3VL: T represents TRUE; F represents FALSE; U represents UNKNOWN
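The truth tables on this slide were lost in extraction. As a sketch, the tables below assume Kleene's strong three-valued logic, which is also what SQL uses for AND, OR and NOT; whether the original slide used exactly these tables is an assumption. U is modelled as a value of its own in an enumeration rather than as a null, in line with the speaker's note that U is a truth value and not a missing-data placeholder.

```python
from enum import Enum
from typing import Callable, List


class TV(Enum):
    T = "T"
    F = "F"
    U = "U"


def and3(a: TV, b: TV) -> TV:
    if a is TV.F or b is TV.F:
        return TV.F  # one FALSE operand settles the conjunction
    if a is TV.U or b is TV.U:
        return TV.U
    return TV.T


def or3(a: TV, b: TV) -> TV:
    if a is TV.T or b is TV.T:
        return TV.T  # one TRUE operand settles the disjunction
    if a is TV.U or b is TV.U:
        return TV.U
    return TV.F


def not3(a: TV) -> TV:
    return {TV.T: TV.F, TV.F: TV.T, TV.U: TV.U}[a]


def table(op: Callable[[TV, TV], TV]) -> List[List[str]]:
    """Rows and columns are both ordered T, U, F."""
    order = [TV.T, TV.U, TV.F]
    return [[op(a, b).value for b in order] for a in order]


for name, op in (("AND", and3), ("OR", or3)):
    print(name)
    for row in table(op):
        print("  ", " ".join(row))
print("NOT", [not3(v).value for v in (TV.T, TV.U, TV.F)])
```

Restricting the operands to T and F reproduces the two-valued tables, so this logic extends the CWA case rather than replacing it.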
Conclusions
Some final words from E.F. Codd (1990)


Editor's notes

  1. What I am going to talk about is an old problem. It’s one that we were aware of when I was working with Keith Jeffery on relational systems development 30 years ago. Unfortunately, rather than solving the problem, insistence on the relational model being restricted to the closed world assumption is pushing us backwards.
  2. - as are data in many other fields too
  3. Laboratory analyses are typically reported, and recorded in databases, as single numbers with greater or lesser precision, but the accuracy is usually ignored.
  4. … so this number 49.2 may well be what has been reported by the laboratory but of course it is unlikely to be exactly the ‘true’ value for the silica content of sample 102.
  5. Each data value consists of the reported ‘mean’ or ‘best estimate’ together with a definition of the error distribution about that estimate.
  6. With such data, a simple > test will no longer give absolute true or false results, but computed probability estimates based on the error distributions of the operands.
  7. OK, let’s move on to a different problem. Three drillholes. What is the depth of the ‘top of green’ in drill hole 303 ?
  8. Is it actually unknown ? Not completely. We do know that it hasn’t been intersected to the maximum depth that hole 303 has been drilled. We have some information about it.
  9. So instead of ‘missing data’ or the dreaded NULL we need to put a ‘partial’ data value in there.
  10. Let’s just take another look at the geology before moving on.
  11. So we have the strange situation of a data value which sometimes gives absolute true or false truth values in queries, and other times gives unknown.
  12. Now to the thorny question of completely missing data items.
  13. OK, sample 102 wasn’t analysed for SiO2, by mistake, or oversight, or the instrument malfunctioned. The value is just missing in the purest sense of the word.
  14. Among those who insist that relational means closed-world, there have been some very inventive methods devised to avoid the NULL representation for missing data – and a strong feeling that all data ought really to be complete.
  15. The default-value or special-value solution simply does not hold water. Not only is it more complicated than a global ‘null’ representation, but it also requires applications (or the user) to carry and manipulate a lot of extra garbage. At best it consists of a domain-specific null value which needs the same sort of processing as a global null.
  16. Effectively what we are looking at is normalisation to 6NF, where all relations are, at most, binary.
  17. So at least we have established that any occurrence of NULL can be converted to a missing tuple. Does this actually get us anywhere ?
  18. This is important. We can discuss later what the “corresponding proposition” is or ought to be, because this may lie at the heart of the question.
  19. Decomposition has eliminated the tuple that we had, with the missing-data placeholder for SiO2 in sample 102.
  20. Here is the nub of it. What is the “corresponding proposition”? “Sample N contains X% SiO2” or “Sample N is reported to contain X% SiO2”? In other words, is the database a record of (a) what is or of (b) what is known? This reasoning works for (a). For (b) it is perfectly valid for a value to be unknown and for a null or some other missing-value placeholder to be used, so the problem does not arise.
  21. Basically the same, but allowing a continuum of values between True and False.
  22. Please note that U is a truth value. It should not be confused with NULL or any missing-data placeholder. Indeed, it would be perfectly valid to have a type 3VL domain in which T, F, and U all appear as genuine values, as well as a missing-data placeholder (let’s call it null?).