Data Cleaning and Data Publishing Workshop 2013
18-22 February, Nairobi, Kenya
Javier Otegui
@jotegui
DATA TRANSFORMATION
FUNDAMENTALS
• Data transformation – the process of modifying data with the aim of improving or enabling its fitness for a certain purpose
• Ideally, no information loss
• Broad term:
  • Content transformation
  • Format transformation
  • Support transformation
  • …
• Examples of use:
  • Enable sharing of the dataset
  • Ease calculations and processing
FUNDAMENTALS
• Mandatory, optional or not needed, depending on scope of use
• Data owned and used locally:
  • Analysis-specific transformations
• Limited or local network (lab):
  • Analysis-specific transformations
  • Data exchange among colleagues
• Publicly shared data:
  • Interoperability
  • Standards
• Best practice: transform to standards even in local work
FUNDAMENTALS
• Content transformations
  • Schema of data storage
  • Scale of measurement
  • Several levels of difficulty
  • Standardization of content
• Format transformations
  • File format: tab-delimited, CSV, zip, spreadsheet…
  • Nowadays fairly straightforward
  • Translation between programs is easy
  • Exchange of information
• Support transformations
  • Digitization, a key step in the general data management process
  • Prone to issues
  • Enables processing, management, analysis, publishing and sharing of data
CONTENT TRANSFORMATIONS
• Modify the units of the data or the elements that compose the information
• Final product – same information, standard-compliant
• Standard – Darwin Core (DwC)
• Two specific aims:
  • Change elements
  • Complete missing elements
• Primary Biodiversity Data (PBD)
• Metadata
CONTENT TRANSFORMATIONS - GEOSPATIAL
• Georeferencing of localities
  • From verbatim locality description to coordinates
  • Often unnecessary for new records thanks to GPS technology
  • Improve legacy information
  • Tools such as GEOLocate, Geomancer…
• Coordinate systems
  • Modify units so that they comply with the standard
  • DwC for coordinates – Decimal Degree (DD)
  • Easy – Degree-Minute-Second to DD
  • Hard – UTM to DD
  • Special attention to precision
CONTENT TRANSFORMATIONS - GEOSPATIAL
45° 20’ – precision 1’ (~2 km) at best
45.33333 – precision 0.00001 (2 m), too high
45° 21’ = 45.35
45.33 (precision 0.01, ~1.4 km)
45.3 (precision 0.1, ~14 km)
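The precision example above can be sketched in Python. This is a minimal illustration rather than anything shown in the slides: the helper name `dms_to_dd` and its `hemisphere` argument are assumptions made for the example. It converts a degrees-minutes-seconds reading to decimal degrees, then rounds so that the result does not claim more precision than a reading taken to the nearest minute can support.

```python
def dms_to_dd(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees, minutes, seconds to decimal degrees (hypothetical helper)."""
    dd = abs(degrees) + minutes / 60 + seconds / 3600
    # South and West coordinates are negative in decimal degrees.
    return -dd if hemisphere in ("S", "W") else dd


raw = dms_to_dd(45, 20)        # 45.3333..., more digits than a 1' reading supports
two_decimals = round(raw, 2)   # step of 0.01 degrees
one_decimal = round(raw, 1)    # step of 0.1 degrees
print(raw, two_decimals, one_decimal)
```

Which rounding to keep is a judgment call: two decimals slightly overstate a one-minute reading, one decimal understates it.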
CONTENT TRANSFORMATIONS - GEOSPATIAL
• Georeferencing of localities
  • From verbatim locality description to coordinates
  • Current GPS technology makes it easier
  • Improve legacy information
  • Tools such as GEOLocate, Geomancer…
• Coordinate systems
  • Modify units so that they comply with the standard
  • DwC for coordinates – Decimal Degree (DD)
  • Easy – Degree-Minute-Second to DD
  • Hard – UTM to DD
  • Special attention to precision
• Improve missing fields
  • Use mapping tools and/or gazetteers to complete information
CONTENT TRANSFORMATIONS - TAXONOMIC
• Special character encoding
  • Special characters in taxonomic names and/or authorships
  • Interoperability issues may appear
  • Transform these characters to a simplified version or use a different text encoding
• Higher-level taxa completion
  • Transformation to broaden the potential uses
  • Search in taxonomic databases or literature
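One way to transform special characters to a simplified version is to decompose accented letters and drop the accent marks. The sketch below uses Python's standard `unicodedata` module. The function name `simplify` and the sample authorship strings are chosen for illustration only.

```python
import unicodedata


def simplify(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(simplify("Linné"))
print(simplify("Müller"))
```

Letters without a Unicode decomposition, such as "ø" or "ß", pass through unchanged and would need an explicit replacement table. Simplifying also loses information, so keeping the original string in UTF-8 alongside the simplified one is the safer option where the receiving system allows it.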
CONTENT TRANSFORMATIONS - TEMPORAL
• Order of elements
  • Different places naturally use a different element order
  • Example: US, July 26th 2012
    • Might become 07-26-2012
  • With a good parser, a slight modification is enough to detect and update this information to comply with standards
• Date systems
  • Standard – DwC recommends ISO 8601
  • Different formats:
    • 1984-09-14, 14th September 1984
    • 34th week of 2012, 125th day of 2012
  • A good parser is needed to understand all possibilities
  • Transform to a common system to avoid ambiguities
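The date examples above can be normalized to ISO 8601 with Python's standard `datetime` module. This is a sketch under stated assumptions: the function names are hypothetical, the US string is assumed to be in month-day-year order, and the week number is assumed to follow ISO week numbering.

```python
from datetime import date, datetime, timedelta


def us_to_iso(text):
    """Parse a US-ordered MM-DD-YYYY string and return YYYY-MM-DD."""
    return datetime.strptime(text, "%m-%d-%Y").date().isoformat()


def day_of_year_to_iso(year, day_number):
    """Turn 'Nth day of year' into a calendar date in ISO 8601."""
    return (date(year, 1, 1) + timedelta(days=day_number - 1)).isoformat()


def week_to_iso(year, week):
    """Write a week of the year in ISO 8601 week notation (assumes ISO weeks)."""
    return f"{year}-W{week:02d}"


print(us_to_iso("07-26-2012"))        # 2012-07-26
print(day_of_year_to_iso(2012, 125))  # 2012-05-04
print(week_to_iso(2012, 34))          # 2012-W34
```

A string such as 07-06-2012 cannot be disambiguated from its content alone. The element order has to be known from the source before parsing.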
CONTENT TRANSFORMATIONS - METADATA
• Improvement of interoperability – controlled vocabulary
  • Example: basisOfRecord
  • Different languages, non-standard acronyms…
  • Transform terms to the standard to improve retrieval of data
• Improvement of collections – metadata becomes data
  • One man’s metadata is another man’s data
  • Information common to a collection might be omitted locally
  • Must be added when sharing
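Mapping local spellings of basisOfRecord to the standard vocabulary can be sketched as a lookup table. The local variants below (a Spanish term, acronyms) are invented for illustration, and so is the function name. Only the target values `PreservedSpecimen` and `HumanObservation` come from the Darwin Core vocabulary.

```python
# Hypothetical local spellings mapped to Darwin Core basisOfRecord values.
BASIS_OF_RECORD = {
    "preservedspecimen": "PreservedSpecimen",
    "specimen": "PreservedSpecimen",
    "espécimen preservado": "PreservedSpecimen",
    "ps": "PreservedSpecimen",
    "humanobservation": "HumanObservation",
    "observación": "HumanObservation",
    "obs": "HumanObservation",
}


def standardize_basis_of_record(value):
    """Return the standard term, or None when the value is not recognized."""
    return BASIS_OF_RECORD.get(value.strip().lower())


print(standardize_basis_of_record(" PS "))
print(standardize_basis_of_record("Observación"))
print(standardize_basis_of_record("unknown term"))
```

Returning `None` for unrecognized values, rather than guessing, leaves those records for manual review.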
FILE FORMAT TRANSFORMATIONS
• Modify the storage of the data
• Final product – same information, easily exchangeable format
• Two key cases:
  • Text to spreadsheet and spreadsheet to text
  • Text or spreadsheet to database
FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
• The most common type of format transformation
• Import text file to spreadsheet or export from spreadsheet to text file
• Aims
  • Importing to spreadsheet – improve data processing
  • Exporting to text file – share data and allow others to import easily
• To be effective:
  • No loss of data
  • No transformation of content
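The two conditions above, no loss of data and no transformation of content, can be checked with a round trip: write the table to tab-delimited text, read it back, and compare. The sketch below uses Python's standard `csv` module. The sample rows are invented for illustration.

```python
import csv
import io

# Invented sample rows; the second record contains a tab inside a field.
rows = [
    ["scientificName", "locality", "individualCount"],
    ["Panthera leo", "Nairobi National Park, near main gate", "2"],
    ["Loxodonta africana", "Tsavo\tEast", "1"],
]

# Export to tab-delimited text in memory (a file on disk works the same way).
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)

# Import the text again and compare with the original table.
buffer.seek(0)
restored = list(csv.reader(buffer, delimiter="\t"))
print(restored == rows)
```

The writer quotes the field that contains a tab, which is what lets the reader restore it as a single value.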
FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
• From CSV or tab-delimited to spreadsheet
  • CSV or tab-delimited depending on the content
  • Modern spreadsheets have algorithms to import data from text files
  • Most of the time, we can select the separator used
FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
• From CSV or tab-delimited to spreadsheet
  • CSV or tab-delimited depending on the content
  • Modern spreadsheets have algorithms to import data from text files
  • Most of the time, we can select the separator used
• Still, this step must be taken carefully:
  • More or fewer fields than there should be
  • Hidden new-line characters
  • …
• After importing, check
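A quick check for rows with more or fewer fields than the header can be scripted before opening the file in a spreadsheet. This is a minimal sketch using Python's standard `csv` module. The function name and the sample text are assumptions made for the example.

```python
import csv
import io


def rows_with_wrong_field_count(text, delimiter="\t"):
    """Return (line number, field count) for rows that do not match the header."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader)
    problems = []
    for row in reader:
        if len(row) != len(header):
            problems.append((reader.line_num, len(row)))
    return problems


# Invented sample: line 3 is one field short, line 4 has one field too many.
sample = (
    "catalogNumber\tsex\tindividualCount\n"
    "A1\tFemale\t1\n"
    "A2\tMale\n"
    "A3\tFemale\t2\textra\n"
)
print(rows_with_wrong_field_count(sample))
```

A hidden new-line character inside an unquoted field splits one record into two short rows, so it shows up in the same report.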
FILE FORMAT TRANSFORMATIONS – TEXT TO SPREADSHEET
After import, check
Autofilter comes in handy
“Female” value in the “individualCount” field??
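The same kind of check that Autofilter makes visible can be scripted: list every value in a numeric field that is not a whole number. The sketch below assumes a tab-delimited file with an `individualCount` column. The function name and the sample records are invented for illustration.

```python
import csv
import io


def non_integer_values(text, field="individualCount", delimiter="\t"):
    """Return (line number, value) for non-empty values that are not whole numbers."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    flagged = []
    for record in reader:
        value = (record.get(field) or "").strip()
        if value and not value.isdigit():
            flagged.append((reader.line_num, value))
    return flagged


# Invented sample: in the second record the values have shifted one column.
sample = (
    "catalogNumber\tindividualCount\tsex\n"
    "A1\t1\tFemale\n"
    "A2\tFemale\t\n"
)
print(non_integer_values(sample))
```

A flagged value like this usually points to a shifted column during import, not to a typing error in a single cell.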
