The document discusses data transformation which involves modifying data to improve its fitness for a particular purpose without losing information. It covers different types of transformations including content, format, and support transformations. Specific examples discussed include georeferencing locations, standardizing date and coordinate formats, and changing file formats to spreadsheets or databases to enable sharing while maintaining the original information. The goal is to prepare data for public sharing by making it interoperable and following best practices and standards.
2. Data Transformation – the process of modifying data
with the aim of improving or enabling its fitness for a
certain purpose
Ideally, no information loss
Broad term:
Content transformation
Format transformation
Support transformation
…
Examples of use:
Enable sharing of the dataset
Ease calculations and processing
FUNDAMENTALS
3. Mandatory, optional or not needed, depending on
scope of use
Data owned and used locally:
Analysis-specific transformations
Limited or local network (lab):
Analysis-specific transformations
Data exchange among colleagues
Publicly shared data:
Interoperability
Standards
Best practices: transform to standards even in local
work
FUNDAMENTALS
4. Content transformations
Schema of data storage
Scale of measurement
Several levels of difficulty
Standardization of content
Format transformations
File format: tab-delimited, CSV, zip, spreadsheet…
Nowadays it is fairly straightforward
Translation between programs is easy
Exchange of information
Support transformations
Digitization, key step in general data management process
Prone to issues
Enable processing, management, analysis, publishing and sharing of
data
FUNDAMENTALS
5. Modify the units of the data or the elements that
compose the information
Final product – same information, standard
compliant
Standard – Darwin Core (DwC)
Two specific aims:
Change elements
Complete missing elements
Primary Biodiversity Data (PBD)
Metadata
CONTENT TRANSFORMATIONS
6. Georeferencing of localities
From verbatim locality description to coordinates
Rarely needed for new records: GPS technology provides coordinates directly
Improve legacy information
Tools such as GEOLocate, Geomancer…
Coordinate systems
Modify units so that they comply with the standard
DwC for coordinates – decimal degrees (DD)
Easy – Degree-Minute-Second to DD
Hard – UTM to DD
Special attention to precision
CONTENT TRANSFORMATIONS -
GEOSPATIAL
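The "easy" case above, Degree-Minute-Second to DD, can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function name `dms_to_dd` and its parameters are assumptions chosen for the example.

```python
def dms_to_dd(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees-minutes-seconds to decimal degrees (DD).

    Southern and western hemispheres give negative values.
    `dms_to_dd` is a hypothetical helper name used only for this sketch.
    """
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range [0, 60)")
    dd = abs(degrees) + minutes / 60 + seconds / 3600
    if hemisphere.upper() in ("S", "W"):
        dd = -dd
    return dd


print(round(dms_to_dd(45, 20), 5))         # 45.33333
print(round(dms_to_dd(3, 30, 0, "W"), 5))  # -3.5
```

The result still has to be rounded to a number of decimals that matches the precision of the source value.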
7. CONTENT TRANSFORMATIONS -
GEOSPATIAL
45° 20’ – Precision 1’ (~2 km) at best
45.33333 – Precision 0.00001 (2 m), too high
45° 20’ and 45° 21’ → 45.33 and 45.35 (0.01, ~1.4 km)
45.3 (0.1, ~14 km)
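One possible way to pick the number of decimals is to keep the fewest decimals that still distinguish neighbouring source values. The sketch below follows that rule; it is an assumption made for illustration, not a rule stated by DwC, and the function names are hypothetical.

```python
import math


def decimals_for_precision(precision_deg):
    """Fewest decimals whose step is no coarser than the source precision."""
    return max(0, math.ceil(-math.log10(precision_deg)))


def round_to_source_precision(value_dd, precision_deg):
    """Round a decimal-degree value to the decimals the source can support."""
    return round(value_dd, decimals_for_precision(precision_deg))


one_minute = 1 / 60  # source recorded to the nearest arc-minute

print(decimals_for_precision(one_minute))                       # 2
print(round_to_source_precision(45 + 20 / 60, one_minute))      # 45.33
print(round_to_source_precision(45 + 21 / 60, one_minute))      # 45.35
```

With one decimal fewer, 45° 20’ and 45° 21’ would no longer be distinguishable; with more decimals, the value would claim a precision the source never had.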
8. Georeferencing of localities
From verbatim locality description to coordinates
Current GPS technology makes it easier
Improve missing fields
Use mapping tools and/or gazetteers to complete information
CONTENT TRANSFORMATIONS -
GEOSPATIAL
9. Special character encoding
Special characters in taxonomic names and/or authorships
Interoperability issues may appear
Transform these characters to a simplified version or use a
different text encoding
Higher level taxa completion
Transformation to broaden the potential uses
Search in taxonomic databases or literature
CONTENT TRANSFORMATIONS -
TAXONOMIC
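The "simplified version" option for special characters can be sketched with Python's standard `unicodedata` module. The function name `simplify_name` is hypothetical, and the approach only covers characters that decompose into a base letter plus an accent.

```python
import unicodedata


def simplify_name(text):
    """Strip diacritics: decompose each character, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(simplify_name("Linné"))   # Linne
print(simplify_name("Müller"))  # Muller
```

Letters without a decomposition, such as "ø", pass through unchanged and would need an explicit replacement table. The alternative named on the slide is to keep the original characters and agree on one encoding (for example UTF-8) when exchanging files.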
10. Order of elements
Different regions naturally use different element orders
Example: US, July 26th 2012
Might become 07-26-2012
A good parser can detect this order and apply the slight modification
needed to comply with standards
Date systems
Standard – DwC recommends ISO 8601
Different formats:
1984-09-14, 14th September 1984
34th week of 2012, 125th day of 2012
A good parser is needed to understand all possibilities
Transformations to use common system and avoid ambiguities
CONTENT TRANSFORMATIONS -
TEMPORAL
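The date cases on this slide can be sketched with the Python standard library. This is a minimal illustration with hypothetical function names. It assumes the input format is already known, that month names are in English, and that "34th week" refers to ISO week numbering; `date.fromisocalendar` requires Python 3.8 or later.

```python
import re
from datetime import date, datetime, timedelta


def us_numeric_to_iso(text):
    """MM-DD-YYYY (US order) to ISO 8601 YYYY-MM-DD."""
    return datetime.strptime(text, "%m-%d-%Y").date().isoformat()


def verbal_to_iso(text, fmt):
    """Dates with month names; ordinal suffixes (1st, 26th) are removed first."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text).replace(",", "")
    return datetime.strptime(cleaned, fmt).date().isoformat()


def day_of_year_to_iso(year, day_number):
    """Nth day of a year to a calendar date."""
    return (date(year, 1, 1) + timedelta(days=day_number - 1)).isoformat()


def iso_week_to_interval(year, week):
    """ISO week to a start/end interval (Monday to Sunday)."""
    monday = date.fromisocalendar(year, week, 1)
    sunday = date.fromisocalendar(year, week, 7)
    return f"{monday.isoformat()}/{sunday.isoformat()}"


print(us_numeric_to_iso("07-26-2012"))                   # 2012-07-26
print(verbal_to_iso("July 26th 2012", "%B %d %Y"))       # 2012-07-26
print(verbal_to_iso("14th September 1984", "%d %B %Y"))  # 1984-09-14
print(day_of_year_to_iso(2012, 125))                     # 2012-05-04
print(iso_week_to_interval(2012, 34))                    # 2012-08-20/2012-08-26
```

A string such as 07-06-2012 cannot be resolved by parsing alone; the element order has to come from knowledge of the data source.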
11. Improvement of interoperability – controlled
vocabulary
Example: basisOfRecord
Different languages, non-standard acronyms…
Transform terms to the standard vocabulary to improve data retrieval
Improvement of collections – metadata becomes
data
One man’s metadata is another man’s data
Information common to a collection might be omitted locally
Must be added when sharing
CONTENT TRANSFORMATIONS -
METADATA
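Both ideas on this slide can be sketched with plain dictionaries. The lookup keys, the institution code and the function names below are invented for illustration. The target values are terms from the Darwin Core recommended vocabulary for basisOfRecord, and the choice to map a bare "specimen" to PreservedSpecimen is an assumption.

```python
# Hypothetical lookup: the local spellings (keys) are invented examples.
BASIS_OF_RECORD = {
    "preservedspecimen": "PreservedSpecimen",
    "specimen": "PreservedSpecimen",          # assumption: local default meaning
    "espécimen preservado": "PreservedSpecimen",
    "obs": "HumanObservation",
    "observación humana": "HumanObservation",
    "fossil": "FossilSpecimen",
}


def standardize_basis_of_record(value):
    """Return the standard term, or None when the value is not in the lookup."""
    return BASIS_OF_RECORD.get(value.strip().lower())


def add_collection_metadata(record, collection_defaults):
    """Fill in collection-wide fields without overwriting existing values."""
    return {**collection_defaults, **record}


# Information shared by the whole collection, often omitted in local records.
collection_defaults = {"institutionCode": "XYZ", "country": "Spain"}  # hypothetical
record = {"scientificName": "Parus major", "basisOfRecord": "Espécimen preservado"}

shared = add_collection_metadata(record, collection_defaults)
shared["basisOfRecord"] = standardize_basis_of_record(shared["basisOfRecord"])
print(shared)
```

Returning None for unmapped values keeps unknown terms visible for manual review instead of silently guessing a standard term.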
12. Modify the storage of the data
Final product – same information, easily
exchangeable format
Two key cases:
Text to spreadsheet and spreadsheet to text
Text or spreadsheet to database
FILE FORMAT TRANSFORMATIONS
13. The most common type of format transformation
Import text file to spreadsheet or export from
spreadsheet to text file
Aims
Importing to spreadsheet – improve data processing
Exporting to text file – share data and allow others to import
easily
To be effective:
No loss of data
No transformation of content
FILE FORMAT TRANSFORMATIONS – TEXT
TO SPREADSHEET
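The two conditions above (no loss of data, no transformation of content) can be checked with a round trip. The sketch below uses Python's `csv` module instead of a spreadsheet and keeps every cell as text; the function names and sample values are hypothetical.

```python
import csv
import io


def to_tab_delimited(rows):
    """Write rows (lists of strings) as tab-delimited text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def from_tab_delimited(text):
    """Read tab-delimited text back into rows; every cell stays a string."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


rows = [
    ["catalogNumber", "eventDate", "decimalLatitude"],
    ["00123", "1984-09-14", "45.33"],  # leading zeros and date kept as text
]

round_trip = from_tab_delimited(to_tab_delimited(rows))
print(round_trip == rows)  # True
```

Keeping cells as text is what prevents content changes such as "00123" turning into 123 or a date being rewritten in a local format.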
14. From CSV or tab-delimited to spreadsheet
CSV or tab-delimited depending on the content
Modern spreadsheets have algorithms to import data in text
files
Most of the time, we can select the separator used
FILE FORMAT TRANSFORMATIONS – TEXT
TO SPREADSHEET
16. From CSV or tab-delimited to spreadsheet
Still, this step must be taken carefully:
More or fewer fields than expected
Hidden new-line characters
…
After importing, check
FILE FORMAT TRANSFORMATIONS – TEXT
TO SPREADSHEET
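The two import problems named above can be detected before the file reaches a spreadsheet. This is a minimal sketch using Python's `csv` module; `check_rows` and the sample data are hypothetical, and record numbers count the header as record 1.

```python
import csv
import io


def check_rows(text, delimiter="\t"):
    """Return (record_number, problem) pairs for suspicious records."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader)
    problems = []
    for record_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                (record_number, f"{len(row)} fields, expected {len(header)}")
            )
        if any("\n" in field or "\r" in field for field in row):
            problems.append((record_number, "embedded new-line character"))
    return problems


sample = (
    "scientificName\tlocality\tindividualCount\n"
    "Parus major\tMadrid\t2\n"
    'Pica pica\t"Sierra\nNevada"\t1\n'  # new-line hidden inside a quoted field
    "Apus apus\tSevilla\n"              # one field missing
)

for problem in check_rows(sample):
    print(problem)
# (3, 'embedded new-line character')
# (4, '2 fields, expected 3')
```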
17. After import, check
Autofilter comes in handy
“Female” value in “individualCount” field?
FILE FORMAT TRANSFORMATIONS – TEXT
TO SPREADSHEET
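The same kind of check that Autofilter gives in a spreadsheet can be sketched in code: list the distinct values of a column and flag those that do not fit the expected type. The function names and sample data are hypothetical; the sample reproduces the case above, where a shifted column puts "Female" into individualCount.

```python
import csv
import io
from collections import Counter


def distinct_values(rows, column):
    """Count each distinct value in a column, as Autofilter would list them."""
    return Counter(row[column] for row in rows)


def is_integer(text):
    """True when the text parses as a whole number."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def non_integer_values(rows, column):
    """Return (record_number, value) pairs; the header counts as record 1."""
    return [
        (number, row[column])
        for number, row in enumerate(rows, start=2)
        if not is_integer(row[column])
    ]


sample = (
    "scientificName,sex,individualCount\n"
    "Parus major,Male,2\n"
    "Pica pica,1,Female\n"  # sex and individualCount swapped
)
rows = list(csv.DictReader(io.StringIO(sample, newline="")))

print(distinct_values(rows, "individualCount"))
print(non_integer_values(rows, "individualCount"))  # [(3, 'Female')]
```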