Marrying Models and Data: Adventures in Modeling, Data Wrangling and Software Design
Anne E. Thessen, Elizabeth North, Sean McGinnis and Ian Mitchell
LTRANS
• Lagrangian Transport Model
• Open Source
• http://northweb.hpl.umces.edu/LTRANS.htm
• Used to predict transport of particles, subsurface hydrocarbons, and surface oil slicks (in development)
GISR Deepwater Horizon Database
[Figure: Number of Data Points]
• Over 7 million georeferenced data points
• Over 9 GB
• Over 2000 analytes and parameters
Database Contents
• Oceanographic Data
– Salinity
– Temperature
– Oxygen
– More
• Chemistry Data
– Hydrocarbons
– Heavy metals
– Nutrients
– More
• Air
• Water
• Tissue
• Sediment/Soil
Example Plots for One Analyte
[Plots: Naphthalene, August 1-15, 2010; units: mg l⁻¹]
Heterogeneity
• Heterogeneity
– Terms
– Units
– Format
– Structure

[Figure: names used for one analyte: Benzoic Acid, Carboxybenzene, E210, Dracylic Acid, C7H6O2; figure labels: 2,016 and 1,848]
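The term problem on this slide (several names for the same analyte) can be handled with a lookup table that maps each variant to one preferred name. The following is a minimal sketch, not the GISR implementation: the table `SYNONYMS` and the function `normalize_analyte` are hypothetical names, and only the benzoic acid variants come from the slide.

```python
from typing import Optional

# Hypothetical synonym table. Only the benzoic acid variants are taken from
# the slide; a real table would need curation by a chemist.
# Caveat: a molecular formula such as C7H6O2 can be shared by isomers, so
# mapping it to a single name is an assumption about this data set.
SYNONYMS = {
    "benzoic acid": "Benzoic Acid",
    "carboxybenzene": "Benzoic Acid",
    "e210": "Benzoic Acid",
    "dracylic acid": "Benzoic Acid",
    "c7h6o2": "Benzoic Acid",
}


def normalize_analyte(name: str) -> Optional[str]:
    """Return the preferred name for a reported analyte, or None if unknown."""
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(name.split()).lower()
    return SYNONYMS.get(key)


if __name__ == "__main__":
    for reported in ["Carboxybenzene", "  E210 ", "dracylic  ACID", "n-Decane"]:
        print(repr(reported), "->", normalize_analyte(reported))
```

Returning `None` for unknown names, rather than guessing, keeps unmatched terms visible so they can be reviewed by hand.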
Heterogeneity
• Heterogeneity
– Terms
– Units
– Format
– Structure

[Figure: units reported for one analyte, n-Decane: parts per trillion, ppbv, μg/g, ng/g, ppt, mg/kg, μg/kg, ppb; figure labels: 103 and 66]
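Some of the units on this slide are plain mass fractions and can be converted to a common unit by a fixed factor; others (ppbv, and a bare "ppt" or "ppb") cannot be converted without knowing whether they are by mass or by volume. The following is a minimal sketch under that assumption. The choice of μg/kg as the target unit and the names `TO_UG_PER_KG` and `to_ug_per_kg` are illustrative, not taken from the talk.

```python
# Conversion factors to micrograms per kilogram for units that are
# unambiguous mass fractions. The target unit is an arbitrary choice
# made for this sketch.
TO_UG_PER_KG = {
    "mg/kg": 1000.0,       # 1 mg/kg = 1000 ug/kg
    "\u03bcg/g": 1000.0,   # 1 ug/g  = 1000 ug/kg
    "\u03bcg/kg": 1.0,
    "ng/g": 1.0,           # 1 ng/g  = 1 ug/kg
}


def to_ug_per_kg(value: float, unit: str) -> float:
    """Convert a mass-fraction value to ug/kg, or raise if the unit is unclear."""
    # Treat the micro sign (U+00B5) and Greek mu (U+03BC) as the same symbol.
    key = unit.strip().replace("\u00b5", "\u03bc")
    if key not in TO_UG_PER_KG:
        # ppbv, ppt, ppb and similar need extra metadata (mass or volume
        # basis, sample medium) before they can be compared.
        raise ValueError(f"cannot convert {unit!r} without more metadata")
    return value * TO_UG_PER_KG[key]


if __name__ == "__main__":
    print(to_ug_per_kg(2.0, "mg/kg"))
    print(to_ug_per_kg(5.0, "ng/g"))
```

Raising an error for ambiguous units, instead of assuming a basis, means such values are flagged for follow-up with the data provider.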
Metadata
• Metadata
– Missing
– Not computable

[Figure: the value 0.23 labeled with its metadata: Name, Unit, Method, Location, Attribution, Uncertainty, Time]
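The point of this slide, that a bare value such as 0.23 is unusable without its metadata, can be turned into a simple completeness check. The following is a minimal sketch: the field names mirror the labels on the slide, while the record layout (a plain dict) and the function name `missing_metadata` are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Metadata labels from the slide, used here as hypothetical dict keys.
REQUIRED_FIELDS = (
    "name", "unit", "method", "location",
    "attribution", "uncertainty", "time",
)


def missing_metadata(record: Dict[str, Any]) -> List[str]:
    """List the required metadata fields that are absent or empty."""
    return [
        field for field in REQUIRED_FIELDS
        if record.get(field) is None or record.get(field) == ""
    ]


if __name__ == "__main__":
    # Illustrative record only; not a real data point from the database.
    record = {"value": 0.23, "name": "Naphthalene", "unit": "mg/l", "method": ""}
    print(missing_metadata(record))
```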
The Great Data Hunt
• Discovery
– Project directory
– Funding agency records
– Literature
– Internet search

[Figure: Total Data Sets Discovered, n = 140]
The Great Data Hunt
• Discovery
– Project directory
– Funding agency records
– Literature
– Internet search

We identified 90 relevant data sets.
The Great Data Hunt
• Discovery
• Access
– Online
– Ask directly
– Literature

We received responses to 59% of our inquiries and obtained 34% of the identified data sets.

41% of those responses were received within 24 hours and 29% were received within the first week.

[Figure: Frequency vs. Days to Response]
The Great Data Hunt
• Discovery
• Access
– Online
– Ask directly
– Literature

0-20 email exchanges per data set

We received responses to 59% of our inquiries and obtained 34% of the identified data sets.

41% of those responses were received within 24 hours and 29% were received within the first week.

[Figure: Frequency vs. Number of Emails]
The Great Data Hunt
• Discovery
• Access
• Citation
– Literature
– Existing requirements
– Generate new
Why didn’t people share?
• Paper not published yet – 35%
• Passed the buck – 20%
• Too busy – 10%
• Medical problems – 10%
• Poor quality – 10%
Why should anyone share?
• Mandated
• Increased citation and visibility
• Early access to GISR database
• New insights
Future Work
• Incorporate data as available
• Incorporate user feedback
• Web Access
• Users’ Guide
• Manuscripts
Thank You to Data Providers
• NOAA/NOS Office of Response and Restoration
• Commonwealth Scientific and Industrial Research Organization
• Environmental Protection Commission of Hillsborough County
• National Estuarine Research Reserves
• Sarah Allan
• Kim Anderson
• Jamie Pierson
• Nan Walker
• Ed Overton
• Richard Aronson
• Ryan Moody
• Charlotte Brunner
• William Patterson
• Kyeong Park
• Kendra Daly
• Liz Kujawinski
• Jana Goldman
• Jay Lunden
• Samuel Georgian
• Leslie Wade

• Joe Montoya
• Terry Hazen
• Mandy Joye
• Richard Camilli
• Chris Reddy
• John Kessler
• David Valentine
• Tom Soniat
• Matt Tarr
• Tom Bianchi
• Tom Miller
• Elise Gornish
• Terry Wade
• Steven Lohrenz
• Dick Snyder
• Paul Montagna
• Patrick Bieber
• Wei Wu
• Mitchell Roffer
• Dongjoo Joung
• Mark Williams
• Don Blake
• Jordan Pino

• John Valentine
• Jeffrey Baguely
• Gary Ervin
• Erik Cordes
• Michaeol Perdue
• Bill Stickle
• Andrew Zimmerman
• Andrew Whitehead
• Alice Ortmann
• Alan Shiller
• Laodong Guo
• A. Ravishankara
• Ken Aikin
• Tom Ryerson
• Prabhakar Clement
• Christine Ennis
• Eric Williams
• Ed Sherwood
• Julie Bosch
• Wade Jeffrey
• Chet Pilley
• Just Cebrian
• Ambrose Bordelon
Questions?

SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Marrying models and data: Adventures in Modeling, Data Wrangling and Software Design

  • 1. Marrying Models and Data: Adventures in Modeling, Data Wrangling and Software Design Anne E. Thessen, Elizabeth North, Sean McGinnis and Ian Mitchell
  • 2.
  • 3. LTRANS • Lagrangian Transport Model • Open Source • http://northweb.hpl.umces.edu/LTRANS.htm • Used to predict transport of particles, subsurface hydrocarbons, and surface oil slicks (in development)
  • 4. GISR Deepwater Horizon Database Number of Data Points • Over 7 million georeferenced data points • Over 9 GB • Over 2000 analytes and parameters
  • 5. Database Contents • Oceanographic Data – Salinity – Temperature – Oxygen – More • Chemistry Data – Hydrocarbons – Heavy metals – Nutrients – More
  • 6. Database Contents • Oceanographic Data – Salinity – Temperature – Oxygen – More • Chemistry Data – Hydrocarbons – Heavy metals – Nutrients – More • Air • Water • Tissue • Sediment/Soil
  • 7. Example Plots for One Analyte Naphthalene, August 1-15, 2010 mg l-1
  • 8. Heterogeneity • Heterogeneity – Terms – Units – Format – Structure Benzoic Acid Carboxybenzene E210 Benzoic Acid Dracylic Acid C7H6O2 2,016 1,848
  • 9. Heterogeneity • Heterogeneity n-Decane – Terms – Units – Format – Structure 103 parts per trillion ppbv 66 μg/g ng/g ppt mg/kg μg/kg ppb
  • 10. Metadata • Metadata – Missing – Not computable Name Unit Location 0.23 Attribution Time
  • 11. Metadata • Metadata – Missing – Not computable Name Unit Method Location 0.23 Attribution Uncertainty Time
  • 12. The Great Data Hunt • Discovery – Project directory – Funding agency records – Literature – Internet search Total Data Sets Discovered n = 140
  • 13. The Great Data Hunt • Discovery – Project directory – Funding agency records – Literature – Internet search We identified 90 relevant data sets Relevant
  • 14. The Great Data Hunt • Discovery • Access – Online – Ask directly – Literature Relevant
  • 15. The Great Data Hunt • Discovery • Access – Online – Ask directly – Literature We received responses to 59% of our inquiries and obtained 34% of the identified data sets Relevant
  • 16. The Great Data Hunt • Discovery • Access – Online – Ask directly – Literature We received responses to 59% of our inquiries and obtained 34% of the identified data sets. 41% of those responses were received within 24 hours and 29% were received within the first week. (Chart labels: Frequency, Days to Response)
  • 17. The Great Data Hunt • Discovery • Access – Online – Ask directly – Literature 0-20 email exchanges per data set. We received responses to 59% of our inquiries and obtained 34% of the identified data sets. 41% of those responses were received within 24 hours and 29% were received within the first week. (Chart labels: Frequency, Number of Emails)
  • 18. The Great Data Hunt • Discovery • Access • Citation – Literature – Existing requirements – Generate new
  • 19. Why didn’t people share? • Paper not published yet – 35% • Passed the buck – 20% • Too busy – 10% • Medical problems – 10% • Poor quality – 10%
  • 20. Why should anyone share? • Mandated • Increased citation and visibility • Early access to GISR database • New insights
  • 21. Future Work • Incorporate data as available • Incorporate user feedback • Web Access • Users’ Guide • Manuscripts
  • 22. Thank You to Data Providers • NOAA/NOS Office of Response and Restoration • Commonwealth Scientific and Industrial Research Organization • Environmental Protection Commission of Hillsborough County • National Estuarine Research Reserves • Sarah Allan • Kim Anderson • Jamie Pierson • Nan Walker • Ed Overton • Richard Aronson • Ryan Moody • Charlotte Brunner • William Patterson • Kyeong Park • Kendra Daly • Liz Kujawinski • Jana Goldman • Jay Lunden • Samuel Georgian • Leslie Wade • Joe Montoya • Terry Hazen • Mandy Joye • Richard Camilli • Chris Reddy • John Kessler • David Valentine • Tom Soniat • Matt Tarr • Tom Bianchi • Tom Miller • Elise Gornish • Terry Wade • Steven Lohrenz • Dick Snyder • Paul Montagna • Patrick Bieber • Wei Wu • Mitchell Roffer • Dongjoo Joung • Mark Williams • Don Blake • Jordan Pino • John Valentine • Jeffrey Baguely • Gary Ervin • Erik Cordes • Michaeol Perdue • Bill Stickle • Andrew Zimmerman • Andrew Whitehead • Alice Ortmann • Alan Shiller • Laodong Guo • A. Ravishankara • Ken Aikin • Tom Ryerson • Prabhakar Clement • Christine Ennis • Eric Williams • Ed Sherwood • Julie Bosch • Wade Jeffrey • Chet Pilley • Just Cebrian • Ambrose Bordelon

Editor's notes

  1. Hello, my name is Anne Thessen and I’m going to speak to you about some model development that we’ve been doing. First I would like to acknowledge my coauthors, Elizabeth North, Sean McGinnis and Ian Mitchell. We are part of the Gulf Integrated Spill Research Consortium. Our work was funded by the Gulf of Mexico Research Initiative. We received institutional support from Arizona State University and the University of Maryland Center for Environmental Science. I recently started my own business called The Data Detektiv that does this type of data work. If you like what you see and have need of this sort of expertise in your project please see me after the talk or swing by my exhibitor booth.
  2. On April 20, 2010 the Deepwater Horizon oil rig exploded in the Gulf of Mexico. The broken well flowed for 87 days before being capped and released an estimated 4.9 million barrels of crude. One of the more unusual aspects of this particular spill was that oil was being released under 1600 m of water. The surface slick appeared on April 22, 2010. This is a satellite image of the surface slick near the Mississippi River delta from NASA’s Terra satellite on May 24. The slick directly impacted about 180,000 sq km and by early June, oil had washed up on 200 km of coastline. Understanding how oil is transported in the Gulf of Mexico is key to being able to launch an effective spill response.
  3. The goal of our project is to modify an existing Lagrangian transport model, called LTRANS, that has been used to predict transport of other types of particles, so that it can be effectively used to predict where oil will go in the event of a Gulf of Mexico spill. In this panel we see a snapshot of particle distributions from a model run that simulates subsurface hydrocarbons. We want to know if the particles are in the right place and if we can estimate degradation rates from the data.
  4. To determine the efficacy of the model, we are comparing the output to field data collected after the Deepwater Horizon explosion. To accomplish this, we are compiling a database of oceanographic and hydrocarbon field measurements called the GISR Deepwater Horizon database. It can be queried to get the output we need for comparison. It is over 9 GB in size and contains over 7 million georeferenced data points gathered from published and unpublished sources, government databases, volunteer networks and individual researchers.
  5. The database contains multiple types of oceanographic and chemistry data. Here are some examples.
  6. The database contains data from air, water, tissue and sediment or soil.
  7. This is an example of the type of coverage we have in the database for a single analyte, Naphthalene. This panel is a top view showing the spatial coverage of Naphthalene samples. The red indicates Naphthalene was detected while the open circles indicate that Naphthalene was tested for, but not detected. This panel shows the depth distribution while the color indicates concentration in micrograms per liter. The green square on both panels is where the oil was released. We have over 10,000 Naphthalene measurements in the database. Assembling a database of this nature was a huge challenge. We had to develop some interesting algorithms and methods to find, access and integrate all of the data. I’m going to describe that process and, if we have time, show some LTRANS output and field data comparison visualizations.
  8. The first challenge was heterogeneity between data sets, which was quite broad. We needed to bring everything together into a single database and then be able to effectively query that database. Every provider had their own favorite terms and units and we normalized them algorithmically. Terms were normalized using a Google Fusion Table that lists a “preferred name” and all of the synonyms for that name. An algorithm can read the term used in the original data set, find it in the Fusion Table and then tag the database entry with the “preferred name”. That way when the database is queried for a particular analyte, we don’t miss data because one synonym is used instead of another. For example, benzoic acid has five synonyms in the table. We had over 2,000 terms before reconciliation and 1,848 terms after reconciliation. We expect this number to decrease further once the Fusion Table is complete.
  9. The units are handled in the same way, except there is a transformation step, wherein some math is done to convert the value to the “preferred unit”. This allows us to normalize terms and units without changing the original data set. For example, n-Decane was represented by 6 different units. The number of different units in the database decreased by almost half after reconciliation. Formats varied from Access databases and shape files to PDF tables. All data sets, except for the databases, were normalized to our schema and then imported into an SQL database. The databases were transformed to SQL and then joined. Sometimes this had to be done manually. Sometimes we were able to write scripts to help.
  10. The second challenge was metadata (or lack thereof). Metadata was often missing, in a separate location or in a separate format from the actual data. At a minimum, for the database to work, we needed to know the basics of what, where and when.
  11. Ideally, we were able to get more information, such as methods and uncertainty. Compiling the metadata was an exercise in detective work. Even simple information like coordinates and dates was sometimes buried. Metadata about methods were often found in narrative form in published literature. Sometimes metadata were in companion files, such as XML. Sometimes I would just have to contact the data provider directly. This was often a very time-consuming process because they frequently had to hunt for the data or direct me to someone else like a student or a technician.
  12. Third, locating and accessing data was also a big challenge. The first step is knowing what data sets to ask for and who to ask. A significant number of data sets were not in a repository or part of the published literature - and we knew that this would be the case. The question is, how do you learn of a data set’s existence if it hasn’t been published or deposited? We did a couple of things. One of the Gulf coast SeaGrants has a project directory for the Deepwater Horizon spill. The directory was not comprehensive and depended on folks self reporting their projects, but it was a start. We also looked for awarded projects through funding agency web sites to find folks who were likely to have data. Of course there was a literature search and a basic internet search for folks claiming to do research on the spill. That gave us a list of contacts. We ended up making contact with 140 projects. We approached each contact via email to find out if they had data and if it was relevant.
  13. Many of these contacts revealed that their data sets were not relevant to our model development. For example, measurements of hydrocarbons in fish tissues were not going to help with our model, so we did not pursue those data sets. Sometimes two different contacts were working on the same data set, so what we thought were two data sets were really only one. At the end of the process we identified 90 relevant data sets.
  14. Once the data sets were discovered, they had to be accessed. Some were freely available and were simply downloaded. Some data sets were in repositories that may or may not involve working with the data manager to gain access. Others were published as a table in supplementary material. Most data sets involved communicating with the provider to get the complete data set and the metadata. There were a few instances where the provider instructed us to take the data from the figure, but we tried to avoid doing that.
  15. Out of those 90 relevant data sets, we received responses to 59% of our inquiries and were able to obtain 34% of the data sets. The bottom chart is a breakdown of the 90 data sets. The dark orange represents the data sets we asked for and for which we received both a response and the data. The dark purple are the data sets that were freely available online, so no response was necessary. The light orange represents the data sets that were denied us. The light purple represents the inquiries that went completely unanswered. You can look at this another way. The orange represents inquiries that got a response and the purple represents the inquiries that did not get a response. The dark colors represent the inquiries that resulted in data while the light colors represent inquiries that did not result in data. This is quite good compared to sharing in some communities, which can be as low as 10%.
  16. Most of the responses we received were quite timely. 41% were received within the first 24 hours and 29% were received within 2-7 days.
  17. This process can be quite labor intensive. Some data sets required up to 20 email exchanges to get all of the data and metadata situated. The average was 7.6 emails.
  18. An important part of reusing other people’s data is citing them appropriately. Data set citation is still a relatively new concept, but it’s starting to gain momentum. We worked with each of the data providers to find out how they wanted to be cited. Typically, if the data had a publication, the provider wanted the publication to be cited. Not all data sets had a publication. Data sets in repositories often had a citation already developed and provided by the repository. There were plenty of data sets that were unpublished and not in a repository. For these data sets we worked with the provider to generate a citation. This involved encouraging the provider to deposit in FigShare, which is a free place to deposit data and receive a citable, unique identifier for your data. If the data set was already online, like on a personal web site, the access URL was given in the citation. We also plan to develop a citation for the database as a whole with all of the providers as authors. In the future, when a user executes a query, they will also be presented with a list of citations for the data sets that appear in the query results. So they can cite the database as a whole or the individual data sets they actually use. We hope, through tools like ImpactStory, that providers can start getting credit for the data sets they generate and share.
  19. We were actively denied data by 24% of the 90 contacts made. The other 40% did not respond to our requests at all. We know nothing about the data set or why it wasn’t shared. For those 24% that did give us a reason we see that “paper not published yet” was the primary reason. All of these folks expressed willingness to share after publication. So the sharing rate will increase dramatically once all these papers come out. 20% directed me to another person who did not respond at all. Only 10% told me they were too busy. Another 10% said the data or the samples got messed up in some way and were not usable. Interestingly, medical problems were also cited as a reason for not sharing. There were actually two contacts that died unexpectedly during this period. I also want to say, as an aside, that we know there are large data sets that are not available to us because of legal reasons. They were part of the Natural Resources Damage Assessment. These data sets were not included in any of these statistics. This does not add up to 100. The last 15% was a combination of random, one-off reasons or folks being hesitant and not really giving a reason.
  20. A big impediment to data sharing is the lack of incentives for data providers. Why should anyone share? Funding agencies and publishers may mandate sharing. Sharing can increase citation rates and visibility within the community. In our case, we gave early access to the contributors. By sharing, we can develop more comprehensive, large-scale datasets that will allow us to address new challenges. Understanding the fate and transport of oil in the Gulf of Mexico is one of those challenges. Having an integrated database of the sort that we are building is the first step in improving the response to, and remediation of, oil spills. The panel shows an example of the LTRANS output compared to the field data in the database for Naphthalene. The red circles indicate field data where the analyte was detected. The open circles indicate field data where a measurement was taken and the analyte was not detected. The blue points are where LTRANS predicts that the analyte should be. This is an example of what can be achieved when data are shared and integrated.
  21. We have accomplished a lot, but we still have much to do to fulfill our goals. There will be a lot of additional data released over the next year that will be added to the database. We will be giving web access to the contributors and plan to incorporate their feedback to improve usability before opening to a wider audience. We are currently drafting and refining a users’ guide. There will be manuscripts published on the process of gathering data that I just described and a more technical paper on the database itself.
  22. Here is a list of all the data providers. As you can see, we have many. It takes a village to build a database.
  23. With that, I can take questions now or if you want to speak in more detail about this project or The Data Detektiv you can find me during breaks and poster sessions at exhibitor booth #29.
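
The term and unit reconciliation described in notes 8 and 9 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the talk does not show the implementation, so the table contents, the function and field names, and the choice of µg/kg as the "preferred unit" are all hypothetical. A plain dictionary stands in for the Google Fusion Table. Volume-based units such as ppbv are deliberately left out because converting them needs more information than a single factor.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the lookup table: preferred name -> alternate names.
# The benzoic acid names come from slide 8; the structure is an assumption.
PREFERRED_TERMS = {
    "Benzoic Acid": ["Carboxybenzene", "E210", "Dracylic Acid", "C7H6O2"],
}

# Hypothetical conversion factors into an assumed preferred unit (ug/kg).
# "ug" is written in ASCII here in place of the micro sign.
TO_UG_PER_KG = {
    "ug/kg": 1.0,
    "ng/g": 1.0,      # 1 ng/g equals 1 ug/kg
    "ug/g": 1000.0,   # 1 ug/g equals 1000 ug/kg
    "mg/kg": 1000.0,  # 1 mg/kg equals 1000 ug/kg
}


@dataclass(frozen=True)
class Record:
    """One measurement as it appears in a provider's data set."""
    term: str
    value: float
    unit: str


def build_lookup(table):
    """Invert the table so any known name maps to its preferred name."""
    lookup = {}
    for preferred, alternates in table.items():
        for name in [preferred, *alternates]:
            lookup[name.strip().lower()] = preferred
    return lookup


def normalize(record, lookup, factors, preferred_unit="ug/kg"):
    """Tag a record with preferred term and unit, keeping the original fields."""
    preferred = lookup.get(record.term.strip().lower())
    factor = factors.get(record.unit.strip())
    return {
        "original_term": record.term,
        "original_value": record.value,
        "original_unit": record.unit,
        # None means the term or unit is not in the table yet.
        "preferred_term": preferred,
        "preferred_value": None if factor is None else record.value * factor,
        "preferred_unit": None if factor is None else preferred_unit,
    }
```

Calling `normalize(Record("E210", 2.0, "mg/kg"), build_lookup(PREFERRED_TERMS), TO_UG_PER_KG)` would tag the entry as "Benzoic Acid" with a converted value in µg/kg while leaving the provider's original term, value and unit untouched, which mirrors the point in note 9 about normalizing without changing the original data set.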