Aleatha Parker-Wood*^, Brian A. Madden*, Michael McThrow*,
Darrell D.E. Long*, Ian F. Adams*, Avani Wildani*
*University of California Santa Cruz
^Conservatoire National des Arts et Métiers
Examining Extended and Scientific Metadata for Scalable Index Designs
What we call metadata
• Data for the system
• External to the file
• Small
• Dense
Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin,
"Operating System Concepts, Eighth Edition"
What everyone else calls metadata
• Data for the user
• Embedded in:
  • the file
  • the inode
  • a separate file
  • a notebook somewhere on their desk
• Wildly varying size
• Sparse
[Figure: where metadata lives: embedded metadata, metadata files, inode metadata, and metadata outside the system]
A scientist at work
• “Show me the data set about bears in Alaska from last fall”
• “Show me simulation results from last week for Vesuvius that used this code library, and where the pressure is higher than 500 kilopascals”
• A mix of system and scientific metadata
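The second query above can be sketched as a filter that mixes both kinds of metadata. This is only an illustration: the field names (`mtime`, `site`, `library`, `pressure_kpa`), the sample records, the fixed reference date, and the reading of "last week" as the previous seven days are all assumptions, not something taken from the slides.

```python
from datetime import datetime, timedelta

# Hypothetical records: each file carries system metadata (path, mtime)
# plus optional scientific metadata (site, library, pressure_kpa).
now = datetime(2024, 1, 15)  # fixed reference date so the example is repeatable
records = [
    {"path": "/sim/run1.h5", "mtime": now - timedelta(days=3),
     "site": "Vesuvius", "library": "libmagma", "pressure_kpa": 620.0},
    {"path": "/sim/run2.h5", "mtime": now - timedelta(days=2),
     "site": "Vesuvius", "library": "libmagma", "pressure_kpa": 410.0},
    {"path": "/sim/run3.h5", "mtime": now - timedelta(days=30),
     "site": "Vesuvius", "library": "libmagma", "pressure_kpa": 700.0},
    {"path": "/obs/bears.csv", "mtime": now - timedelta(days=1),
     "site": "Alaska"},  # no simulation fields at all: sparse metadata
]


def matches(rec, since):
    """True if the record satisfies every predicate of the example query."""
    return (
        rec["mtime"] >= since                                # system metadata
        and rec.get("site") == "Vesuvius"                    # scientific metadata
        and rec.get("library") == "libmagma"
        and rec.get("pressure_kpa", float("-inf")) > 500.0
    )


hits = [r["path"] for r in records if matches(r, now - timedelta(days=7))]
print(hits)  # ['/sim/run1.h5']
```

The list comprehension is a linear scan over every record, which is exactly the cost an index is meant to avoid at scale.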
Our options
• Relational databases
• Column stores
• Spatial trees (e.g., Spyglass, SmartStore)
• Inverted indexes
• Bitmap indexes (e.g., FastBit)
• The choice of index depends on the data, but what does the data look like?
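To make one of these options concrete, here is a minimal sketch of a bitmap index, using Python integers as bitmaps. It illustrates the general idea only and is not FastBit's implementation; production bitmap indexes typically compress their bitmaps. The function names and sample rows are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bitmap.

    Bit i is set when row i holds that value. Rows missing the column
    set no bit at all, so nulls take no space in this toy layout.
    """
    index = defaultdict(int)
    for i, row in enumerate(rows):
        if column in row:
            index[row[column]] |= 1 << i
    return dict(index)


def row_ids(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    ids, i = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(i)
        bitmap >>= 1
        i += 1
    return ids


rows = [
    {"discipline": "biology", "format": "XML"},
    {"discipline": "astronomy", "format": "CSV"},
    {"discipline": "climatology", "format": "CSV"},
    {"discipline": "oceanography"},  # format is null for this row
]
by_format = build_bitmap_index(rows, "format")
by_discipline = build_bitmap_index(rows, "discipline")

# A conjunctive query is a bitwise AND of two bitmaps.
both = by_format["CSV"] & by_discipline["astronomy"]
print(row_ids(by_format["CSV"]), row_ids(both))  # [1, 2] [1]
```

Each distinct value needs its own bitmap, so this layout suits low-cardinality columns best.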
Outline
• The data in brief
• Dimensionality
• Sparsity
• Atomicity
• Entropy
The metadata in brief
Dataset  Discipline    Native format  Record count  Subsampled?  Sample count  Total size
Dryad    Biology       XML            31K           No           31K           400 MB
WISE     Astronomy     CSV            564M          Yes          10K           1 TB
Argo     Oceanography  NetCDF         2B            Yes          635K          330 GB
ORNL     Climatology   CSV            1478          No           1478          154 KB
Dimensionality
            Dryad  WISE  Argo  ORNL  Total
Dimensions  44     285   108   14    451
• Much higher-dimensional than POSIX data
• Curse of dimensionality concerns
Sparsity
Sparse even within a discipline (extremely sparse across all disciplines)
• CDF of sparsity
• For a randomly chosen element from X% of columns, there is a Y% chance it will be null
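One way to compute such a curve is sketched below. The interpretation is an assumption: columns are sorted from least to most sparse, and for the first X% of columns, Y is the mean null fraction, which equals the chance that a uniformly chosen cell from those columns is null when all columns have the same number of rows. The sample rows and function names are hypothetical.

```python
def null_fractions(rows, columns):
    """Fraction of rows in which each column is missing or None."""
    n = len(rows)
    return {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}


def sparsity_curve(fractions):
    """Return (share of columns, chance of null) points, least sparse first.

    For the first k columns in sorted order, the second coordinate is the
    mean null fraction over those k columns.
    """
    ordered = sorted(fractions.values())
    points, running = [], 0.0
    for i, frac in enumerate(ordered, start=1):
        running += frac
        points.append((i / len(ordered), running / i))
    return points


rows = [
    {"title": "t1", "author": "a1", "species": "s1"},
    {"title": "t2", "author": "a2"},
    {"title": "t3", "depth_m": 10.0},
    {"title": "t4"},
]
columns = ["title", "author", "species", "depth_m"]
fractions = null_fractions(rows, columns)
points = sparsity_curve(fractions)
print(fractions)
print(points[-1])  # (1.0, 0.5): over all columns, half of the cells are null
```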
Atomicity (Dryad)
• How many times can a field be present for a single item?
• E.g., a single paper can have multiple authors
• Chart truncated to show detail. One study had 800 species!
Some disciplines have many field values per item. Others have range values (e.g., May-June 2010)
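The atomicity question can be answered with a simple histogram of how many values each item carries for a field. The sketch below assumes a multi-valued field is stored as a list; the record layout and names are hypothetical.

```python
from collections import Counter


def multiplicity_histogram(items, field):
    """Count how many items carry the field 0 times, 1 time, 2 times, ...

    A field may be absent, hold a single scalar, or hold a list of values.
    """
    hist = Counter()
    for item in items:
        value = item.get(field)
        if value is None:
            hist[0] += 1
        elif isinstance(value, (list, tuple)):
            hist[len(value)] += 1
        else:
            hist[1] += 1
    return dict(sorted(hist.items()))


items = [
    {"title": "Paper A", "author": ["Ng", "Ortiz", "Patel"]},
    {"title": "Paper B", "author": "Quinn"},
    {"title": "Paper C", "author": ["Rao", "Sato"], "species": ["Ursus arctos"]},
    {"title": "Paper D"},
]
print(multiplicity_histogram(items, "author"))   # {0: 1, 1: 1, 2: 1, 3: 1}
print(multiplicity_histogram(items, "species"))  # {0: 3, 1: 1}
```

Any key above 1 in the histogram means the field is non-atomic and cannot be stored as a single scalar per row.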
Entropy
• Row organization versus column organization
• How compressible is the data?
• How selective are queries?
• Plenty of compression available
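A minimal sketch of a per-column entropy measurement follows, using Shannon entropy of the empirical value distribution. The slides do not say exactly how entropy was measured, so treat this as one plausible approach; the sample columns are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


formats = ["CSV", "CSV", "CSV", "CSV"]  # constant column
flags = ["A", "B", "A", "B"]            # two equally likely values
ids = ["r1", "r2", "r3", "r4"]          # every value unique

print(shannon_entropy(formats))  # 0.0
print(shannon_entropy(flags))    # 1.0
print(shannon_entropy(ids))      # 2.0
```

A low-entropy column compresses well but filters poorly, since a point query on it matches many rows. A high-entropy column is the opposite.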
Bringing it all together
• Scientific data is:
  • Sparse
  • High-dimensional
  • Compressible
  • Non-atomic (one to many)
  • A mix of cardinal, ordinal, spatial, and binary data
• Query models:
  • Spatial
  • Range and point
  • Keyword
Comparing indexes
                     Column stores  Row stores    Spatial trees  Inverted indexes  HDF5  FastBit
High-dimensional     Yes            Yes           No             Yes               Yes   Yes
Sparse               Yes            Stores nulls  No             Yes               Yes   Stores nulls
Multiple values      Yes            Yes           No             List, not range   Yes   Yes
Non-numeric data     Yes            Yes           No             Yes               Yes   No
Range queries        Yes            Yes           Yes            No                Yes   Yes
Specialized indexes  Yes            Yes           No             No                No    No
High compression     Yes            No            No             Yes               No    Yes
Conclusions
• Currently popular approaches to file system indexing (spatial trees, RDBMS) are a poor match for scientific data
• Current approaches to scientific indexing are not a complete solution
• Column stores are a natural fit for scientific metadata and queries
• Specialized indexes based on inverted indexes, bitmaps, and spatial trees are appropriate for some data
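To illustrate why a column layout suits sparse, multi-valued metadata, here is a toy column store. It is a sketch under stated assumptions, not a design from the slides: the class name `SparseColumnStore`, its methods, and the sample records are all hypothetical.

```python
from collections import defaultdict


class SparseColumnStore:
    """Toy column store: one dict per column, holding only non-null cells."""

    def __init__(self):
        # column name -> {row id: list of values}
        self.columns = defaultdict(dict)

    def insert(self, row_id, record):
        """Store each field of the record in its own column."""
        for name, value in record.items():
            values = value if isinstance(value, list) else [value]
            self.columns[name][row_id] = values

    def range_query(self, name, low, high):
        """Row ids with any value of column `name` in [low, high]."""
        column = self.columns.get(name, {})
        return sorted(
            rid for rid, vals in column.items()
            if any(low <= v <= high for v in vals)
        )


store = SparseColumnStore()
store.insert(0, {"pressure_kpa": 620.0, "author": ["Ng", "Ortiz"]})
store.insert(1, {"pressure_kpa": [410.0, 530.0]})
store.insert(2, {"author": "Quinn"})

print(store.range_query("pressure_kpa", 500.0, 700.0))  # [0, 1]
print(store.range_query("depth_m", 0, 10))              # []
```

The range query reads only the `pressure_kpa` column, row 2 costs nothing there because its null is never stored, and row 1 shows a field holding more than one value.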
Questions?
Data types (raw and semantic)
                 Dryad  WISE  Argo  ORNL  Total
Raw types
  String         100%   4%    62%   29%   28%
  Numeric        0%     96%   38%   71%   72%
Semantic types
  Str/Num        96%    68%   77%   72%   73%
  Date           2%     4%    7%    7%    5%
  Spatial        2%     9%    2%    21%   7%
  Flagsets       0%     19%   14%   0%    15%
• Support for spatial search is useful
• Application hinting is needed for good search (is this a string, a location, or a flag set?)
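The need for hinting shows up as soon as one tries to guess a semantic type from the raw value alone. The heuristic below is a hypothetical sketch (the rules, their order, and the function name are assumptions); its last example shows where guessing breaks down.

```python
import re
from datetime import date

# Two decimal numbers separated by a comma, e.g. "61.2, -149.9".
_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def guess_type(text):
    """Guess a semantic type for a raw string value (heuristic only)."""
    text = text.strip()
    try:
        float(text)
        return "numeric"
    except ValueError:
        pass
    try:
        date.fromisoformat(text)
        return "date"
    except ValueError:
        pass
    # Treat a "lat, lon" pair within valid degree ranges as a location.
    m = _PAIR.fullmatch(text)
    if m and abs(float(m.group(1))) <= 90 and abs(float(m.group(2))) <= 180:
        return "spatial"
    return "string"


for raw in ["500.0", "2010-05-01", "61.2, -149.9", "Ursus arctos", "1,2"]:
    print(repr(raw), "->", guess_type(raw))
```

The value "1,2" is classified as spatial, yet it could just as well be a flag set or a list of two numbers. Only the application that wrote the value knows, which is the case for explicit hints.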
How can we support this?
• Search functionality which:
  • Supports these kinds of queries
  • Does not double the size of storage
  • Does not require a linear scan over petabytes of data
• The answers to queries are documents
• We rarely need an entire row
• Complex transactions and joins are less important
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 

Analyzing Extended and Scientific Metadata for Scalable Index Designs

  • 1. Examining Extended and Scientific Metadata for Scalable Index Designs
      Aleatha Parker-Wood*^, Brian A. Madden*, Michael McThrow*, Darrell D.E. Long*, Ian F. Adams*, Avani Wildani*
      *University of California Santa Cruz
      ^Conservatoire National des Arts et Métiers
  • 2. What we call metadata
      • Data for the system
      • External to the file
      • Small
      • Dense
      Abraham Silberschatz, Greg Gagne, and Peter Baer Galvin, "Operating System Concepts, Eighth Edition"
  • 3. What everyone else calls metadata
      • Data for the user
      • Embedded in:
          • the file
          • the inode
          • a separate file
          • a notebook somewhere on their desk
      • Wildly varying size
      • Sparse
      [Figure labels: embedded metadata, metadata files, metadata outside the system, inode metadata]
  • 4. A scientist at work
      • “Show me the data set about bears in Alaska from last fall”
      • “Show me simulation results from last week for Vesuvius which used this code library, and where the pressure is higher than 500 kiloPascals”
      • A mix of system and scientific metadata
  • 5. Our options
      • Relational databases
      • Column stores
      • Spatial trees (e.g., Spyglass, Smartstore)
      • Inverted indexes
      • Bitmap indexes (e.g., FastBit)
      • The choice of index depends on the data, but what does the data look like?
  • 6. Outline
      • The data in brief
      • Dimensionality
      • Sparsity
      • Atomicity
      • Entropy
  • 7. The metadata in brief
      Dataset  Discipline    Native format  Record count  Subsampled?  Sample count  Total size
      Dryad    Biology       XML            31K           No           31K           400 MB
      WISE     Astronomy     CSV            564M          Yes          10K           1 TB
      ARGO     Oceanography  NetCDF         2B            Yes          635K          330 GB
      ORNL     Climatology   CSV            1478          No           1478          154 KB
  • 8. Dimensionality
                  Dryad  WISE  Argo  ORNL  Total
      Dimensions  44     285   108   14    451
      • Much higher dimensional than POSIX data
      • Curse of dimensionality concerns
  • 9. Sparsity
      • Sparse even within a discipline (extremely sparse across all disciplines)
      • CDF of sparsity
      • For a randomly chosen element from X% of columns, there is a Y% chance it will be null
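The sparsity measure on this slide can be approximated with a small sketch. The code below is an illustration, not the authors' tooling: it assumes each record is a dictionary in which an absent key or a `None` value counts as null, and the record contents and field names are hypothetical.

```python
from bisect import bisect_right

def null_fractions(records, columns):
    """Fraction of records in which each column is absent or None."""
    total = len(records)
    return {
        col: sum(1 for r in records if r.get(col) is None) / total
        for col in columns
    }

def sparsity_cdf(fractions, threshold):
    """Share of columns whose null fraction is at most `threshold`."""
    values = sorted(fractions.values())
    return bisect_right(values, threshold) / len(values)

# Hypothetical records; field names are illustrative only.
records = [
    {"title": "a", "species": "Ursus arctos", "depth_m": None},
    {"title": "b"},
    {"title": "c", "depth_m": 12.5},
    {"title": "d", "species": None},
]
fractions = null_fractions(records, ["title", "species", "depth_m"])
share_mostly_filled = sparsity_cdf(fractions, 0.5)
```

Evaluating `sparsity_cdf` at a range of thresholds gives the points of a curve like the one the slide describes.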
  • 10. Atomicity (Dryad)
      • How many times can a field be present for a single item?
      • E.g., a single paper can have multiple authors
      • Truncated to show detail. One study had 800 species!
      • Some disciplines have many field values per item. Others have range values (e.g., May-June 2010)
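As a rough sketch of how atomicity could be measured, the function below counts how many values a field carries per item. It assumes multi-valued fields are stored as lists, tuples, or sets; the items and the `author` field are hypothetical examples, not Dryad data.

```python
from collections import Counter

def values_per_item(items, field):
    """Histogram mapping 'number of values of `field`' to 'number of items'."""
    counts = Counter()
    for item in items:
        value = item.get(field)
        if value is None:
            n = 0                      # field absent for this item
        elif isinstance(value, (list, tuple, set)):
            n = len(value)             # multi-valued field
        else:
            n = 1                      # single atomic value
        counts[n] += 1
    return dict(counts)

# Hypothetical items: one paper with three authors, one with one, and so on.
items = [
    {"author": ["A", "B", "C"]},
    {"author": "D"},
    {"author": ["E", "F"]},
    {},
]
hist = values_per_item(items, "author")
```

A field whose histogram has mass only at 0 and 1 is atomic; anything beyond that is the one-to-many case the slide points out.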
  • 11. Entropy
      • Row organization versus column
      • How compressible is the data?
      • How selective are queries?
      • Plenty of compression available
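One common way to quantify this is Shannon entropy per column; whether the authors used exactly this measure is an assumption. In the sketch below, low entropy suggests a column that compresses well, while high entropy suggests values that are selective in queries. The sample columns are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits per value, of a sequence of hashable values."""
    counts = Counter(values)
    total = len(values)
    # Sum of p * log2(1/p) over the distinct values, with p = c / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical columns: a repeated format tag and a unique identifier.
format_column = ["CSV", "CSV", "CSV", "CSV"]
id_column = ["a", "b", "c", "d"]

format_bits = shannon_entropy(format_column)  # one repeated value
id_bits = shannon_entropy(id_column)          # four equally likely values
```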
  • 12. Bringing it all together
      • Scientific data is:
          • Sparse
          • High-dimensional
          • Compressible
          • Non-atomic (one to many)
          • A mix of cardinal, ordinal, spatial, and binary data
      • Query models:
          • Spatial
          • Range and point
          • Keyword
  • 13. Comparing indexes
                           Column stores  Row stores    Spatial trees  Inverted indexes  HDF5  FastBit
      High dimensional     Yes            Yes           No             Yes               Yes   Yes
      Sparse               Yes            Stores nulls  No             Yes               Yes   Stores nulls
      Multiple values      Yes            Yes           No             List, not range   Yes   Yes
      Non-numeric data     Yes            Yes           No             Yes               Yes   No
      Range queries        Yes            Yes           Yes            No                Yes   Yes
      Specialized indexes  Yes            Yes           No             No                No    No
      High compression     Yes            No            No             Yes               No    Yes
  • 14. Conclusions
      • Currently popular approaches to file system indexing (spatial trees, RDBMS) are a poor match for scientific data
      • Current approaches to scientific indexing are not a complete solution
      • Column stores are a natural fit for scientific metadata and queries
      • Specialized indexes based on inverted indexes, bitmaps, and spatial trees are appropriate for some data
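To make the column-store argument concrete, here is a toy sketch of a column-oriented layout that stores no nulls and allows several values per field. It is not any real column store: the class `SparseColumnStore`, its methods, and the sample records are all hypothetical, and each query scans a single column rather than using a real index.

```python
from collections import defaultdict

class SparseColumnStore:
    """Toy column-oriented metadata store; nulls are simply not stored."""

    def __init__(self):
        # column name -> {document id: list of values}
        self.columns = defaultdict(dict)

    def add(self, doc_id, metadata):
        for field, value in metadata.items():
            if value is None:
                continue  # a null takes no space in any column
            values = value if isinstance(value, list) else [value]
            self.columns[field][doc_id] = values

    def point(self, field, wanted):
        """Documents where `field` contains the value `wanted`."""
        column = self.columns.get(field, {})
        return {d for d, vals in column.items() if wanted in vals}

    def between(self, field, low, high):
        """Documents with some value of `field` in [low, high] (inclusive)."""
        column = self.columns.get(field, {})
        return {d for d, vals in column.items()
                if any(low <= v <= high for v in vals)}

# Hypothetical simulation runs, loosely modelled on the Vesuvius query.
store = SparseColumnStore()
store.add("run1", {"volcano": "Vesuvius", "pressure_kpa": 620.0, "library": ["libA"]})
store.add("run2", {"volcano": "Vesuvius", "pressure_kpa": 410.0})
store.add("run3", {"volcano": "Etna", "pressure_kpa": 700.0})

hits = (store.point("volcano", "Vesuvius")
        & store.between("pressure_kpa", 500.0, float("inf")))
```

The query touches only the two columns it names and returns document identifiers, never whole rows.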
  • 16. Data types (raw and semantic)
                Dryad  WISE  Argo  ORNL  Total
      String    100%   4%    62%   29%   28%
      Numeric   0%     96%   38%   71%   72%
      Str/Num   96%    68%   77%   72%   73%
      Date      2%     4%    7%    7%    5%
      Spatial   2%     9%    2%    21%   7%
      Flagsets  0%     19%   14%   0%    15%
      • Support for spatial search is useful
      • Application hinting is needed for good search (is this a string, a location, or a flag set?)
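The need for application hinting can be illustrated with a minimal sketch. Parsing alone only recovers the raw type (string or numeric); the semantic type has to come from somewhere else. The `hints` mapping, the field names, and the type labels below are assumptions made for illustration.

```python
def raw_type(value):
    """Classify a raw metadata value as 'numeric' or 'string' by parsing it."""
    try:
        float(value)
        return "numeric"
    except (TypeError, ValueError):
        return "string"

def semantic_type(field, value, hints):
    """Prefer an application-supplied hint; fall back to the raw type."""
    return hints.get(field, raw_type(value))

# Hypothetical hints an application might supply alongside its metadata.
hints = {"latitude": "spatial", "qc_flags": "flagset"}

lat_type = semantic_type("latitude", "61.2", hints)
flag_type = semantic_type("qc_flags", "1101", hints)
depth_type = semantic_type("depth", "61.2", hints)
name_type = semantic_type("species", "Ursus arctos", hints)
```

Without the hints, `"61.2"` and `"1101"` are both just numbers, which is the ambiguity the slide raises.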
  • 17. How can we support this?
      • Search functionality which:
          • Supports these kinds of queries
          • Does not double the size of storage
          • Does not require a linear scan over petabytes of data
      • The answers to queries are documents
      • We rarely need an entire row
      • Complex transactions and joins are less important