Metadata lecture(9 17-14)


  1. METADATA PERSPECTIVES FROM THE WEB AND DATABASE SYSTEMS
     Matthew Brush, Ontology Development Group, OHSU Library, DMICE
     Sept 17, 2014, brushm@ohsu.edu
  2. METADATA
     - “Data about Data”
     - “Data” broadly covers any information resource:
       - digital or physical
       - narrative, multimedia, structured
       - raw data, processed data, aggregates of datasets, or discrete elements within data sets
     - More formally, “Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource” (NISO (2004) Understanding Metadata. Bethesda, NISO Press)
  3. METADATA SERVES MANY PURPOSES . . .
     - Descriptive metadata: supports discovery and identification
       - e.g. title, author, identifiers, subjects, keywords
     - Structural metadata: describes how the components of a resource are organized
       - e.g. table of contents for a book, schema of database tables, manifest of files in an aggregate ‘research object’
     - Administrative metadata: helps manage the resource
       - Technical: describes technical aspects of a resource
         - e.g. file type, version information, how/when created
       - Rights management: explains intellectual property rights
         - e.g. licensing, use restrictions, privacy concerns
       - Preservation: supports maintenance and archiving of a resource
         - e.g. provenance/ownership, history of use, authenticity
     http://www.niso.org/publications/press/UnderstandingMetadata.pdf
  4. . . . AND OPERATES IN MANY CONTEXTS
     Metadata comes in many forms, serves many needs, and operates in very diverse settings.
     - I. Resource metadata (on the web)
       - Target: information resources as a whole
       - Primary goals: resource discovery and use
       - Form: structured, separate records
       - Users: everyone
       - Standards: many metadata frameworks/vocabularies
     - II. Metadata in database systems
       - Target: structured data and data elements
       - Primary goals: data consistency, aggregation, analysis
       - Form: ER diagrams, summary tables, data dictionaries
       - Users: professional data administrators and scientists
       - Standards: metadata and CDE registries
  5. OUTLINE
     I. Resource Metadata (on the web)
        A. Overview
        B. Examples
        C. Metadata Frameworks
           i. Schema
           ii. Vocabularies
           iii. Conceptual Models
           iv. Practical Specifications
           v. Encoding Specifications
        D. Metadata Storage and Retrieval
     II. Metadata in Database Systems
        A. Overview
        B. Data Elements
        C. Data Dictionaries
        D. Common Data Elements (CDEs)
        E. CDE Registries
  6. I. RESOURCE METADATA (ON THE WEB)
     - Metadata in the world that all of us have used and created in work and life
     - Attached to information resources we find on the web
       - books, videos, images, websites, datasets, . . .
     - Helps us to find a resource and understand what it is and how to use it
  7. Book Catalog Record (annotated regions: Descriptive, Structural, Administrative)
     http://ohsulibrary.worldcat.org/title/metadata/oclc/225088362
  8. Digital Photograph Library (annotated regions: Descriptive, Structural, Administrative)
     http://crdl.usg.edu/cgi/crdl?query=id:highlander_highlanderphotos_p2-wi3-3
  9. Research Data Sets and Files (datadryad.org) (annotated regions: Data Set Description, Data File Description)
     http://datadryad.org/resource/doi:10.5061/dryad.4ms68
  10. STANDARDS ARE KEY
      - Resource metadata is increasingly structured according to established schemas and standards
      - Many standards exist that vary in their:
        - complexity (schemas, specifications, conceptual models)
        - targets (music, video, images, books, art, datasets)
        - goals (descriptive, administrative and preservation)
        - communities served (libraries, museums, research)
      - Benefits:
        - leverage existing resources
        - vetted by community
        - interoperability and integration
  11. METADATA FRAMEWORKS
      Normative standards for metadata are captured in metadata frameworks. There are five possible components of a metadata framework:
      A. Schema
      B. Vocabularies
      C. Conceptual Model
      D. Practical Specifications
      E. Encoding Specifications
  12. A. METADATA SCHEMA
      - Core of any framework: specifies the categories of information recorded
      - Composed of a set of data elements along with descriptions of their attributes and rules for use
        - the attributes described should minimally include an identifier and/or name and a definition of each element
      - Can also specify data types and ‘value domains’ that describe allowable values for a given element
        - e.g. term lists, controlled vocabularies (CVs), ontologies
      - Example schemas: Dublin Core, LOM, HCLS Dataset Std.
  13. EXAMPLE 1: DUBLIN CORE METADATA INITIATIVE (DCMI)
      - First effort at standardizing metadata to improve resource discovery on the web
      - Very simple core schema consisting of 15 general data elements representing properties of an information resource, with no value restrictions
      - Data Elements: title, identifier, type, description, creator, contributor, date, subject, format, language, source, publisher, relation, coverage, rights
      - Element Attributes: URI, label, definition, domain, range, version, comment
      http://dublincore.org/documents/dcmi-terms/
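The element list on the Dublin Core slide can be read as a very small schema. As a minimal sketch (the record and the `unknown_elements` helper are hypothetical illustrations, not part of any DCMI tooling), a record can be checked against the 15 element names like this:

```python
# The 15 element names listed on the slide.
DC_ELEMENTS = frozenset({
    "title", "identifier", "type", "description", "creator",
    "contributor", "date", "subject", "format", "language",
    "source", "publisher", "relation", "coverage", "rights",
})


def unknown_elements(record: dict) -> list:
    """Return the keys of `record` that are not in the element set, sorted."""
    return sorted(key for key in record if key not in DC_ELEMENTS)


# Hypothetical record. Values are free text because the core schema
# places no restrictions on them.
record = {
    "title": "Metadata lecture",
    "creator": "Matthew Brush",
    "date": "2014-09-17",
    "pages": "49",  # deliberately not one of the 15 elements
}

print(unknown_elements(record))
```

Only the element names are checked here; since the core schema has no value restrictions, there is nothing to validate on the values themselves.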
  14. EXAMPLE 2: LEARNING OBJECT METADATA (IEEE-LOM)
      - Extensive set of metadata elements describing ‘learning objects’
        - “Any digital or non-digital entity that may be used for learning, education, or training”
      - Based loosely on DCMI schema, but:
        - >50 new elements to describe educational attributes of learning objects
        - organizes elements into a hierarchical structure
        - provides detailed specifications for allowable values
        - supports ‘application profiles’ that extend the model for specific domains
      http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
  16. LOM ELEMENT HIERARCHY
      - The LOM base schema defines 9 categories of metadata elements
      - Hierarchical structure supports user understanding, metadata organization and aggregation for analysis
  17. EXAMPLE 3: W3C HCLS DATASET DESCRIPTION STANDARD
      - A unified schema that provides all key metadata fields needed to comprehensively describe research datasets
        - what they are, how they are produced, where they are found
        - meets a pressing need in the current research climate to support sharing, discovery, and re-use of public datasets in a standardized way
      - Metadata elements describe general features, identifiers, provenance and change, availability and distribution, and dataset statistics
      - Composed entirely of elements (properties) from existing community vocabularies, e.g. DCMI, DCAT, PROV, VOID, FOAF
        - attributes and rules for element use are defined in the source schema
      http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
  18. B. VOCABULARIES
      - Set of terms (often structured) that is used to constrain entry of metadata values
      - Vocabularies represent general concepts
        - Word or code lists
        - Hierarchical classifications
          - taxonomies, thesauri, ontologies
          - e.g. ICD9, SNOMED, MeSH, NCI Thesaurus
      - Authority lists provide controlled names for proper nouns
        - FundRef (organizations)
        - Global Gazetteer (places)
        - ORCID (people)
  19. ORCID AS AN AUTHORITY LIST
      - Open Researcher and Contributor IDentifier (ORCID)
        - a nonproprietary alphanumeric code that uniquely identifies scientific and other academic authors (a persistent “author DOI” for researchers)
      - The ORCID identifier set is coming to serve as a de facto authority list to record persons contributing to scholarly research products
      - ORCIDs facilitate efforts to track productivity, impact, and attribution based on all scholarly outputs (publications, grants, datasets, protocols, presentations, abstracts, code, blogs, etc.)
      - Services can aggregate scholarly outputs for a given researcher
        - an ORCID resolves to a “CV” listing all scholarly contributions linked across various venues (e.g. Pubmed, Scopus, Slideshare, Figshare, Github, Dryad, . . .)
  20. C. CONCEPTUAL MODELS
      - An underlying model that describes how all the information and concepts inherent in a resource are related to one another
      - Metadata Models
        - conceptualize the metadata schema itself (hierarchical relationships or other mappings between elements)
      - Domain Models
        - conceptualize the domain in which the metadata schema operates (classes of things that are annotated and the relationships between them)
  21. EXAMPLE METADATA MODEL: LOM ELEMENT HIERARCHY
      The structure of the LOM is an example of a simple conceptual metadata model, which organizes elements into disjoint hierarchies.
  22. EXAMPLE DOMAIN MODEL: HCLS DATASET ‘LEVELS’
      - The summary level describes the dataset in general
      - The version level describes a specific version
      - The distribution level describes a representation of a version
      The model supports recommendations for how each level should be described using the standard.
  23. D/E. SPECIFICATIONS
      - D. Practical specifications for use
        - provide guidance for how to apply metadata under a given schema
        - e.g. the HCLS model provides recommendations on when and how to apply certain elements to types of targets in the domain
      - E. Encoding specifications for presentation & exchange
        - rules for binding metadata to syntactic formats such as XML or RDF
        - e.g. LOM has a precise specification for binding to XML or RDF
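To make "binding metadata to a syntactic format" concrete, here is a minimal sketch that serializes a small record to XML with Python's standard library. It uses Dublin Core element names rather than the LOM binding mentioned on the slide, and the `metadata` wrapper element and the `to_xml` helper are assumptions chosen for illustration, not part of any published encoding specification.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_xml(record: dict) -> str:
    """Serialize element/value pairs as namespaced child elements."""
    ET.register_namespace("dc", DC_NS)  # emit the conventional "dc:" prefix
    root = ET.Element("metadata")       # hypothetical wrapper element
    for name, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml({"title": "Metadata lecture", "creator": "Matthew Brush"})
print(xml_text)
```

A real encoding specification would also fix details this sketch leaves open, such as the root element, element ordering, and how repeated or nested elements are represented.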
  24. STORING AND ACCESSING RESOURCE METADATA
      - Typically lives separately from annotated resources, in databases and/or XML files
      - Can also be stored within a resource (e.g. photo metadata embedded in the image file itself)
      - An increasing number of resource catalogs and repositories on the web provide access to metadata and often the resource itself
        - we have seen examples for books, images, and datasets
      - These repositories are indexed by search tools and provide programmatic interfaces to allow for resource discovery and re-use
  25. II. METADATA IN DATABASE SYSTEMS
      - Serves the same basic needs, but differs in scale and target of annotation, user base, and primary use cases
      - Two main categories:
        1. Structural metadata
           - describes the structure of database objects and the relationships between them
           - commonly encoded externally as ER diagrams, or internally as summary tables
      Figure: Example ER diagram for VA autopsy data
      http://www.visn20.med.va.gov/VISN20/V20/DataWarehouse/Images/LabAutopsy.jpg
  26. II. METADATA IN DATABASE SYSTEMS
      - Serves the same basic needs, but differs in scale and target of annotation, user base, and primary use cases
      - Two main categories:
        2. Content metadata
           - describes the meaning of data at a very fine granularity
           - specifies attributes of data elements, and rules for recording their values
           - encoded internally or externally as ‘data dictionaries’
      Figure: Example of a data set that needs a dictionary to interpret
  27. DATA ELEMENTS
      - The notion of a ‘data element’ obtains a more precise meaning and specification in the context of a database
        - elements can be specified at finer granularity in a database holding structured data in a controlled operational system
      - Conceptually, a data element is composed of a concept and a value domain
        - concept = the subject of the data recorded for a given element
        - value domain = the defined value set for how that data is recorded
      - Example: PT_ETHNIC
        - concept = patient ethnicity
        - value domain = [ E1 (caucasian), E2 (hispanic/latino), E3 (african), E4 (asian), E5 (mixed) ]
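The concept/value-domain split can be sketched as a small data structure. The `DataElement` class and its method names below are hypothetical; only the PT_ETHNIC name, concept, and codes come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """A data element = a concept plus a value domain (code -> label)."""
    name: str
    concept: str
    value_domain: dict

    def is_valid(self, value: str) -> bool:
        """True if `value` is one of the codes in the value domain."""
        return value in self.value_domain

    def label(self, value: str) -> str:
        """Human-readable label for a stored code."""
        return self.value_domain[value]


PT_ETHNIC = DataElement(
    name="PT_ETHNIC",
    concept="patient ethnicity",
    value_domain={
        "E1": "caucasian",
        "E2": "hispanic/latino",
        "E3": "african",
        "E4": "asian",
        "E5": "mixed",
    },
)

print(PT_ETHNIC.is_valid("E2"), PT_ETHNIC.label("E4"))
```

Keeping the concept separate from the value domain means the same concept could later be paired with a different representation (for example full labels instead of codes) without redefining what the element is about.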
  28. DATA DICTIONARIES
      - Provide detailed metadata about data elements
        - element identifiers and name(s)
        - definitions and descriptions
        - value constraints
          - data type
          - default value
          - length
          - allowable values
          - value frequency (mandatory or not)
        - provenance and tracking
          - version number, entry and termination dates
          - indicate source table(s)
        - mappings to elements in other schema dictionaries
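The value constraints listed above (data type, length, allowable values, mandatory or not) are enough to drive a simple validator. The following is a sketch under assumptions: the dictionary layout, the `PT_ID` element, and the `validate` helper are invented for illustration and do not follow any particular data dictionary standard.

```python
# Hypothetical data dictionary: one entry per data element.
DICTIONARY = {
    "PT_ID": {
        "type": str, "max_length": 8, "mandatory": True, "allowed": None,
    },
    "PT_ETHNIC": {
        "type": str, "max_length": 2, "mandatory": False,
        "allowed": {"E1", "E2", "E3", "E4", "E5"},
    },
}


def validate(row: dict) -> list:
    """Return a list of constraint violations for one data row."""
    problems = []
    for name, spec in DICTIONARY.items():
        value = row.get(name)
        if value is None:
            if spec["mandatory"]:
                problems.append(f"{name}: missing mandatory value")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if len(value) > spec["max_length"]:
            problems.append(f"{name}: longer than {spec['max_length']}")
        if spec["allowed"] is not None and value not in spec["allowed"]:
            problems.append(f"{name}: value not in allowable set")
    return problems


print(validate({"PT_ID": "A001", "PT_ETHNIC": "E2"}))
print(validate({"PT_ETHNIC": "E9"}))
```

Provenance attributes such as version numbers and entry dates would be additional keys on each entry; they are omitted here because they do not affect validation.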
  29. DATA DICTIONARIES
      Figure: Simple example of a data dictionary
      http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf
  30. DATA DICTIONARIES
      - Key Functions
        - unambiguous and shared understanding of the data by all users (administrators, analysts, and clients)
        - consistent data representation and manipulation (addition, extraction, aggregation, and transformation)
        - maintenance of the data model
        - data integration, exchange, and re-use
      - Encoding
        - as an external document and/or represented as a table in the database itself
  31. DATA DICTIONARY BEST PRACTICES
      1. Clear and thorough element definitions and value set explanations are key
      2. Give persistent identifiers to data elements
      3. Map data elements to community standards where possible
         - common data elements (CDEs)
      4. Specify value sets in terms of open controlled vocabularies (CVs) where possible
      5. Provide notes and guidelines for context of use
      6. Make the dictionary easily accessible to all users
  32. DATABASE METADATA INTEROPERABILITY
      - As research moves toward ‘big data’, information from diverse sources is being shared and aggregated for analysis
      - A major challenge for managing this data is the diversity of ways that a given idea can be described in data elements
        - Sex/gender definitions can be based on genetics, phenotype, or self-identification. Values can be recorded as local codes, abbreviations, full labels, or community vocabularies.
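The practical consequence of this diversity is that aggregation needs an explicit mapping step per source. A minimal sketch, assuming two hypothetical sites (`site_a`, `site_b`) whose local encodings are invented for illustration:

```python
# Hypothetical per-source mappings from local values to shared labels.
SOURCE_MAPPINGS = {
    "site_a": {"0": "unknown", "1": "female", "2": "male"},  # local numeric codes
    "site_b": {"F": "female", "M": "male", "U": "unknown"},  # abbreviations
}


def harmonize(source: str, value: str) -> str:
    """Translate a source-specific value into the shared label."""
    mapping = SOURCE_MAPPINGS[source]
    if value not in mapping:
        # Fail loudly rather than guess: silent defaults corrupt aggregates.
        raise ValueError(f"{source}: unmapped value {value!r}")
    return mapping[value]


records = [("site_a", "1"), ("site_b", "F"), ("site_b", "M")]
harmonized = [harmonize(source, value) for source, value in records]
print(harmonized)
```

Note that a mapping like this only reconciles the recorded values. It cannot reconcile differing definitions (genetic, phenotypic, self-identified), which is why shared element definitions matter as well.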
  33. COMMON DATA ELEMENTS (CDEs)
      - The Common Data Element (CDE) movement aims to address this problem by providing standardized data elements that can be re-used across medical datasets
      - CDEs are
        - owned, managed, & curated by a single authority (NINDS, NCI)
        - stored and managed in large repositories called CDE registries
        - available for diverse areas of clinical practice and research, and at very fine granularity
          - larger repositories hold up to 50,000 elements
      - CDEs serve as a foundation for interoperability across data systems
  34. CDE REGISTRIES
      - Metadata registries that collect common data elements for a defined domain
      - Resemble large-scale data dictionaries, but with key differences:
        - Exposed in searchable public repositories with additional services to promote extraction and re-use
        - Coverage is wider as they are used across different domains and systems
        - Metadata element descriptions are far richer to support discovery, provenance, versioning, mappings, meta-modeling
      - The NIH maintains a portal to information about existing CDE initiatives, registries, and tools (http://www.nlm.nih.gov/cde/)
  35. NINDS CDE REGISTRY
      - Houses >20,000 CDEs
      - “Core” element set covers general concepts in the medical domain
        - patient demographics, medical history, assessment & examinations, treatments & interventions, outcomes, and study protocol
      - “Supplementary” sets cover specific diseases/research areas
        - spinal injury, brain injury, epilepsy and stroke, Parkinson’s disease, ALS
      - Metadata schema captures 30 element attributes
        - this expanded set of attributes supports use cases of enabling discovery and community re-use across different implementations
      - Portal has search functionality and support for generating clinical forms (CRFs) with CDE mappings embedded in collected data
      http://www.commondataelements.ninds.nih.gov/
  36. NCI caDSR CDE REGISTRY
      - The National Cancer Institute cancer Data Standards Registry (caDSR) is the largest and most widely used CDE registry
        - >50,000 total elements
      - Integrates CDEs from several initiatives under a unified model and technical infrastructure
      - Broad and deep coverage to fine granularity (as with NINDS)
      - Metadata model is VERY complex
        - captures >100 distinct attributes describing each data element in the registry (vs 30 for NINDS)
        - implements a complex conceptual model based on the ISO/IEC 11179 metadata registry standard
        - decomposes data elements into component parts that are mapped to NCI Thesaurus terms (formal encoding of semantics)
      https://cdebrowser.nci.nih.gov/CDEBrowser/
  37. caDSR CONCEPTUAL MODEL
      1. To understand the data element table and explain why it is so expansive
      2. Follows a standard for database metadata registries called ISO 11179
         - commonly implemented in other efforts you may encounter
         - e.g. the Clinical Data Interchange Standards Consortium (CDISC), which has goals similar to those of the caDSR but across a broader domain
      3. Is the basis for semantic mappings to ontologies such as the NCI Thesaurus, which are an important feature of the model
  38. caDSR CONCEPTUAL MODEL
      Diagram: a Data Element has a Concept (with a Class and a Property) and a Value Domain (with a Value Representation and Valid Values)
  39. CONCEPT ELEMENT MAPPING
      Data Element PT_GENDER_CODE: Concept ‘patient gender’, Class ‘person’, Property ‘gender’
      - Concept = idea represented by the data element, described independently of a particular representation
      - Class = a set of real world objects with shared characteristics
      - Property = a characteristic common to all members of a class
  40. CONCEPT ELEMENT MAPPING
      Data Element PT_GENDER_CODE: Concept ‘patient gender’, Class ‘person’ (C25190), Property ‘gender’ (C17357)
      Class and property concepts are mapped to NCI taxonomy terms to formally encode their semantics.
      - Class Mapping: person = C25190
      - Property Mapping: gender = C17357
  41. VALUE DOMAIN MAPPING
      Data Element PT_GENDER_CODE: Value Domain with Value Rep. ‘person’, ‘gender’, ‘code’ and Valid Values ‘0’, ‘1’, ‘2’, ‘9’
      - Value domain = a set of attributes describing representational characteristics of instance data
      - Value Representation = type of value the data represents (along different dimensions)
      - Valid Values = the actual allowed values for a given value domain
  42. VALUE DOMAIN MAPPING
      Data Element PT_GENDER_CODE: Value Domain with Value Rep. ‘person’, ‘gender’, ‘code’ and Valid Values ‘0’, ‘1’, ‘2’, ‘9’
      - Value Representation Mappings
        - ‘person’ = C25190
        - ‘gender’ = C17357
        - ‘code’ = C25162
      - Valid Value Mappings
        - 0 = unknown (C17998)
        - 1 = female gender (C46110)
        - 2 = male gender (C46109)
        - 9 = unspecified (n/a)
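The valid value mappings on this slide amount to a lookup from a stored code to a label plus a thesaurus concept code. As a sketch (the table layout and the `annotate` helper are hypothetical; the codes and labels are those shown on the slide):

```python
# Valid values for PT_GENDER_CODE as listed on the slide:
# stored code -> (label, NCI Thesaurus concept code or None).
VALID_VALUES = {
    "0": ("unknown", "C17998"),
    "1": ("female gender", "C46110"),
    "2": ("male gender", "C46109"),
    "9": ("unspecified", None),  # the slide gives "n/a" for this mapping
}


def annotate(code: str) -> tuple:
    """Return (label, concept_code) for a stored PT_GENDER_CODE value."""
    if code not in VALID_VALUES:
        raise ValueError(f"{code!r} is not a valid value for PT_GENDER_CODE")
    return VALID_VALUES[code]


label, concept_code = annotate("1")
print(label, concept_code)
```

Because the concept codes are shared identifiers, two datasets that use different local codes but map to the same concept code can be aggregated without comparing labels as text.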
  43. “SEMANTICALLY UNAMBIGUOUS INTEROPERABILITY”
      Diagram: Concept (Class, Property) and Value Domain (Value Representation, Valid Values)
      - Semantic mappings of these four elements can support more sophisticated search and analysis
      - Computational tools can leverage logic in the NCI hierarchy for query expansion and data aggregation
  44. LEVERAGING SEMANTICS
      - The structure of the NCI taxonomy supports synonym and hierarchical query expansion
      - User searches ‘cancer biology’ to view all CDEs related to this concept
      - The query is expanded (1) to include any children of this term in the taxonomy, and (2) to include elements with text matching any synonym of cancer in the taxonomy
      Figure: NCI Thesaurus ‘Cancer Biology’ branch
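The two expansion steps can be sketched over a toy taxonomy. Everything in `CHILDREN` and `SYNONYMS` below is invented for illustration and is not actual NCI Thesaurus content; `expand_query` is likewise a hypothetical helper, not a caDSR API.

```python
# Toy taxonomy: parent term -> child terms (illustrative only).
CHILDREN = {
    "cancer biology": ["metastasis", "tumor growth"],
    "tumor growth": ["angiogenesis"],
}

# Toy synonym table: term -> alternative labels (illustrative only).
SYNONYMS = {
    "cancer biology": ["neoplasm biology"],
    "metastasis": ["metastatic spread"],
}


def expand_query(term: str) -> list:
    """Expand a term with its descendants and with their synonyms."""
    terms = set()
    stack = [term]
    while stack:  # (1) hierarchical expansion: walk all descendants
        current = stack.pop()
        if current in terms:
            continue
        terms.add(current)
        stack.extend(CHILDREN.get(current, []))
    for found in list(terms):  # (2) synonym expansion
        terms.update(SYNONYMS.get(found, []))
    return sorted(terms)


print(expand_query("cancer biology"))
```

A search tool would then match CDEs against every term in the expanded list instead of only the literal query string.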
  45. CDEs IN PRACTICE
      Strategies:
      - Map elements in local data dictionaries to CDEs
        - Parkinson’s Disease Biomarkers Program (PDBP) data dictionary
        - NINDS registry form builder
      - Build libraries of re-usable, pre-fabricated forms with embedded CDE metadata
        - NINDS Case Report Form (CRF) library
        - medical-data-models.org forms
      - Initialize software with CDEs so that electronic forms automatically carry mappings when they are generated
        - caDSR registry and CDISC tools
  46. CDE SUMMARY
      - CDEs standardize data elements for use across multiple systems
      - Available in registries that vary in size and complexity
        - some resemble simple data dictionaries with expanded attributes to support discovery and provenance (NINDS)
        - some are implemented with complex conceptual models and semantic mappings (caDSR)
      - Tools and standards supporting practical application exist but are not yet state of the art
      - Worlds collide: the intersection of metadata for web resources and database systems
        - CDEs represent discoverable web resources that are used in the context of data collection and description in database systems
        - Each registry defines a metadata framework/schema for a given domain
  47. BENEFITS OF CDEs
      - Promote standardized and systematic data collection
      - Improve data quality and consistency
      - Facilitate data sharing and integration
      - Reduce the cost and time needed to develop data collection tools
      - Improve opportunities for meta-analysis comparing results across studies
      - Increase the availability of data for the planning and design of new trials
  48. CHALLENGES FOR DATA INTEGRATION AND ANALYSIS
      - Data elements across efforts are not well aligned
      - Tooling support for discovery & application is immature
      - Limited use of community taxonomy and ontology mappings
      - Navigating complexity and redundancy . . .
        - . . . of medical data itself
          - many ways to calculate and represent simple and complex measures such as tumor burden or medical prognosis
        - . . . of metadata elements/schemas
          - thousands of elements with very nuanced meaning and use
          - redundant representation poses challenges for data collection, aggregation, and integrated analysis (even for simple measures)
  49. LINKS
      - Schema Examples:
        - DCMI: http://dublincore.org/documents/dcmi-terms/
        - IEEE-LOM: http://www.imsglobal.org/metadata/mdv1p3/imsmd_bestv1p3.html
        - HCLS Dataset description standard: http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/
      - Data Dictionary Example:
        - http://library.ahima.org/xpedio/groups/public/documents/ahima/bok1_048618.pdf
      - CDE Sites:
        - NIH CDE Portal: http://www.nlm.nih.gov/cde/
        - NINDS CDE Registry: http://www.commondataelements.ninds.nih.gov/
        - caDSR browser: https://cdebrowser.nci.nih.gov/CDEBrowser/
        - caDSR tools: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models
      - CDEs in Practice:
        - PDBP gender data dictionary entry: https://dictionary.pdbp.ninds.nih.gov/portal/publicData/dataElementAction!view.action?dataElementId=5585
        - NINDS form builder: http://www.commondataelements.ninds.nih.gov/CRF.aspx?source=formBuilder
        - Downloadable forms (CRFs) from NINDS with embedded CDE links: http://www.commondataelements.ninds.nih.gov/CRF.aspx
        - medical-data-models.org forms: https://medical-data-models.org/forms/1049
        - Suite of tools on caDSR site: http://cbiit.nci.nih.gov/ncip/biomedical-informatics-resources/interoperability-and-semantics/metadata-and-models