1. Provenance of Microarray Experiments for a Better Understanding of Experiment Results
Helena F. Deus, University of Texas
Jun Zhao, University of Oxford
Satya Sahoo, Wright State University
Matthias Samwald, DERI, Galway
Eric Prud’hommeaux, W3C
Michael Miller, Tantric Designs
M. Scott Marshall, Leiden University Medical Center
Kei-Hoi Cheung, Yale University
2. Outline
Background: microarrays, gene expression and why provenance is important for experimental biomedical data
Objectives
Data: Microarray workflow and gene results
The provenance model
Demo
Future work
Summary
3. Introduction
High-throughput experiments, such as microarray technologies, have revolutionized the way we study disease and basic biology.
Microarray experiments allow scientists to quantify thousands of genomic features in a single experiment
Source: http://www.scq.ubc.ca/
Affymetrix microarray gene chips
Genes can be used as biomarkers for disease
4. Introduction
Since 1997, the number of published results based on an analysis of gene expression microarray data has grown from 30 to over 5,000 publications per year
Microarray data repositories and standards exist, but provenance and interoperable data access are lacking
Source: YJBM (2007) 80(4):165-78
5. Introduction Cont.
A pilot study of the W3C HCLS BioRDF task force
Bottom-up approach
Use microarray experiments for Alzheimer’s disease as the test-bed
Aggregate results across microarray experiments
Combine different types of data
6. Objectives
To facilitate a better understanding of microarray gene results
Efficiently query gene results
Efficiently combine existing life science datasets
To transform microarray gene results into a Semantic Web format (RDF)
To encode provenance information about these gene results in the
same format as the data itself
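The objective of encoding gene results and their provenance in the same format might look along these lines in RDF/Turtle. This is a sketch only: the biordf namespace and biordf:derives_from_region come from the talk, but the gene-result property names (gene_symbol, fold_change, p_value) and all example URIs are illustrative assumptions.

```turtle
@prefix biordf: <http://purl.org/net/biordfmicroarray/ns#> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical gene result with inline provenance; property names
# other than biordf:derives_from_region are assumptions.
<http://example.org/genelist1/ALG2>
    biordf:gene_symbol          "ALG2" ;
    biordf:fold_change          "2.3"^^xsd:decimal ;
    biordf:p_value              "0.001"^^xsd:decimal ;
    biordf:derives_from_region  <http://example.org/region/entorhinal_cortex> ;
    dct:source                  <http://example.org/experiment/AD_study_1> .
```

Because the data and its provenance share one graph, a single SPARQL query can constrain both at once.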
7. Microarray Workflow
Biological question
Differentially expressed genes
Sample gathering etc.
Experiment design
Microarray experiment
Image analysis
Normalization
Estimation, Clustering, Discrimination, T-test, …
Data extraction
Data analysis and modeling
10. What microarray experiments analyze samples taken from the entorhinal cortex region of Alzheimer's patients?
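A query like the one above could be expressed in SPARQL roughly as follows. This is a sketch: the exact predicates used for brain region and disease are assumptions, not confirmed by the slides.

```sparql
PREFIX biordf: <http://purl.org/net/biordfmicroarray/ns#>

# Find experiments by sampled brain region and studied disease.
# biordf:studies_disease is an assumed property name.
SELECT ?experiment
WHERE {
  ?experiment biordf:derives_from_region ?region ;
              biordf:studies_disease     ?disease .
  FILTER regex(str(?region), "entorhinal", "i")
  FILTER regex(str(?disease), "alzheimer", "i")
}
```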
11. What genes are overexpressed in the entorhinal cortex region, and what is their expression fold change and associated p-value?
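This second class of question adds statistical constraints. A hedged SPARQL sketch, assuming hypothetical biordf:fold_change and biordf:p_value properties and placeholder URIs:

```sparql
PREFIX biordf: <http://purl.org/net/biordfmicroarray/ns#>

# Overexpressed genes in a given region, with significance values.
# Thresholds and property names are illustrative only.
SELECT ?gene ?fold ?pval
WHERE {
  ?gene biordf:derives_from_region <http://example.org/region/entorhinal_cortex> ;
        biordf:fold_change ?fold ;
        biordf:p_value     ?pval .
  FILTER (?fold > 1.5 && ?pval < 0.05)
}
ORDER BY ?pval
```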
12. What other diseases may be associated with the same genes found to be linked to AD?
13. A Bottom-up Approach
Separate concerns/perspectives
Too many existing vocabularies to choose from
Lack of standardization among existing provenance vocabularies
Lack of a clear understanding of what needs to be captured
Process
Identify user query
Define terms
Test the query using test data
15. A Bottom-up Approach
Raw Data
Results
Questions
Which genes are markers for neurodegenerative diseases?
Was gene ALG2 differentially expressed in multiple experiments?
What software was used to analyse the data?
How can the experiment be replicated?
16. A Bottom-up Approach
Raw Data
Results
Questions
Which genes are markers for neurodegenerative diseases?
Was gene ALG2 differentially expressed in multiple experiments?
Provenance of Microarray experiment
What software was used to analyse the data?
How can the experiment be replicated?
17. A Bottom-up Approach
Provenance models
Workflow, experimental design
Domain ontologies (DO, GO…)
Community models
Raw Data
Results
Questions
Which genes are markers for neurodegenerative diseases?
Was gene ALG2 differentially expressed in multiple experiments?
Provenance of Microarray experiment
What software was used to analyse the data?
How can the experiment be replicated?
18. The Provenance Data Model: Four Types of
Provenance
http://purl.org/net/biordfmicroarray/ns#
19. RDF genelist representation
Institutional level: metadata associated with each genelist, such as the laboratory where the experiments were performed or the reference to the genelist.
Experimental context level: experimental protocols, such as the region of the brain and the disease (terms were partially mapped to MGED, DO and NIF).
20. RDF genelist representation
Data analysis and significance: statistical analysis methodology for selecting the relevant genes.
Dataset descriptions: version of a source dataset, who published the dataset. The Vocabulary of Interlinked Datasets (voiD) and Dublin Core terms (dct) were used.
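A dataset description of this kind might be sketched in Turtle using the voiD and Dublin Core vocabularies named above. The dataset URI and all literal values below are placeholders, not the project's actual metadata:

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Placeholder description of an RDF genelist dataset.
<http://example.org/biordf/genelists>
    a                   void:Dataset ;
    dct:title           "AD microarray gene lists" ;
    dct:publisher       <http://example.org/lab> ;
    dct:issued          "2010-01-01"^^xsd:date ;
    void:sparqlEndpoint <http://example.org/sparql> .
```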
25. Query federation with Diseasome: Is there a gene network for AD?
Source: PNAS 104:21, 8685 (2007)
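Federating the AD genelists with Diseasome could be sketched with a SPARQL 1.1 SERVICE clause along these lines. The Diseasome endpoint URL and its property names below are placeholders, and biordf:gene_symbol is an assumed property:

```sparql
PREFIX biordf: <http://purl.org/net/biordfmicroarray/ns#>

# For each AD-related gene in our data, ask a remote Diseasome
# endpoint which other diseases mention the same gene symbol.
SELECT ?gene ?otherDisease
WHERE {
  ?gene biordf:gene_symbol ?symbol .
  SERVICE <http://example.org/diseasome/sparql> {
    ?dGene <http://example.org/diseasome/label>          ?symbol ;
           <http://example.org/diseasome/associatedWith> ?otherDisease .
  }
}
```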
26. Demo
Go to http://purl.org/net/biordfmicroarray/demo
27. Conclusions
Levels of provenance: 1) institutional; 2) experimental context; 3) statistical analysis and significance; 4) dataset description
Provenance as RDF: SPARQL queries can express constraints about both the origins and the context of the data
Data model driven by the biological question: a bottom-up approach shields the model from rapidly evolving ontologies while enabling linking to widely used ontologies
Mapping is facilitated: mapping to existing provenance vocabularies, like OPM, PML and Provenir, is facilitated by:
biordf:has_input_value, which can be made a sub-property of the inverse of the OPM property used
biordf:derives_from_region, which can become a sub-property of the OPM property wasDerivedFrom.
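The two mappings named in the conclusions could be written out in RDFS/OWL roughly as follows. This is a sketch: the OPM namespace URI is a placeholder, not the vocabulary's actual URI.

```turtle
@prefix biordf: <http://purl.org/net/biordfmicroarray/ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix opm:    <http://example.org/opm#> .   # placeholder OPM namespace

# derives_from_region specializes OPM's wasDerivedFrom.
biordf:derives_from_region rdfs:subPropertyOf opm:wasDerivedFrom .

# has_input_value as a sub-property of the inverse of opm:used,
# expressed with an anonymous inverse property.
biordf:has_input_value rdfs:subPropertyOf [ owl:inverseOf opm:used ] .
```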
28. Summary and Future Work
Provenance modeling in a Semantic Web application
Query genes gathered from specific samples, in a given condition or from given organizations
Query genes produced through a particular statistical analysis process
Query for information about genes from the most recent dataset
The bottom-up approach
Separate concerns of interest
Create a minimum set of terms required for the motivating queries
Future work
Integrate our model with provenance information generated in a scientific workflow workbench
Integrate provenance information into the Excel spreadsheets where most biologists report their results
29. Acknowledgement
W3C BioRDF group
Kei Cheung, Michael Miller, M. Scott Marshall, Eric Prud’hommeaux,
Satya Sahoo, Matthias Samwald
The HCLS IG as well as Helen Parkinson, James Malone, Misha
Kapushesky and Jonas Almeida.
Editor's notes
A revolution in biology and biomedicine has taken place in the past 10 years with the invention of microarray technologies. A microarray is a small chip (point to figure on the left) where fragments of every known gene have been mapped to precise positions on the chip. Microarrays make it possible to measure how much of each of those genes is being expressed in cells. This technique is very often used, for example, to determine which genes may be the cause of specific diseases (point to figure on the right).
A simplistic view of performing a microarray experiment involves collecting sample cells from both the experimental case (for example, a disease sample) and a control. Inside the cell, DNA is converted into mRNA, the active molecule of the cell that will be translated into whatever biological product is causing or preventing the disease. mRNA molecules are complementary to the DNA that is bound to the array. To measure gene expression, the mRNA molecules are first labelled with fluorescent dyes and deposited either on a single array (represented in the figure) or on separate arrays for control vs. test. The mRNA will naturally bind to the DNA on the arrays and, because the dyes are fluorescent, each spot will emit light according to the amount of mRNA present in the cell. As a result, it is possible to compare how strongly each gene is expressed in the disease sample versus the normal sample simply from the brightness of each spot on the array.
Microarray technologies have been so successful in predicting disease that they have become standard practice in most biomedical studies. This has resulted in over 5,000 results being published each year. Many of these datasets are made publicly available in repositories such as EBI's ArrayExpress or NCBI's Gene Expression Omnibus. Beyond studying a single disease, the data deposited in these repositories is often used, for example, in translational studies where individual genes are studied across multiple experiments in an attempt to understand the underlying molecular biology.
Although some standards such as MIAME exist for reporting the results of microarray experiments, there is still a lack of interoperable data access
This work has resulted from ongoing efforts at the W3C HCLS BioRDF task force to understand neurodegenerative diseases. For our pilot study, we have taken a bottom-up approach to understand what genes affect Alzheimer's patients and to what extent those results agree across many experiments. For example, one characterization of AD is the formation of intracellular neurofibrillary tangles that affect neurons in brain regions involved in memory function. It is therefore important to have metadata such as the cell type(s), cell histopathology, and brain region(s) for comparing/integrating the results across different AD microarray experiments. It is important to consider the (raw) data source and the types of analysis performed on the data to arrive at meaningful interpretations. Finally, gene expression data may be combined with other types of data, including genomic functions, pathways, and associated diseases, to broaden the spectrum of integrative data analysis.
Separate concerns of interest
Understanding the needs for provenance vs. the steep learning curve of existing ontologies
Local, stable terminologies vs. less stable, external vocabularies
Identify the minimum set of terms required for users' queries
In order to explicitly represent microarray experiments in Semantic Web formats, we first needed to understand the microarray workflow and how an experiment translates into results. A microarray experiment usually starts with a question. For example, are there genes that affect the incidence of Alzheimer's disease? Why are certain areas of the brain more affected by neurofibrillary tangles, and could there be a list of genes responsible for that? According to the question, the experiment is designed to collect samples from specific areas of the brain from diseased and normal subjects. These samples are then used in the microarray experiment, which is only the first step of data processing. The microarrays are scanned, resulting in an image that has to be analysed to extract the intensity signals. Before they can be used, the raw results need to be normalized in order to reduce the noise. Finally, a variety of statistical techniques and methods are used to trim the list of tens of thousands of genes down to only a few highly significant ones that may be responsible for the disease.
The result of a microarray experiment is a list of differentially expressed genes. The standard methodology for reporting these results is a heatmap with the samples as columns and the genes as rows. These results are normally reported as a table in a PDF or, if we are lucky, the experimentalists provide the list of differentially expressed genes as a text file in supplementary material. Very seldom is the workflow for data collection, data processing and data analysis recorded in any format other than the methodology section of an article.
Different studies also tend to report their results using different data structures, making it harder to compare the experimental results. For example, in the three experiments that we selected, although the software package used to analyse the data was the same and the microarray workflow was very similar, the results were reported in very different ways. This, in part, reflects the need to report not only the sample-specific gene expression values, indicated in dark blue, but also the annotations available for each of the genes discovered, indicated in light blue, as well as the aggregated gene expression values, indicated in green. In some cases, the average signal was used to derive the final list of genes (point to genelist 2) and in others, only the fold change and its associated p-value were used (genelist 3).
For our bottom-up approach we have identified, and created a namespace for, the terminology necessary to describe in each of these genelists the portions that are most informative about the procedure used to preprocess and analyse the data, without creating an extensive list of terms that would multiply the amount of data to be represented.
These multiple metadata types, or layers of information, will allow us to answer specific biomedical questions. For example, a biologist or physician looking to compare several experiments about the same disease might use information about the samples that were analysed to answer questions such as "What microarray experiments…."
Information about the statistical fold change and p-value, on the other hand, enables a very different class of questions that are much more driven by a need to understand the reliability of the results, such as "What genes are overexpressed…" A statistician might need to look at these values before cross-experiment comparison is possible.
Finally, for a researcher interested in a deeper understanding of the molecular biology behind diseases, the gene annotation information is necessary to answer questions like “what other diseases may be…”.
It is in this type of question that the advantage of using RDF representations of genelists is most obvious.
To represent the genelists as RDF we chose to follow a bottom-up approach; that is, our model is derived from the data, but we can still link to upper ontologies. The reason for this was twofold. First, although there are several provenance ontologies, none of them was granular enough to represent our data. Second, our model arose from the experimental results and, most importantly, we wanted to be able to answer biological questions. Because the process of knowledge discovery often leads back to the raw data, it is not uncommon for data models to be extended with new data derived from the collection of new experimental results. Furthermore, the bottom-up approach shields our model from rapidly evolving ontologies while at the same time enabling linking to other community-accepted ontologies.
To answer our varied classes of questions, we have identified four types of provenance for our data model. Starting from the center, the provenance levels are the institutional level, the experimental protocol level, the data analysis and significance level, and the dataset description level.
The institutional level contains metadata about the genelists, such as the laboratory where the experiments were performed, or the link to the publication where the results were published. These help determine the trustworthiness of the data. The experimental context level includes data about the brain region where samples were collected or the histology of the cells.
The data analysis and significance level is concerned with information about the statistical procedures that were used to convert the raw data into a list of differentially expressed genes. This information helps determine the statistical significance of the results. Finally, the dataset description level is the description of the RDF dataset itself, such as the version, the publisher or the URL where it is made available. The dataset description level makes use of the Vocabulary of Interlinked Datasets (voiD) and Dublin Core terms.
Each of the provenance levels described is useful for answering a specific class of questions, depending on the perspective of whoever is querying the data. Someone interested in knowing the laboratory where the experiment was performed or where the raw data is published, for example, would need institutional-level provenance information. Other types of questions, such as finding all the experiments that used samples from the entorhinal cortex of Alzheimer's disease patients, would use experimental context level data. To answer questions such as the p-value associated with a measurement of gene fold change, data from the data analysis and significance level is necessary. Finally, the dataset description level can be used to find more RDF datasets that contain information about genes and diseases.
As part of this work, we have created federated queries that link our AD-related genes to genes in the Diseasome. The Diseasome is a community effort to link known human diseases to the genes known to be related to them. By federating our list of AD-related genes with the Diseasome, we were able to find other diseases associated with the same genes. This type of discovery is greatly facilitated by the RDF representation of the genelists.