How to analyse large data sets

Improved Medical Education in Basic
Sciences
for Better Medical Practicing
ImproveMEd
Systems biology for medicine
III. How to analyze the big data sets?

Systems biology studies often start with an expression
profile (drug-treated versus non-treated cells, normal versus
cancer cells, cells at different developmental stages),
obtained with a microarray or RNA-seq. A microarray is a
cost-effective approach…
And we get this…

A microarray can fit 10 000 spots. Let’s assume that each
spot is a gene – how do we organize the spots/genes in order
to extract a result?
A laser scanner measures one fluorescent label, then the
other, and superimposes the two images… each spot is
measured twice!
Intensity of fluorescent signal = quantity of bound DNA
Each spot can be substituted with a number representing the
relative change from ‘normal’ levels.
N = R/G (N = 1 means equal expression in both samples)
R = red fluorescence (tumor)
G = green fluorescence (normal cell)

Colors are converted to numbers, because numbers are easier to
organize!
N = R/G, where R = red fluorescence (tumor) and G = green fluorescence (normal cell)
N = 1 equal expression in both samples
N > 1 induction
N < 1 repression
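The rule above can be sketched in a few lines of Python. This is only an illustration: the gene names and intensity values are made up, and the `tolerance` parameter is an assumption added here so that near-equal spots can be treated as unchanged.

```python
def expression_ratio(red, green):
    """Return N = R/G for one spot (red = tumor, green = normal cell)."""
    if green <= 0:
        raise ValueError("green intensity must be positive")
    return red / green


def classify(ratio, tolerance=0.0):
    """Apply the rule N > 1 induction, N < 1 repression, N = 1 equal."""
    if ratio > 1 + tolerance:
        return "induction"
    if ratio < 1 - tolerance:
        return "repression"
    return "equal"


# Hypothetical (red, green) intensities for three spots.
spots = {
    "gene_A": (800.0, 200.0),
    "gene_B": (150.0, 600.0),
    "gene_C": (300.0, 300.0),
}

calls = {name: classify(expression_ratio(r, g)) for name, (r, g) in spots.items()}
for name, call in calls.items():
    print(name, call)
```

Real scanner output would first need background correction and normalization, which this sketch leaves out.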
http://www.hhmi.org/biointeractive/how-analyze-dna-microarray-data
http://www.hhmi.org/biointeractive/scanning-lifes-matrix-genes-proteins-and-small-molecules
We can compare many samples… or
we can follow one sample over time: human
fibroblasts stimulated with serum
and followed for 24 hours (Iyer et al.
1999).
Genes are then organized so that
induced ones are clustered at
one end, opposite the
repressed ones…
Such a presentation of data is called a heat map.

To extract knowledge from big data
we need statistical methods!
Commonly used – the limma package for the
R statistical environment.
To identify clusters we can use
cluster analysis!
The original numbers are log-transformed (to
base 2 or 10) and then we proceed by
calculating similarity scores, using a
computer program accompanying the
microarray platform.
For visual presentation of the data we turn the
numbers into colors again, but this
time green means repression and red
means induction.
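A minimal sketch of the two steps described above, log transformation followed by a similarity score. The slides do not say which similarity score the platform software uses; Pearson correlation is assumed here as one common choice, and the time-course ratios are invented.

```python
import math


def log2_profile(ratios):
    """Log-transform expression ratios so that 2x up and 2x down are symmetric."""
    return [math.log2(r) for r in ratios]


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sd_x * sd_y)


# Hypothetical ratios N = R/G at four time points.
ratios = {
    "gene_A": [1.0, 2.0, 4.0, 8.0],      # steadily induced
    "gene_B": [2.0, 4.0, 8.0, 16.0],     # same trend as gene_A
    "gene_C": [1.0, 0.5, 0.25, 0.125],   # steadily repressed
}
profiles = {name: log2_profile(values) for name, values in ratios.items()}

print(pearson(profiles["gene_A"], profiles["gene_B"]))  # close to +1
print(pearson(profiles["gene_A"], profiles["gene_C"]))  # close to -1
```

Genes with scores near +1 would end up in the same cluster; a clustering program repeats this comparison for every pair of genes.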

Another way of presenting data
is the volcano plot (common in
genome-wide studies).
The data are presented in
‘scatter-plot’ in order to quickly
find the most interesting e.g.
gene candidate in some
disease.
It combines a measure of statistical
significance (e.g., a p value from an ANOVA
model) with the magnitude of
the change.
Quick visual identification of
data (genes, etc.) that display
large magnitude changes that
are also statistically significant.
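[Figure: volcano plot. One axis shows statistical significance, with a line marking the border between p>0.05 and p<0.05; the other shows the difference between the same parameter in two samples, presented as ‘fold change’. Changes smaller than 2x are in grey; the interesting data are the large, significant changes.]
http://genomicsclass.github.io/book/pages/using_limma.html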
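The coordinates of a volcano plot and the selection of the interesting genes can be sketched as follows. The thresholds (2x fold change, p < 0.05) are the ones mentioned on the slide; the gene names and values are hypothetical, and the use of log2 and -log10 axes is the usual convention rather than something stated in the source.

```python
import math


def volcano_point(fold_change, p_value):
    """Return (x, y) = (log2 fold change, -log10 p value) for one gene."""
    return math.log2(fold_change), -math.log10(p_value)


def is_interesting(fold_change, p_value, min_fold=2.0, alpha=0.05):
    """True if the change is at least min_fold in either direction and p < alpha."""
    x, _ = volcano_point(fold_change, p_value)
    return abs(x) >= math.log2(min_fold) and p_value < alpha


# Hypothetical (fold change, p value) pairs.
genes = {
    "gene_A": (4.0, 0.001),    # large and significant
    "gene_B": (1.2, 0.0001),   # significant but small change
    "gene_C": (0.25, 0.2),     # large change but not significant
    "gene_D": (0.25, 0.01),    # strongly repressed and significant
}

selected = sorted(name for name, (fc, p) in genes.items() if is_interesting(fc, p))
print(selected)
```

In a real analysis the p values would also be corrected for multiple testing before applying the cut-off.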

Both the heat map and the volcano plot (and the statistical analysis
behind them) are the first step toward identifying and ranking the
genes/proteins behind an observed phenotype. The generated
lists of genes, responsible for the observed mechanisms or
representing potential therapy targets, are further processed by different
bioinformatics tools.
The gene list can be fed into: Gene Ontology, Gene Set Enrichment
Analysis, Transcription Factor Analysis…
Generated lists have to use a uniform nomenclature in order to be mutually
comparable.

Gene Ontology – http://geneontology.org/
Bioinformatics tool useful for assigning the right
name to a sequence and connecting molecular
changes to cellular processes.
Genes and proteins are conserved across most living
organisms and have shared functions. Finding the role of
a gene in one organism can help illuminate its role
in another. The Gene Ontology Consortium maintains a
shared vocabulary for describing genes and their products.
Terms are organized according to:
-Biological process
-Molecular function
-Cellular component
The Gene Ontology Consortium, Nature Genetics, 2000.
Biological processes include cell growth, proliferation,
translation or cAMP synthesis…

[Figure: part of the cellular component ontology, showing parent nodes and children nodes.]
[Figure: example gene annotation listing the systematic ORF name, the standard gene name, GO biological process, molecular function and cellular component.]
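The parent/child structure of an ontology can be illustrated with a small graph in which each term points to its parents. The hierarchy below is a toy example built from process names mentioned on the slides; it is not copied from the Gene Ontology itself, and the exact parent links are assumptions.

```python
# Toy hierarchy: each term maps to its parent terms (illustrative only).
PARENTS = {
    "biological process": [],
    "cellular process": ["biological process"],
    "metabolic process": ["biological process"],
    "cell growth": ["cellular process"],
    "translation": ["cellular process", "metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


print(sorted(ancestors("translation")))
```

A gene annotated with a child term is implicitly annotated with all of its ancestors, which is why enrichment tools walk the graph upward in this way.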

Gene set enrichment analysis – GSEA
Analytical method designed for finding and interpreting
sets of genes.
Looking for genes that change together
- determining levels of proteins participating in the same
signaling pathway
- looking for molecules participating in the same
biological process
Free software package with initial database of 1,325
biologically defined gene sets.
http://software.broadinstitute.org/gsea/index.jsp
Subramanian et al. (2005) PNAS 102:15545
1. Sort the genes according to a criterion, e.g. expression
level.
2. Compare your list with already existing gene sets and
calculate an ‘enrichment score’ for each set: it shows whether the
genes of the set are over-represented at the top or the bottom of
your list, according to a Kolmogorov-Smirnov-type statistic.
3. The Max Enrichment Score (MES) is a relevance indicator
of an existing gene set for the new data set being
investigated.
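The three steps above can be sketched as a running sum over the ranked list. This is a simplified, unweighted version of a Kolmogorov-Smirnov-type score, assumed here for illustration; the published GSEA method additionally weights each hit and estimates significance by permutation. Gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list; step up at set members, down otherwise.

    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Step 1: genes already sorted, e.g. from most induced to most repressed.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]

# Step 2: compare with existing (here invented) gene sets.
print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
print(enrichment_score(ranked, {"g5", "g6"}))  # set concentrated at the bottom
```

A strongly positive score means the set sits near the top of the list, a strongly negative one near the bottom, and a score close to zero means its genes are spread evenly.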

Transcription Factor Analysis
Genes whose expression level has changed may
have been regulated by the same transcription
factor.
Such genes are identified by combining omics data and
prior knowledge.
The ChEA database currently links 159 transcription
factors to more than 30,000 genes - a total of 361,299
interactions extracted from 157 publications.
TRANSFAC, PAINT, JASPAR – other resources used in transcription factor analysis
Kinase Enrichment Analysis (KEA)
Web-based and command-line software that links lists of
mammalian proteins with the protein kinases that likely
phosphorylate them. The database contains 436
kinases and 14,374 interactions from 3,469
publications.
http://amp.pharm.mssm.edu/Enrichr/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2944209/
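Both transcription factor and kinase enrichment ask the same question: does the input list overlap the known targets of a regulator more than expected by chance? One common way to answer it is a hypergeometric (one-sided Fisher-type) test, sketched below. The slides do not state which test these tools apply, so this is an assumption, and all counts are invented.

```python
from math import comb


def overlap_p_value(n_background, n_targets, n_list, n_overlap):
    """P(overlap >= n_overlap) when n_list genes are drawn at random
    from n_background genes, n_targets of which are known targets."""
    total = comb(n_background, n_list)
    upper = min(n_targets, n_list)
    tail = sum(
        comb(n_targets, k) * comb(n_background - n_targets, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 000 genes in the background, a regulator with
# 500 known targets, an input list of 100 genes, 15 of them targets.
p = overlap_p_value(20000, 500, 100, 15)
print(f"p = {p:.3g}")
```

Repeating the test for every regulator in the database and sorting by p value gives a ranked list of candidate regulators.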

A number of transcription factors act at the
same time on the same promoter…

Chromatin
immunoprecipitation (ChIP) is
the method of choice for
finding the DNA sequences
that interact with a given
protein. Data from all
ChIP-seq experiments
can be fed into the same
database (ChEA)…
https://galaxyproject.org/tutorials/chip/

Expression2Kinases – X2K
Software that combines different databases
and tools.
INPUT: a list of differentially expressed genes
OUTPUT: protein kinases, transcription factors and
protein complexes that are putative regulators of
the input genes.
Using such software we can construct hypothetical
regulatory pathways and protein
interaction networks.
The results need experimental proof of concept!
The workflow of X2K
Chen et al. (2012) Bioinformatics 28:105

What we really want is to transform the list into a network,
a form often used to present interactions between cellular
components.
Euler, 1700s, Seven Bridges of Königsberg
Node = molecule
Edge = interaction
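In code, such a network is often stored as an adjacency list: each node (molecule) maps to the set of nodes it interacts with. The sketch below uses invented molecule names and treats interactions as undirected, which is an assumption; signaling networks are frequently directed.

```python
from collections import defaultdict


def build_network(edges):
    """Turn a list of (molecule, molecule) interactions into an adjacency list."""
    network = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


# Hypothetical interactions (edges) between molecules (nodes).
edges = [
    ("kinase_1", "protein_A"),
    ("kinase_1", "protein_B"),
    ("protein_A", "tf_X"),
]

network = build_network(edges)

# Degree = number of interaction partners of each node.
degree = {node: len(neighbours) for node, neighbours in network.items()}
print(degree)
```

Highly connected nodes (hubs) are often the first candidates examined when looking for key regulators.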

Types of networks relevant to systems biology
1. Cell Signaling Networks
- cancer signaling network
doi:10.1038/psp.2013.38
2. Protein-Protein Interaction Networks
- dystrophin protein-protein interactions
http://parendogen677s10.weebly.com/protein-protein-interactions.html
3. Gene Regulatory Networks
- development of the Drosophila eye
http://dev.biologists.org/content/140/1/82

Genes2Networks
Lists2Networks
Combine experimental data (mRNA
expression microarrays, genome-wide
ChIP-X, RNAi screens, proteomics &
phosphoproteomics) with a background
network of all known interactions (prior
biological knowledge)
http://www.lists2networks.org
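The idea of combining a gene list with a background network can be sketched as follows: connect the genes from the list ("seeds") through short paths in the background network and report the intermediate nodes that link them. This is a simplified illustration, not the algorithm used by these tools; the network, the node names and the `max_steps` limit are all assumptions.

```python
from collections import deque
from itertools import combinations


def shortest_path(network, start, goal):
    """Breadth-first search; returns the list of nodes on one shortest path."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(network.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def connect_seeds(network, seeds, max_steps=2):
    """Keep the nodes on short paths (at most max_steps edges) between seed pairs."""
    nodes = set()
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(network, a, b)
        if path is not None and len(path) - 1 <= max_steps:
            nodes.update(path)
    return nodes


# Hypothetical background network of known interactions.
background = {
    "gene_A": {"link_1"},
    "link_1": {"gene_A", "gene_B"},
    "gene_B": {"link_1", "link_2"},
    "link_2": {"gene_B", "link_3"},
    "link_3": {"link_2", "gene_C"},
    "gene_C": {"link_3"},
}
seeds = {"gene_A", "gene_B", "gene_C"}

subnetwork = connect_seeds(background, seeds)
intermediates = subnetwork - seeds
print(sorted(subnetwork), sorted(intermediates))
```

Here `link_1` is reported as an intermediate because it joins two seed genes directly, while `gene_C` is left out because it is more than two steps away from the other seeds.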

Additional software exists for the visualisation and analysis of
networks:
Pajek (Vladimir Batagelj & Andrej Mrvar, Ljubljana,
Slovenia)
http://vlado.fmf.uni-lj.si/pub/networks/doc/gd.01/Pajek2.png
http://vlado.fmf.uni-lj.si/pub/networks/doc/pajek.pdf
Cytoscape (Trey Ideker; Shannon et al., 2003)
http://www.cytoscape.org/
SNAVI (Ma’ayan et al. 2009)
yEd…
Identification of pathways, subnetworks, clusters, special
features of the network…

Molecular data can be further
integrated with structural data in
order to produce 3D models
(macromolecular complexes,
virtual cells)…
Patwardhan et al. 2017, DOI: 10.7554/eLife.25835
(erythrocytes infected with
Plasmodium)

1. Statistical analysis is critical for extracting knowledge about a
system from big data sets. Statistical analysis generates a list of
genes/proteins/RNAs relevant for the study.
2. The list of genes can be fed into software (bioinformatics tools)
and combined with prior knowledge in order to find new hypothetical
pathways, subnetworks, regulatory mechanisms…
3. Integration of experimental big data and prior knowledge
(multiple databases) allows multiscale understanding of
physiological functions, pathophysiology or pharmacokinetics.
4. Computationally generated predictions have to be
confirmed experimentally.
