Gene Expression - Microarrays Misha Kapushesky European Bioinformatics Institute, EMBL St. Petersburg, Russia May 2010
 
 
 
 
Compare gene expression in this cell type…
… after drug treatment
… at a later developmental time
… in a different body region
… after viral infection
… in samples from patients
… relative to a knockout
Gene expression is context-dependent, and is regulated in several basic ways Page 297
Outline: microarray data analysis
Gene expression
Microarrays
Preprocessing
  normalization
  scatter plots
Inferential statistics
  t-test
  ANOVA
Exploratory (descriptive) statistics
  distances
  clustering
  principal components analysis (PCA)
Microarrays: tools for gene expression A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array. Page 312
Microarrays: tools for gene expression The most common form of microarray is used to measure  gene expression. RNA is isolated from matched samples  of interest. The RNA is typically converted to cDNA,  labeled with fluorescence (or radioactivity), then hybridized  to microarrays in order to measure the expression levels of thousands of genes.
Measuring RNA abundances
How it works
Spotted Arrays
Oligonucleotide Arrays
Advantages of microarray experiments
Fast: data on >20,000 transcripts within weeks
Comprehensive: entire yeast or mouse genome on a chip
Flexible: custom arrays can be made to represent genes of interest
Easy: submit RNA samples to a core facility
Cheap?: chip representing 20,000 genes for $300
Disadvantages of microarray experiments
Cost
■ Some researchers can’t afford to do appropriate numbers of controls, replicates
RNA significance
■ The final product of gene expression is protein
■ “Pervasive transcription” of the genome is poorly understood (ENCODE project)
■ There are many noncoding RNAs not yet represented on microarrays
Quality control
■ Impossible to assess elements on array surface
■ Artifacts with image analysis
■ Artifacts with data analysis
■ Not enough attention to experimental design
■ Not enough collaboration with statisticians
Workflow: sample acquisition, data acquisition, data analysis, data confirmation, biological insight
Stage 1: Experimental design
Stage 2: RNA and probe preparation
Stage 3: Hybridization to DNA arrays
Stage 4: Image analysis
Stage 5: Microarray data analysis
Stage 6: Biological confirmation
Stage 7: Microarray databases
Stage 1: Experimental design Page 314
[1] Biological samples: technical and biological replicates; determine the data analysis approach at the outset
[2] RNA extraction, conversion, labeling, hybridization: except for RNA isolation, routinely performed at core facilities
[3] Arrangement of array elements on a surface: randomization can reduce spatially-based artifacts
Stage 2: RNA preparation
For Affymetrix chips, total RNA is needed (about 5 µg). Confirm purity by running an agarose gel. Measure A260/A280 to confirm purity and quantity. One of the greatest sources of error in microarray experiments is artifacts associated with RNA isolation; an appropriately balanced, randomized experimental design is necessary.
Stage 3: Hybridization to DNA arrays
The array consists of cDNA or oligonucleotides. Oligonucleotides can be deposited by photolithography. The sample is converted to cRNA or cDNA. (Note that the terms “probe” and “target” may refer to the element immobilized on the surface of the microarray, or to the labeled biological sample; for clarity, it may be simplest to avoid both terms.)
Stage 4: Image analysis
RNA transcript levels are quantitated. Fluorescence intensity is measured with a scanner.
Differential Gene Expression on a cDNA Microarray [panels: Rett, Control]. αB-crystallin is over-expressed in Rett Syndrome
 
Fig. 8.21 Page 319
 
Stage 5: Microarray data analysis Page 318
Stage 6: Biological confirmation Page 320
Microarray experiments can be thought of as “hypothesis-generating” experiments. The differential up- or down-regulation of specific RNA transcripts can be measured using independent assays such as
-- Northern blots
-- reverse transcription polymerase chain reaction (RT-PCR)
-- in situ hybridization
Stage 7: Microarray databases There are two main repositories: Gene Expression Omnibus (GEO) at NCBI ArrayExpress at the European Bioinformatics Institute (EBI)
Microarray Overview I
Microbial ORFs: design PCR primers, obtain PCR products
Eukaryotic genes: select cDNA clones, obtain PCR products
Microtiter plates: many different plates containing different genes
Microarray slide (with 60,000 or more spotted genes): for each plate set, many identical replicas
Microarray Overview II
Prepare fluorescently labeled probes (control, test)
Hybridize, wash
Measure fluorescence in 2 channels (red / green)
Analyze the data to identify patterns of gene expression
Affymetrix GeneChip™ Expression Analysis
Obtain RNA samples
Prepare fluorescently labeled probes (control, test)
Hybridize and wash chips
Scan chips
Analyze (PM, MM)
Gene Microarray Expression Analysis Spots  on an Array Fluorescence Intensity Expression Measurement Tissue Selection Differential State/Stage Selection RNA Preparation and Labeling Competitive Hybridization
Steps in the Process
MIAME
In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME). The MIAME framework standardizes six areas of information:
► experimental design
► microarray design
► sample preparation
► hybridization procedures
► image analysis
► controls for normalization
Visit http://www.mged.org
Interpretation of RNA analyses The relationship of DNA, RNA, and protein: DNA is transcribed to RNA. RNA quantities and half-lives vary. There tends to be a low positive correlation between RNA and protein levels. The pervasive nature of transcription: The Encyclopedia of DNA Elements (ENCODE) project identified functional features of genomic DNA, initially in 30 megabases (1% of the human genome). One of its observations was the “pervasive nature of transcription”: the vast majority of DNA is transcribed, although the function is unknown.
Microarray data analysis
• begin with a data matrix (gene expression values versus samples): rows are genes (RNA transcript levels), columns are samples
Typically, there are many genes (>> 20,000) and few samples (~ 10) Fig. 9.1 Page 333
The analysis then proceeds through preprocessing, inferential statistics, and descriptive statistics.
Microarray data analysis: preprocessing The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment.
Data analysis: global normalization
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.
Example: total fluorescence in the Cy3 channel = 4 million units and in the Cy5 channel = 2 million units. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
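The two-step procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `global_normalize`, the constant background values, and the choice to rescale by total channel intensity (one simple way to bring the overall ratio to 1) are not taken from the slides.

```python
def global_normalize(cy3, cy5, background3=0.0, background5=0.0):
    """Background-subtract both channels, then rescale Cy5 so channel totals match.

    Hypothetical helper for illustration; real pipelines estimate background
    per spot and may normalize on log ratios instead of totals.
    """
    # Step 1: subtract a background estimate (e.g. from a blank region of the array).
    cy3_corrected = [value - background3 for value in cy3]
    cy5_corrected = [value - background5 for value in cy5]
    # Step 2: apply one global scale factor so that total Cy3 equals total Cy5.
    scale = sum(cy3_corrected) / sum(cy5_corrected)
    cy5_scaled = [value * scale for value in cy5_corrected]
    return cy3_corrected, cy5_scaled


# Toy array in which the Cy3 channel is uniformly twice as bright as Cy5,
# mirroring the 4 million versus 2 million units example.
cy3 = [2000.0, 1000.0, 1000.0]
cy5 = [1000.0, 500.0, 500.0]
cy3_norm, cy5_norm = global_normalize(cy3, cy5)
ratios = [g / r for g, r in zip(cy3_norm, cy5_norm)]
print(ratios)
```

After rescaling, the gene that looked 2-fold regulated (2,000 versus 1,000 units) has a ratio of 1, which is the point of the correction.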
Scatter plots Useful to represent gene expression values from two microarray experiments (e.g. control, experimental) Each dot corresponds to a gene expression value Most dots fall along a line Outliers represent up-regulated or down-regulated genes
Differential Gene Expression in Different Tissue and Cell Types [scatter plot axis labels: Brain, Astrocyte, Astrocyte, Fibroblast]
[Scatter plot: Expression level (sample 1) versus Expression level (sample 2); regions labeled high and low expression level, up and down]
Log-log  transformation
Scatter plots
Typically, data are plotted on log-log coordinates. Visually, this spreads out the data and offers symmetry.
time    behavior       raw ratio value    log2 ratio value
t=0     basal          1.0                 0.0
t=1h    no change      1.0                 0.0
t=2h    2-fold up      2.0                 1.0
t=3h    2-fold down    0.5                -1.0
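The symmetry that the log2 transformation provides can be checked directly. The snippet below is a minimal sketch that recomputes the log2 column from the raw ratios in the table; the dictionary names are illustrative only.

```python
import math

# Raw ratios from the time-course table above.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# A 2-fold increase and a 2-fold decrease become +1 and -1,
# so up- and down-regulation are symmetric around 0.
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}
print(log2_ratios)
```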
[Plot: Log ratio versus Mean log intensity; regions labeled high and low expression level, up and down]
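The two axes of such a plot are commonly called M (log ratio) and A (mean log intensity). A minimal sketch of how they could be computed for two-channel data follows; the function name `ma_values` and the sample intensities are hypothetical.

```python
import math


def ma_values(red, green):
    """Return (M, A) per spot: M = log2(R/G), A = mean of log2(R) and log2(G)."""
    pairs = []
    for r, g in zip(red, green):
        m = math.log2(r) - math.log2(g)          # log ratio (y-axis)
        a = 0.5 * (math.log2(r) + math.log2(g))  # mean log intensity (x-axis)
        pairs.append((m, a))
    return pairs


# Two hypothetical spots: one 4-fold higher in red, one unchanged.
spots = ma_values(red=[1024.0, 256.0], green=[256.0, 256.0])
print(spots)
```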
You can make these plots in Excel… … but for many bioinformatics applications use R. Visit http://www.r-project.org to download it.
 
There are limits to what you can measure
The Limits of log-ratios: The space we explore
Good Data
Bad Data from Parts Unknown (Gary Churchill). Each “pin group” is colored differently.
Lowess Normalization [plot residue: axis A; SD = 0.346]
Ratio Cy3/Cy5 for the same RNA, sorted from least to most expressed
LOWESS Results
Affymetrix Chips
Mismatch (MM) probes
Computing expression summaries: a three-step process
Background/Signal Adjustment
Normalization Methods
Summarization
Robust multi-array analysis (RMA)
GCRMA
 
After RMA (a normalization procedure), the median is near zero, and skewing is corrected. Scatterplots display the effects of normalization. [Plot axes: M versus A, before and after]
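RMA is commonly described as including a quantile normalization step, which forces every array to share the same intensity distribution. The transcript does not preserve the details of the normalization slide, so the following is only a sketch of that general idea, not the lecture's own method; `quantile_normalize` and the toy values are hypothetical, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Give every array (a list of intensities) the same distribution.

    arrays: list of equal-length lists, one list per chip.
    """
    n_probes = len(arrays[0])
    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(values) for values in arrays]
    rank_means = [
        sum(column[rank] for column in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]
    normalized = []
    for values in arrays:
        # order[k] is the index of the k-th smallest value in this array.
        order = sorted(range(n_probes), key=lambda i: values[i])
        result = [0.0] * n_probes
        for rank, index in enumerate(order):
            # Replace each value by the cross-array mean for its rank.
            result[index] = rank_means[rank]
        normalized.append(result)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
print(normalized)
```

Each probe keeps its rank within its own chip, but the set of values on every chip becomes identical.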
vsn: variance stabilizing normalization
vsn: post-normalization plot
Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied. [Axes: array; log signal intensity]
RMA can adjust for the effect of GC content [Axes: GC content; log intensity]
Robust multi-array analysis (RMA): RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software). [Precision plot: average log expression versus log expression SD, for RMA and MAS 5.0]
Robust multi-array analysis (RMA): RMA offers comparable accuracy to MAS 5.0. [Accuracy plot: log nominal concentration versus observed log expression]
Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to reject the null hypothesis or fail to reject it. For many applications, we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05.
Analyzing expression data
Question: for each of my 20,000 transcripts, decide whether it is significantly regulated in some disease.
[1] Obtain a matrix of genes (rows) and expression values (columns). Here there are 20,000 rows of genes, of which the first six are shown. There are three control samples and three disease samples. Calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.
Analyzing expression data
[2] Calculate the ratios of control versus disease. Note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cut-off for changes of interest, such as two-fold.
A significant difference Probably not
Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.
$t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}}$
Questions
Is the sample size (n) adequate?
Are the data normally distributed?
Is the variance of the data known?
Is the variance the same in the two groups?
Is it appropriate to set the significance level to p < 0.05?
Analyzing expression data
[3] Perform a t-test. The hypothesis is that the transcript in the disease group is up (or down) relative to controls.
Analyzing expression data
[3] Note the results: you can have…
a small p value (<0.05) with a big ratio difference
a small p value (<0.05) with a trivial ratio difference
a large p value (>0.05) with a big ratio difference
a large p value (>0.05) with a trivial ratio difference
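The t statistic for a single transcript can be sketched with the Python standard library. The slides do not say which form of the standard error is used, so this example assumes Welch's form (variances not pooled). The expression values are hypothetical, and converting t to a p value (which needs the t distribution) is left out.

```python
import math
import statistics


def t_statistic(control, disease):
    """t = (difference between mean values) / (standard error of the difference).

    Assumes Welch's standard error: sqrt(s1^2 / n1 + s2^2 / n2).
    """
    mean_difference = statistics.mean(disease) - statistics.mean(control)
    standard_error = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(disease) / len(disease)
    )
    return mean_difference / standard_error


# Hypothetical log2 expression values for one transcript
# (three control samples, three disease samples).
control = [7.0, 8.0, 9.0]
disease = [10.0, 11.0, 12.0]
t = t_statistic(control, disease)
print(round(t, 3))
```

The same mean difference yields a smaller t if the replicates within a group are more variable, which is why a big ratio can still come with a large p value.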
Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes: p < (0.05)/10,000 or p < $5 \times 10^{-6}$. The Bonferroni correction is generally considered to be too conservative.
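The arithmetic behind these numbers is short enough to write out. This is a sketch only; the helper `bonferroni_significant` and the example p values are hypothetical.

```python
n_genes = 10_000
alpha = 0.05

# Expected number of genes passing p < 0.05 by chance alone,
# if no gene is truly regulated.
expected_false_positives = n_genes * alpha

# Bonferroni: divide the significance level by the number of tests.
bonferroni_threshold = alpha / n_genes


def bonferroni_significant(p_values, alpha=0.05):
    """Flag the p values that pass the Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Four hypothetical tests: the corrected threshold is 0.05 / 4 = 0.0125.
flags = bonferroni_significant([1e-7, 0.01, 0.04, 0.2])
print(expected_false_positives, bonferroni_threshold, flags)
```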
Inferential statistics: false discovery rate
The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery. The expected number of false positives equals the p value threshold of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). The FDR is the expected proportion of false discoveries among the transcripts called regulated. You can adjust the false discovery rate. For example:
FDR     # regulated transcripts    # false discoveries
0.1     100                        10
0.05    45                         3
0.01    20                         1
Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?
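One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure. The slides do not name a specific procedure, so the sketch below is a swapped-in illustration rather than the lecture's own method; the function name and the p values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans: True where the hypothesis is rejected at the given FDR."""
    m = len(p_values)
    # Indices of the p values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(p_values, fdr=0.05)
print(sum(calls), "of", len(p_values), "called significant")
```

Five of these p values are below 0.05, but only the two smallest survive at an FDR of 0.05; Bonferroni (threshold 0.005) would keep only the first.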
Inferential statistics: other methods used
A volcano plot displays both p values and fold change [Axes: log fold change (treated/untreated); p value (treated versus control)]
 
 
 
Descriptive statistics
Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
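The two metrics can disagree about which profiles are "close". The sketch below computes both for two hypothetical expression profiles that have the same shape but different magnitudes; the function names and values are illustrative.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient r between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Same rising trend, different absolute levels.
gene_a = [1.0, 2.0, 3.0]
gene_b = [4.0, 6.0, 8.0]
d = euclidean_distance(gene_a, gene_b)
r = pearson_correlation(gene_a, gene_b)
print(round(d, 3), round(r, 3))
```

Here the Euclidean distance is large while the correlation is perfect, so the choice of metric decides whether these two genes end up in the same cluster.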
What is a cluster? A cluster is a group that has  homogeneity  (internal cohesion) and  separation  (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.
Data matrix (20 genes and 3 time points from Chu et al., 1998). Rows: genes; columns: samples (time points). Software: S-PLUS package
3D plot (using S-PLUS software). Axes: t=0, t=0.5, t=2.0
Descriptive statistics: clustering Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. Page 355
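Distance Is Defined by a Metric
[Figure: distance D between pairs of expression profiles under two metrics. Euclidean: 6.0 and 1.4; Pearson*: +1.00 and -0.05]
Distance Is Defined by a Metric
[Figure: distance D between pairs of expression profiles under two metrics. Euclidean: 4.2 and 1.4; Pearson (r*-1): -1.00 and -0.90]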
Distance Matrix [rows and columns: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6]
Agglomerative clustering (adapted from Kaufman and Rousseeuw, 1990)
Step 0: the objects a, b, c, d, e are separate
Step 1: a and b merge into a,b
Step 2: d and e merge into d,e
Step 3: c joins d,e to form c,d,e
Step 4: a,b and c,d,e merge into a,b,c,d,e
… tree is constructed
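The bottom-up merging can be sketched as a short loop that repeatedly joins the two closest clusters. This is a minimal illustration, assuming single linkage (minimum pairwise distance) and hypothetical one-dimensional coordinates chosen so that the merges happen in the order a,b then d,e then c,d,e then a,b,c,d,e.

```python
def agglomerative_merges(points, distance):
    """Return the sequence of merges as pairs of clusters (tuples of labels).

    points: dict mapping a label to a value; distance: function of two values.
    Clusters are compared by single linkage (minimum pairwise distance).
    """
    clusters = [(label,) for label in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    distance(points[u], points[v])
                    for u in clusters[i]
                    for v in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical coordinates for the five objects.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 9.5}
merges = agglomerative_merges(points, lambda x, y: abs(x - y))
for left, right in merges:
    print(",".join(left), "+", ",".join(right))
```

The list of merges, read in order, is the tree: each merge is one internal node.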
Divisive clustering
Step 0: all objects form one cluster a,b,c,d,e
Next, c,d,e is split off (leaving a,b)
Then d,e is split off (leaving c)
Then a,b, d,e are resolved into the individual objects a, b, c, d, e
… tree is constructed
Agglomerative (steps 0 to 4) and divisive (steps 4 to 0) clustering traverse the same tree in opposite directions. Adapted from Kaufman and Rousseeuw (1990)
 
 
Agglomerative and divisive clustering sometimes give conflicting results, as shown here
Agglomerative Linkage Methods (HCL-6)
Single Linkage (HCL-7)
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$D_{AB} = \min_{i,j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$
Average Linkage (HCL-8)
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$
Complete Linkage (HCL-9)
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max_{i,j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$
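The three linkage definitions differ only in how the pairwise distances are summarized. A minimal sketch, using hypothetical one-dimensional clusters and absolute difference as the distance d:

```python
def pairwise_distances(cluster_a, cluster_b, d):
    """All distances d(u, v) with u in cluster A and v in cluster B."""
    return [d(u, v) for u in cluster_a for v in cluster_b]


def single_linkage(cluster_a, cluster_b, d):
    return min(pairwise_distances(cluster_a, cluster_b, d))


def average_linkage(cluster_a, cluster_b, d):
    distances = pairwise_distances(cluster_a, cluster_b, d)
    return sum(distances) / (len(cluster_a) * len(cluster_b))


def complete_linkage(cluster_a, cluster_b, d):
    return max(pairwise_distances(cluster_a, cluster_b, d))


def absolute_difference(u, v):
    return abs(u - v)


# Hypothetical clusters; the pairwise distances are 4, 6, 3 and 5.
cluster_a = [0.0, 1.0]
cluster_b = [4.0, 6.0]
d_single = single_linkage(cluster_a, cluster_b, absolute_difference)
d_average = average_linkage(cluster_a, cluster_b, absolute_difference)
d_complete = complete_linkage(cluster_a, cluster_b, absolute_difference)
print(d_single, d_average, d_complete)
```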
Comparison of Linkage Methods Single Average Complete
Two-way  clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)
[Figure: Euclidean distance, chord distance, and angle distance illustrated for points A, B, A’, B’ on axes x1 and x2]
K-Means/Medians Clustering [figure: genes G1 to G13]
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
3. Calculate mean/median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile (calculated in step 3) is the closest to that gene’s expression profile.
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
k-means is most useful when the user has an a priori hypothesis about the number of clusters the genes should belong to.
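The five steps translate almost line for line into code. The sketch below is an illustration only: the function name `k_means`, the toy profiles, and the shuffled round-robin start (a variation on purely random assignment, used so that no cluster starts empty) are assumptions, and it uses means rather than medians.

```python
import random


def k_means(profiles, k, max_iterations=100, seed=0):
    """Cluster expression profiles (lists of numbers) into k groups."""
    rng = random.Random(seed)
    n = len(profiles)
    # Step 2: random initial assignment (shuffled round-robin).
    assignment = [i % k for i in range(n)]
    rng.shuffle(assignment)
    for _ in range(max_iterations):
        # Step 3: mean expression profile of each cluster.
        means = []
        for cluster in range(k):
            members = [profiles[i] for i in range(n) if assignment[i] == cluster]
            if members:
                means.append([sum(col) / len(members) for col in zip(*members)])
            else:
                means.append(None)  # an empty cluster attracts no genes
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_assignment = []
        for profile in profiles:
            distances = []
            for mean in means:
                if mean is None:
                    distances.append(float("inf"))
                else:
                    distances.append(sum((x - m) ** 2 for x, m in zip(profile, mean)))
            new_assignment.append(distances.index(min(distances)))
        # Step 5: stop when no gene changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


# Two well-separated groups of hypothetical two-sample profiles.
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels = k_means(profiles, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only which genes share a label is meaningful.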
K-Means / K-Medians Support (KMS)
Principal components analysis (PCA)
An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D. For a matrix of m genes x n samples, create a new covariance matrix of size n x n. Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
Principal components analysis (PCA): objectives •  to reduce dimensionality •  to determine the linear combination of variables •  to choose the most useful variables (features) •  to visualize multidimensional data •  to identify groups of objects (e.g. genes/samples) •  to identify outliers
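The covariance step and the first principal component can be sketched without any external library. This example assumes power iteration as the way to find the leading eigenvector of the n x n covariance matrix (a simple stand-in for the full eigendecomposition that statistical packages use); the function names and the toy matrix are hypothetical.

```python
import math


def covariance_matrix(rows):
    """n x n covariance between the columns (samples) of an m x n matrix."""
    m, n = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in rows]
    return [
        [sum(r[a] * r[b] for r in centered) / (m - 1) for b in range(n)]
        for a in range(n)
    ]


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix, found by power iteration.

    Assumes the data have nonzero variance and that the starting vector is
    not orthogonal to the first principal component.
    """
    cov = covariance_matrix(rows)
    n = len(cov)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return vector


# Four hypothetical genes measured in two samples; sample 2 is always
# twice sample 1, so all variation lies along the direction (1, 2).
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(rows)
print([round(x, 4) for x in pc1])
```

Projecting each gene onto `pc1` would replace its two sample values with a single coordinate while keeping all of the variance in this toy data set.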
http://www.okstate.edu/artsci/botany/ordinate/PCA.htm
 
 
High-throughput methods beyond microarrays
RNA-seq
Acknowledgements

Weitere ähnliche Inhalte

Was ist angesagt? (20)

Dna microarray mehran- u of toronto
Dna microarray  mehran- u of torontoDna microarray  mehran- u of toronto
Dna microarray mehran- u of toronto
 
Microarray technique
Microarray techniqueMicroarray technique
Microarray technique
 
DNA Microarray and Analysis of Metabolic Control
DNA Microarray and Analysis of Metabolic ControlDNA Microarray and Analysis of Metabolic Control
DNA Microarray and Analysis of Metabolic Control
 
dna microarray
dna microarraydna microarray
dna microarray
 
Nucleic acid microarray
Nucleic acid microarrayNucleic acid microarray
Nucleic acid microarray
 
DNA microarray
DNA microarrayDNA microarray
DNA microarray
 
Dna chip
Dna chipDna chip
Dna chip
 
Microarray
MicroarrayMicroarray
Microarray
 
Dna chips and microarrays
Dna chips and microarraysDna chips and microarrays
Dna chips and microarrays
 
Study of microarray
Study of microarrayStudy of microarray
Study of microarray
 
Micro array analysis
Micro array analysisMicro array analysis
Micro array analysis
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data Analysis
 
DNA MICROARRAY TECHNIQUES
DNA MICROARRAY TECHNIQUESDNA MICROARRAY TECHNIQUES
DNA MICROARRAY TECHNIQUES
 
Single cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applicationsSingle cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applications
 
Gene expression profiling i
Gene expression profiling  iGene expression profiling  i
Gene expression profiling i
 
Protein Microarrays: Approaches to Printing
Protein Microarrays: Approaches to PrintingProtein Microarrays: Approaches to Printing
Protein Microarrays: Approaches to Printing
 
DNA microarray final ppt.
DNA microarray final ppt.DNA microarray final ppt.
DNA microarray final ppt.
 
155 dna microarray
155 dna microarray155 dna microarray
155 dna microarray
 
Microarray
Microarray  Microarray
Microarray
 
Microarray CGH
Microarray CGHMicroarray CGH
Microarray CGH
 

Andere mochten auch

3 Afr. J. Biotech. Tariq%20et%20al
3  Afr. J. Biotech. Tariq%20et%20al3  Afr. J. Biotech. Tariq%20et%20al
3 Afr. J. Biotech. Tariq%20et%20alMohammad Tariq
 
RAPD Analysis Of Rapidly Multiplied In Vitro Plantlets of Anthurium Andreanum...
RAPD Analysis Of Rapidly Multiplied In Vitro Plantlets of Anthurium Andreanum...RAPD Analysis Of Rapidly Multiplied In Vitro Plantlets of Anthurium Andreanum...
RAPD Analysis Of Rapidly Multiplied In Vitro Plantlets of Anthurium Andreanum...IOSR Journals
 
RAPD_ISSR_cymbopogon_paper
RAPD_ISSR_cymbopogon_paperRAPD_ISSR_cymbopogon_paper
RAPD_ISSR_cymbopogon_paperSneha Ghosh
 
Annovi HRM and Epigenetics
Annovi HRM and EpigeneticsAnnovi HRM and Epigenetics
Annovi HRM and EpigeneticsGiulia Annovi
 
Somaclonal Variation: A new dimension for sugarcane improvement
Somaclonal Variation: A new dimension for sugarcane improvementSomaclonal Variation: A new dimension for sugarcane improvement
Somaclonal Variation: A new dimension for sugarcane improvementDr. siddhant
 
Random Amplified polymorphic DNA. RAPD
Random Amplified polymorphic DNA. RAPDRandom Amplified polymorphic DNA. RAPD
Random Amplified polymorphic DNA. RAPDUniversity of Mumbai
 
Dna markers lecture
Dna markers lectureDna markers lecture
Dna markers lectureBruno Mmassy
 

Andere mochten auch (14)

3 Afr. J. Biotech. Tariq%20et%20al
3  Afr. J. Biotech. Tariq%20et%20al3  Afr. J. Biotech. Tariq%20et%20al
3 Afr. J. Biotech. Tariq%20et%20al
 
RAPD Analysis Of Rapidly Multiplied In Vitro Plantlets of Anthurium Andreanum...
RAPD Analysis Of Rapidly Multiplied In Vitro Plantlets of Anthurium Andreanum...RAPD Analysis Of Rapidly Multiplied In Vitro Plantlets of Anthurium Andreanum...
RAPD Analysis Of Rapidly Multiplied In Vitro Plantlets of Anthurium Andreanum...
 
RAPD_ISSR_cymbopogon_paper
RAPD_ISSR_cymbopogon_paperRAPD_ISSR_cymbopogon_paper
RAPD_ISSR_cymbopogon_paper
 
Pcr, rapd dan rflp
Pcr, rapd dan rflpPcr, rapd dan rflp
Pcr, rapd dan rflp
 
Annovi HRM and Epigenetics
Annovi HRM and EpigeneticsAnnovi HRM and Epigenetics
Annovi HRM and Epigenetics
 
Somaclonal Variation: A new dimension for sugarcane improvement
Somaclonal Variation: A new dimension for sugarcane improvementSomaclonal Variation: A new dimension for sugarcane improvement
Somaclonal Variation: A new dimension for sugarcane improvement
 
DNA CHIPS
DNA CHIPSDNA CHIPS
DNA CHIPS
 
SCoT and RAPD
SCoT and RAPDSCoT and RAPD
SCoT and RAPD
 
Random Amplified polymorphic DNA. RAPD
Random Amplified polymorphic DNA. RAPDRandom Amplified polymorphic DNA. RAPD
Random Amplified polymorphic DNA. RAPD
 
Dna markers lecture
Dna markers lectureDna markers lecture
Dna markers lecture
 
Microarray
MicroarrayMicroarray
Microarray
 
Rapd ppt
Rapd pptRapd ppt
Rapd ppt
 
RFLP & RAPD
RFLP & RAPDRFLP & RAPD
RFLP & RAPD
 
Dna fingerprinting
Dna fingerprintingDna fingerprinting
Dna fingerprinting
 

Ähnlich wie 20100509 bioinformatics kapushesky_lecture03-04_0

Functional genomics
Functional genomicsFunctional genomics
Functional genomicsajay301
 
Genomics_Aishwarya Teli.pptx
Genomics_Aishwarya Teli.pptxGenomics_Aishwarya Teli.pptx
Genomics_Aishwarya Teli.pptxAishwaryaTeli5
 
Dna chips technology and utility.dr.yanal alkuddsi
Dna chips  technology and utility.dr.yanal alkuddsiDna chips  technology and utility.dr.yanal alkuddsi
Dna chips technology and utility.dr.yanal alkuddsiDr. Yanal A. Alkuddsi
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGLong Pei
 
Microarray @ujjwal sirohi
Microarray @ujjwal sirohiMicroarray @ujjwal sirohi
Microarray @ujjwal sirohiujjwal sirohi
 
Next generation sequencing by Muhammad Abbas
Next generation sequencing by Muhammad AbbasNext generation sequencing by Muhammad Abbas
Next generation sequencing by Muhammad AbbasMuhammadAbbaskhan9
 
115 genomic and proteomic screen
115 genomic and proteomic screen115 genomic and proteomic screen
115 genomic and proteomic screenSHAPE Society
 
Acc 2002 microarray mehran for print
Acc 2002 microarray mehran for printAcc 2002 microarray mehran for print
Acc 2002 microarray mehran for printSHAPE Society
 
A comprehensive study of microarray
A comprehensive study of microarrayA comprehensive study of microarray
A comprehensive study of microarrayPRABAL SINGH
 
DNA microarray.pdfghghvjjsjsjdhdhdhddhdhdjdhd
DNA microarray.pdfghghvjjsjsjdhdhdhddhdhdjdhdDNA microarray.pdfghghvjjsjsjdhdhdhddhdhdjdhd
DNA microarray.pdfghghvjjsjsjdhdhdhddhdhdjdhdMusaMusa68
 
DNA CHIPS AND MICROARRAY.pptx
DNA CHIPS AND MICROARRAY.pptxDNA CHIPS AND MICROARRAY.pptx
DNA CHIPS AND MICROARRAY.pptxShabnum
 

Ähnlich wie 20100509 bioinformatics kapushesky_lecture03-04_0 (20)

12 arrays
12 arrays12 arrays
12 arrays
 
12 arrays
12 arrays12 arrays
12 arrays
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Genomics_Aishwarya Teli.pptx
Genomics_Aishwarya Teli.pptxGenomics_Aishwarya Teli.pptx
Genomics_Aishwarya Teli.pptx
 
Seminar
SeminarSeminar
Seminar
 
Dna chips technology and utility.dr.yanal alkuddsi
Dna chips  technology and utility.dr.yanal alkuddsiDna chips  technology and utility.dr.yanal alkuddsi
Dna chips technology and utility.dr.yanal alkuddsi
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
 
Microarray @ujjwal sirohi
Microarray @ujjwal sirohiMicroarray @ujjwal sirohi
Microarray @ujjwal sirohi
 
Next generation sequencing by Muhammad Abbas
Next generation sequencing by Muhammad AbbasNext generation sequencing by Muhammad Abbas
Next generation sequencing by Muhammad Abbas
 
115 genomic and proteomic screen
115 genomic and proteomic screen115 genomic and proteomic screen
115 genomic and proteomic screen
 
Genomic and proteomic screen
Genomic and proteomic screenGenomic and proteomic screen
Genomic and proteomic screen
 
Acc 2002 microarray mehran for print
Acc 2002 microarray mehran for printAcc 2002 microarray mehran for print
Acc 2002 microarray mehran for print
 
Acc 2002 microarray mehran for print
Acc 2002 microarray mehran for printAcc 2002 microarray mehran for print
Acc 2002 microarray mehran for print
 
115 genomic and proteomic screen
115 genomic and proteomic screen115 genomic and proteomic screen
115 genomic and proteomic screen
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
31931 31941
31931 3194131931 31941
31931 31941
 
A comprehensive study of microarray
A comprehensive study of microarrayA comprehensive study of microarray
A comprehensive study of microarray
 
DNA microarray.pdfghghvjjsjsjdhdhdhddhdhdjdhd
DNA microarray.pdfghghvjjsjsjdhdhdhddhdhdjdhdDNA microarray.pdfghghvjjsjsjdhdhdhddhdhdjdhd
DNA microarray.pdfghghvjjsjsjdhdhdhddhdhdjdhd
 
DNA CHIPS AND MICROARRAY.pptx
DNA CHIPS AND MICROARRAY.pptxDNA CHIPS AND MICROARRAY.pptx
DNA CHIPS AND MICROARRAY.pptx
 

20100509 bioinformatics kapushesky_lecture03-04_0

  • 1. Gene Expression - Microarrays Misha Kapushesky European Bioinformatics Institute, EMBL St. Petersburg, Russia May 2010
  • 6. Compare gene expression in this cell type… … after drug treatment … at a later developmental time … in a different body region … after viral infection … in samples from patients … relative to a knockout
  • 7. Gene expression is context-dependent, and is regulated in several basic ways
  • 8. Outline: microarray data analysis Gene expression Microarrays Preprocessing normalization scatter plots Inferential statistics t-test ANOVA Exploratory (descriptive) statistics distances clustering principal components analysis (PCA)
  • 9. Microarrays: tools for gene expression A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array. Page 312
  • 10. Microarrays: tools for gene expression The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes.
  • 18. Advantages of microarray experiments. Fast: data on >20,000 transcripts within weeks. Comprehensive: entire yeast or mouse genome on a chip. Flexible: custom arrays can be made to represent genes of interest. Easy: submit RNA samples to a core facility. Cheap?: chip representing 20,000 genes for $300.
  • 19. Disadvantages of microarray experiments. Cost: some researchers can’t afford to do appropriate numbers of controls and replicates. RNA significance: the final product of gene expression is protein; “pervasive transcription” of the genome is poorly understood (ENCODE project); there are many noncoding RNAs not yet represented on microarrays. Quality control: impossible to assess elements on array surface; artifacts with image analysis; artifacts with data analysis; not enough attention to experimental design; not enough collaboration with statisticians.
  • 20. Sample acquisition → data acquisition → data analysis → data confirmation → biological insight
  • 21. Stage 1: Experimental design. Stage 2: RNA and probe preparation. Stage 3: Hybridization to DNA arrays. Stage 4: Image analysis. Stage 5: Microarray data analysis. Stage 6: Biological confirmation. Stage 7: Microarray databases.
  • 22. Stage 1: Experimental design [1] Biological samples: technical and biological replicates: determine the data analysis approach at the outset [2] RNA extraction, conversion, labeling, hybridization: except for RNA isolation, routinely performed at core facilities [3] Arrangement of array elements on a surface: randomization can reduce spatially-based artifacts Page 314
  • 23. Stage 2: RNA preparation. For Affymetrix chips, total RNA is needed (about 5 µg). Confirm purity by running an agarose gel. Measure A260/A280 to confirm purity and quantity. One of the greatest sources of error in microarray experiments is artifacts associated with RNA isolation; an appropriately balanced, randomized experimental design is necessary.
  • 24. Stage 3: Hybridization to DNA arrays. The array consists of cDNA or oligonucleotides. Oligonucleotides can be deposited by photolithography. The sample is converted to cRNA or cDNA. (Note that the terms “probe” and “target” may refer to the element immobilized on the surface of the microarray, or to the labeled biological sample; for clarity, it may be simplest to avoid both terms.)
  • 25. Stage 4: Image analysis. RNA transcript levels are quantitated. Fluorescence intensity is measured with a scanner.
  • 26. Differential gene expression on a cDNA microarray (Rett versus control): αB-crystallin is over-expressed in Rett syndrome
  • 32. Stage 6: Biological confirmation. Microarray experiments can be thought of as “hypothesis-generating” experiments. The differential up- or down-regulation of specific RNA transcripts can be measured using independent assays such as Northern blots, reverse transcription polymerase chain reaction (RT-PCR), and in situ hybridization. Page 320
  • 33. Stage 7: Microarray databases There are two main repositories: Gene Expression Omnibus (GEO) at NCBI ArrayExpress at the European Bioinformatics Institute (EBI)
  • 34. Microarray Overview I. Microbial ORFs → design PCR primers → PCR products. Eukaryotic genes → select cDNA clones → PCR products. Microtiter plates: many different plates containing different genes. For each plate set, many identical replicas of the microarray slide (with 60,000 or more spotted genes) are made.
  • 35. Microarray Overview II: prepare fluorescently labeled probes (control and test) → hybridize, wash → measure fluorescence in 2 channels (red/green) → analyze the data to identify patterns of gene expression
  • 36. Affymetrix GeneChip™ Expression Analysis: obtain RNA samples (control and test) → prepare fluorescently labeled probes → hybridize and wash chips → scan chips → analyze (PM, MM)
  • 37. Gene Microarray Expression Analysis. [Figure labels: tissue selection; differential state/stage selection; RNA preparation and labeling; competitive hybridization; spots on an array; fluorescence intensity; expression measurement]
  • 39. MIAME In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME). The MIAME framework standardizes six areas of information: ► experimental design ► microarray design ► sample preparation ► hybridization procedures ► image analysis ► controls for normalization Visit http://www.mged.org
  • 40. Interpretation of RNA analyses The relationship of DNA, RNA, and protein: DNA is transcribed to RNA. RNA quantities and half-lives vary. There tends to be a low positive correlation between RNA and protein levels. The pervasive nature of transcription: The Encyclopedia of DNA Elements (ENCODE) project identified functional features of genomic DNA, initially in 30 megabases (1% of the human genome). One of its observations was the “pervasive nature of transcription”: the vast majority of DNA is transcribed, although the function is unknown.
  • 42. Microarray data analysis • begin with a data matrix (gene expression values versus samples) [Figure label: genes (RNA transcript levels)]
  • 43. Microarray data analysis • begin with a data matrix (gene expression values versus samples) Typically, there are many genes (>> 20,000) and few samples ( ~ 10) Fig. 9.1 Page 333
  • 44. Microarray data analysis • begin with a data matrix (gene expression values versus samples) Preprocessing Inferential statistics Descriptive statistics
  • 46. Microarray data analysis: preprocessing The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment.
  • 47. Data analysis: global normalization. Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.
  • 48. Data analysis: global normalization. Global normalization is used to correct two or more data sets. Example: total fluorescence in the Cy3 channel = 4 million units; in the Cy5 channel = 2 million units. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.
  • 49. Data analysis: global normalization Global normalization procedure Step 1: subtract background intensity values (use a blank region of the array) Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
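The two-step procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the lecture: the function name `global_normalize`, the single background value per channel, and the choice to rescale Cy5 so that the channel totals match are all assumptions made for the example. It also assumes every intensity stays positive after background subtraction.

```python
def global_normalize(cy3, cy5, background_cy3=0.0, background_cy5=0.0):
    """Background-subtract both channels, then rescale Cy5 so the totals match.

    Hypothetical helper: returns the scale factor applied to Cy5 and the
    per-gene Cy3/Cy5 ratios after normalization.
    """
    # Step 1: subtract the background intensity measured on a blank region.
    green = [value - background_cy3 for value in cy3]
    red = [value - background_cy5 for value in cy5]

    # Step 2: rescale one channel so that total fluorescence is equal,
    # which brings the overall Cy3/Cy5 ratio to 1.
    scale = sum(green) / sum(red)
    red_scaled = [value * scale for value in red]

    ratios = [g / r for g, r in zip(green, red_scaled)]
    return scale, ratios


# The Cy3 channel is twice as bright overall, as in the slide's example.
scale, ratios = global_normalize([2000.0, 4000.0, 6000.0],
                                 [1000.0, 2000.0, 3000.0])
print(scale, ratios)
```

In this toy input every gene shows an uncorrected ratio of 2, but after rescaling all ratios equal 1, so the apparent 2-fold regulation is revealed as a channel-wide artifact.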
  • 50. Scatter plots. Useful to represent gene expression values from two microarray experiments (e.g. control, experimental). Each dot corresponds to a gene expression value. Most dots fall along a line. Outliers represent up-regulated or down-regulated genes.
  • 51. Differential gene expression in different tissue and cell types. [Figure panels: Brain; Astrocyte; Astrocyte; Fibroblast]
  • 52. [Figure: scatter plot of expression level (sample 1) versus expression level (sample 2), marked from low to high expression level, with up- and down-regulated genes indicated]
  • 54. Scatter plots. Typically, data are plotted on log-log coordinates. Visually, this spreads out the data and offers symmetry.

      time    behavior       raw ratio value   log2 ratio value
      t=0     basal          1.0                0.0
      t=1h    no change      1.0                0.0
      t=2h    2-fold up      2.0                1.0
      t=3h    2-fold down    0.5               -1.0
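The symmetry of the log scale can be checked directly. The sketch below recomputes the log2 column of the table and, as an added assumption not spelled out in the slides, defines the M (log ratio) and A (mean log intensity) values used when plotting log ratio against mean log intensity; the helper name `ma_values` is hypothetical.

```python
import math

# Raw ratios from the table: basal, no change, 2-fold up, 2-fold down.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# On a log2 scale, 2-fold up and 2-fold down are equally far from zero.
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}


def ma_values(red, green):
    """Return (M, A): the log2 ratio and the mean log2 intensity of one spot."""
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


print(log2_ratios)
print(ma_values(4.0, 1.0))
```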
  • 55. [Figure: plot of log ratio versus mean log intensity, marked from low to high expression level, with up- and down-regulated genes indicated]
  • 56. You can make these plots in Excel… … but for many bioinformatics applications use R. Visit http://www.r-project.org to download it.
  • 58. There are limits to what you can measure
  • 59. The Limits of log-ratios: The space we explore
  • 63. Bad Data from Parts Unknown (Gary Churchill). Each “pin group” is colored differently.
  • 65. Ratio Cy3/Cy5 for the same RNA, sorted from least to most expressed
  • 76. After RMA (a normalization procedure), the median is near zero, and skewing is corrected. Scatterplots display the effects of normalization. [Figure: two plots of M versus A]
  • 79. Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied. [Figure axes: array; log signal intensity]
  • 80. RMA can adjust for the effect of GC content. [Figure axes: GC content; log intensity]
  • 81. Robust multi-array analysis (RMA). RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software). [Figure: precision of RMA and MAS 5.0; axes: average log expression; log expression SD]
  • 82. Robust multi-array analysis (RMA). RMA offers comparable accuracy to MAS 5.0. [Figure: accuracy; axes: log nominal concentration; observed log expression]
  • 84. Inferential statistics. Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to reject the null hypothesis. For many applications, we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05.
  • 85. Analyzing expression data. Question: for each of my 20,000 transcripts, decide whether it is significantly regulated in some disease. [1] Obtain a matrix of genes (rows) and expression values (columns). Here there are 20,000 rows of genes, of which the first six are shown. There are three control samples and three disease samples. Calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.
  • 86. Analyzing expression data. [2] Calculate the ratios of control versus disease. Note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cut-off for changes of interest, such as two-fold.
  • 87. [Figure panels: “A significant difference”; “Probably not”]
  • 88. Inferential statistics. A t-test is a commonly used statistical test to assess the difference in mean values between two groups: $t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$, that is, the difference between mean values divided by the variability (standard error of the difference). Questions: Is the sample size (n) adequate? Are the data normally distributed? Is the variance of the data known? Is the variance the same in the two groups? Is it appropriate to set the significance level to p < 0.05?
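As a small worked illustration of the statistic, the sketch below computes t for one gene with three control and three disease values. It uses the Welch form of the standard error, which does not assume equal variances in the two groups; that choice, the function name `welch_t`, and the example values are assumptions for illustration, not taken from the lecture. Converting t to a p value requires the t distribution, which is left to a statistics package.

```python
import math
from statistics import mean, variance


def welch_t(group1, group2):
    """t = (mean1 - mean2) / SE, with SE from the two sample variances."""
    n1, n2 = len(group1), len(group2)
    # Standard error of the difference between the two means.
    se = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / se


# Hypothetical expression values for a single transcript.
control = [1.0, 2.0, 3.0]
disease = [4.0, 5.0, 6.0]
t_stat = welch_t(control, disease)
print(t_stat)
```

With only three samples per group, the estimate of the variance is itself noisy, which is one reason the slide asks whether the sample size is adequate.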
  • 90. Analyzing expression data. [3] Perform a t-test. The hypothesis is that the transcript in the disease group is up (or down) relative to controls.
  • 91. Analyzing expression data. [3] Note the results: you can have a small p value (<0.05) with a big ratio difference; a small p value (<0.05) with a trivial ratio difference; a large p value (>0.05) with a big ratio difference; or a large p value (>0.05) with a trivial ratio difference.
  • 92. Inferential statistics. Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes $p < 0.05/10{,}000$, or $p < 5 \times 10^{-6}$. The Bonferroni correction is generally considered to be too conservative.
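The arithmetic on this slide is easy to reproduce. The two helper names below are hypothetical and exist only to make the calculation explicit.

```python
def expected_false_positives(alpha, n_tests):
    """Number of tests expected to pass the threshold by chance alone."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


n_genes = 10_000
by_chance = expected_false_positives(0.05, n_genes)
corrected = bonferroni_threshold(0.05, n_genes)
print(by_chance, corrected)
```

The first value is the 500 genes expected to look regulated by chance at p < 0.05, and the second is the corrected per-gene criterion of about 5 × 10^-6.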
  • 93. Inferential statistics: false discovery rate. The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery. The expected number of false positives equals the p value threshold times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives); the FDR is the expected fraction of false positives among the transcripts reported as regulated. You can adjust the false discovery rate. For example:

      FDR     # regulated transcripts   # false discoveries
      0.1     100                       10
      0.05    45                        3
      0.01    20                        1

    Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?
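The slides do not name a specific FDR procedure. One widely used choice is the Benjamini-Hochberg step-up procedure, sketched below as an assumption rather than as the method used in the lecture; the function name and the example p values are made up for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the test is called significant.

    Sort the p values, find the largest rank k with p_(k) <= (k / m) * fdr,
    and call the k smallest p values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank

    significant = [False] * m
    for index in order[:cutoff_rank]:
        significant[index] = True
    return significant


# Hypothetical p values for ten transcripts.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
calls = benjamini_hochberg(p_values, fdr=0.05)
print(calls)
```

Here five transcripts have p < 0.05, but only the two smallest p values survive at an FDR of 0.05, which is less strict than a Bonferroni cut-off of 0.005 per test would be for the second of them.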
  • 95. A volcano plot displays both p values and fold change. [Figure axes: log fold change (treated/untreated); p value (treated versus control)]
  • 100. Descriptive statistics. Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are Euclidean distance and the Pearson coefficient of correlation.
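The difference between the two metrics shows up clearly on a toy example. The sketch below is a plain-Python illustration with made-up expression profiles; the function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Two hypothetical genes with the same shape but different magnitude.
gene1 = [1.0, 2.0, 3.0]
gene2 = [4.0, 5.0, 6.0]
print(euclidean(gene1, gene2), pearson(gene1, gene2))
```

The two profiles are far apart by Euclidean distance (about 5.2) yet perfectly correlated, so the choice of metric decides whether they count as similar.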
  • 101. What is a cluster? A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.
  • 102. Data matrix of genes versus samples (time points): 20 genes and 3 time points from Chu et al., 1998. Software: S-PLUS package
  • 103. 3D plot (using S-PLUS software) with axes t=0, t=0.5, and t=2.0
  • 104. Descriptive statistics: clustering Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. Page 355
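  • 105. Distance is defined by a metric. [Figure: distance D for two pairs of expression profiles. Euclidean: 6.0 and 1.4; Pearson*: +1.00 and -0.05]
  • 106. Distance is defined by a metric. [Figure: distance D for two pairs of expression profiles. Euclidean: 4.2 and 1.4; Pearson (r*-1): -1.00 and -0.90]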
  • 108. Agglomerative clustering of objects a, b, c, d, e (steps 0 to 4): a and b merge into {a,b}. Adapted from Kaufman and Rousseeuw (1990)
  • 109. Agglomerative clustering: d and e merge into {d,e}
  • 110. Agglomerative clustering: c joins {d,e} to form {c,d,e}
  • 111. Agglomerative clustering: {a,b} and {c,d,e} merge into {a,b,c,d,e}, and the tree is constructed
  • 113. Divisive clustering (steps 4 to 0): {a,b,c,d,e} is split, separating {c,d,e}
  • 114. Divisive clustering: {c,d,e} is split, separating {d,e}
  • 115. Divisive clustering: the tree now shows {a,b}, {d,e}, {c,d,e}, and {a,b,c,d,e}
  • 116. Divisive clustering: the remaining clusters are split into the single objects a, b, c, d, e, and the tree is constructed
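  • 117. Agglomerative versus divisive clustering of a, b, c, d, e: both trees contain {a,b}, {d,e}, {c,d,e}, and {a,b,c,d,e}; the agglomerative steps run from 0 to 4 and the divisive steps from 4 to 0. Adapted from Kaufman and Rousseeuw (1990)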
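The agglomerative build-up can be reproduced with a short sketch. The coordinates below are invented so that the merge order matches the slides, and the cluster-to-cluster distance is taken as the smallest pairwise distance (single linkage); both are assumptions for illustration, and the function name `agglomerate` is hypothetical.

```python
def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merges in order.

    `points` maps an object name to a 1-D coordinate. Cluster-to-cluster
    distance is the minimum pairwise distance between members.
    """
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(abs(points[u] - points[v])
                           for u in clusters[i] for v in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        merges.append(sorted(merged))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical positions for the five objects on the slides.
positions = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.5, "e": 8.9}
merge_order = agglomerate(positions)
print(merge_order)
```

The four merges correspond to steps 1 to 4 of the agglomerative tree; reading the same list backwards gives the order of splits in the divisive direction for this example.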
  • 120. Agglomerative and divisive clustering sometimes give conflicting results, as shown here. [Figure: two clusterings of items 1 to 12]
  • 122. Single Linkage. Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters. $D_{AB} = \min_{i,j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$.
  • 123. Average Linkage. Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance. $D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$.
  • 124. Complete Linkage. Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability. $D_{AB} = \max_{i,j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$, for all $i = 1, \dots, N_A$ and $j = 1, \dots, N_B$.
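The three linkage rules differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, with a hypothetical function name and made-up one-dimensional clusters:

```python
def linkage_distance(cluster_a, cluster_b, distance, method="single"):
    """Cluster-to-cluster distance D_AB under single, average, or complete linkage."""
    # All N_A * N_B pairwise distances d(u_i, v_j).
    pairwise = [distance(u, v) for u in cluster_a for v in cluster_b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


def absolute_difference(u, v):
    return abs(u - v)


# Two hypothetical clusters of one-dimensional points.
cluster_a = [0.0, 1.0]
cluster_b = [3.0, 5.0]
for method in ("single", "average", "complete"):
    print(method, linkage_distance(cluster_a, cluster_b, absolute_difference, method))
```

For these clusters the pairwise distances are 3, 5, 2, and 4, so the three rules give 2, 3.5, and 5 respectively.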
  • 125. Comparison of Linkage Methods Single Average Complete
  • 126. Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)
  • 127. [Figure: points A and B, and A′ and B′, in the (x1, x2) plane, illustrating Euclidean distance, chord distance, and angle distance]
  • 128. K-Means/Medians Clustering – 1. 1. Specify the number of clusters, e.g., 5. 2. Randomly assign genes to clusters. [Figure: genes G1 to G13]
  • 129. K-Means/Medians Clustering – 2. 3. Calculate the mean/median expression profile of each cluster. 4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile (calculated in step 3) is the closest to that gene’s expression profile. 5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached. k-means is most useful when the user has an a priori hypothesis about the number of clusters the genes should belong to. [Figure: genes G1 to G13]
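Steps 1 to 5 translate almost line by line into code. The sketch below is a minimal plain-Python k-means using squared Euclidean distance; the function name, the `initial` argument (used so the example is reproducible instead of random), and the rule for re-seeding an empty cluster are assumptions added for illustration.

```python
import random


def k_means(profiles, k, max_iter=100, initial=None, seed=0):
    """Cluster expression profiles; returns the cluster index of each gene."""
    rng = random.Random(seed)
    # Step 2: assign genes to clusters (randomly unless an assignment is given).
    if initial is not None:
        assignment = list(initial)
    else:
        assignment = [rng.randrange(k) for _ in profiles]

    for _ in range(max_iter):
        # Step 3: mean expression profile of each cluster.
        centroids = []
        for cluster in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == cluster]
            if not members:
                members = [rng.choice(profiles)]  # re-seed an empty cluster
            centroids.append([sum(col) / len(members) for col in zip(*members)])

        # Step 4: move each gene to the cluster with the closest mean profile.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in profiles
        ]

        # Step 5: stop when no gene changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


# Six hypothetical genes measured in two samples: three low, three high.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
clusters = k_means(profiles, k=2, initial=[0, 1, 0, 1, 0, 1])
print(clusters)
```

Starting from a deliberately mixed assignment, one round of shuffling is enough to separate the low-expression genes from the high-expression genes.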
  • 131. Principal components analysis (PCA). An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D. For a matrix of m genes x n samples, create a new covariance matrix of size n x n. This transforms some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
  • 132. Principal components analysis (PCA): objectives • to reduce dimensionality • to determine the linear combination of variables • to choose the most useful variables (features) • to visualize multidimensional data • to identify groups of objects (e.g. genes/samples) • to identify outliers
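To make the covariance step concrete, the sketch below builds the n x n covariance matrix for an m x n data matrix and finds the first principal component by power iteration. Power iteration is a simple stand-in chosen to keep the example free of external libraries, not the algorithm a statistics package would normally use; the function name and the toy data are assumptions, and the all-ones starting vector is assumed not to be orthogonal to the first component.

```python
import math


def first_principal_component(matrix, n_iter=200):
    """Return (direction, scores) of the first PC for an m x n data matrix."""
    m, n = len(matrix), len(matrix[0])

    # Center each column (sample) on its mean.
    means = [sum(row[j] for row in matrix) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in matrix]

    # n x n covariance matrix of the samples.
    cov = [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
           for i in range(n)]

    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * n
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project each gene onto the first principal component.
    scores = [sum(r[j] * v[j] for j in range(n)) for r in centered]
    return v, scores


# Four hypothetical genes in two samples, lying exactly on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(data)
print(direction, scores)
```

Because the toy data vary along a single line, one component with direction proportional to (1, 2) captures all of the variation, which is the sense in which PCA reduces dimensionality.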

Editor's notes

  1. This is an MA plot from one of the TIGR loop experiment arrays. I removed the zeros. It is a beauty. The very slight curvature reflects a mild dye effect that we can correct by shifting before taking logs. The bulk of genes lie on the zero line, but there is evidence of differential expression.
  2. For the sake of comparison, this is an array from an anonymous source. This MA plot shows several of the ‘bad’ things that can happen with an array. There is low-end truncation and high-end saturation. There are three ‘pin groups’ in different colors and each has its own unique curvature, so the dye effect is pin-group specific. Amazingly enough, as part of a larger experiment with replication of samples across arrays, there is information here that can be extracted, albeit with some effort required. PS: this is the worst case of the 48 arrays in the experiment.
  3. Affymetrix arrays use 25-base oligonucleotides to probe genes. There are two types of probes, the perfect match (PM) and the mismatch (MM) probes. 11-20 of these PM/MM probe pairs represent a gene.
  4. The MM probes are designed to measure non-specific binding and optical noise components in PM. Many expression measures are based on PM-MM, with the intention of correcting for non-specific binding and background noise. Problems: MMs are PMs for some genes; changing the middle base does not make a difference for some probes; subtracting MM adds variance, especially at the low end.
  5. Y-axis: log intensity of individual probes. The GC information could be accounted for in the analysis.
  6. RPKM: reads per kilobase of exon per million mapped reads. Estimation of gene expression (RPKM) values was performed as follows:
     1. Count the number of reads which map to constitutive exon bodies. The set of constitutive exons was derived from Ensembl genes (hg18, UCSC genome browser), where an exon was defined to be constitutive if present in all transcripts for a given gene.
     2. Determine the number of uniquely mappable positions in the same set of constitutive exons. "Uniquely mappable" was defined as being a unique 32-mer in the genome and our junction database.
     3. Count the total number of uniquely mapping reads in each tissue or sample.
     4. Compute RPKM as the number of reads which map per kilobase of exon model per million mapped reads for each gene, for each tissue or sample.
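The final step reduces to one formula: reads divided by exon length in kilobases, divided by total mapped reads in millions. A minimal sketch with a hypothetical function name and made-up counts:

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    kilobases = exon_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions_of_reads


# Hypothetical gene: 500 reads on 2,000 mappable exonic bases,
# in a sample with 10 million uniquely mapped reads.
value = rpkm(500, 2_000, 10_000_000)
print(value)
```

Dividing by length and by sequencing depth is what makes values comparable between long and short genes and between samples sequenced to different depths.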