Gene Expression Studies. Gene Expression Profiling-II. Microarray Data Analysis: Supervised Learning Algorithms. S. Prasanth Kumar, Bioinformatician, Dept. of Bioinformatics, Applied Botany Centre (ABC), Gujarat University, Ahmedabad, India. Facebook: www.facebook.com/Prasanth Sivakumar. SlideShare: prasanthperceptron.
Dimension reduction with principal components analysis. Like clustering algorithms, dimension reduction algorithms reduce the complexity of the data. Dimension reduction involves removing or consolidating features in the data set. Why are features removed? Some provide no significant incremental information, and some can confuse the analysis or make it unnecessarily complex. The aim is to choose a subset of conditions that contains "independent" information.
Dimension reduction with principal components analysis. Dimension reduction can be accomplished by principal components analysis (PCA). PCA automatically detects redundancies in the data and defines a new (smaller) set of hybrid features, or components, that are uncorrelated with one another and in that sense guaranteed not to be redundant.
Dimension reduction with principal components analysis. Start from an m x n matrix of m genes by n conditions. Center the data so that, for each condition, the mean expression is zero. Example values shown on the slide under the headings cond1 and cond2 (each labeled Mean = 0): -4 -2 -1 -6 8 -6 -2 -4 -13 4 -2 2 -5 -5 -15 10 5 7 8 -5
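The centering step can be sketched in a few lines of plain Python. This is a minimal illustration, not the slide author's code: the function name center_columns and the small 4 x 2 matrix are hypothetical, and rows are assumed to be genes and columns conditions.

```python
def center_columns(matrix):
    """Subtract each column's mean so that every condition has mean zero."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # One mean per condition (column).
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    return centered, means


# Hypothetical expression values: 4 genes (rows) x 2 conditions (columns).
expression = [[2.0, 10.0], [4.0, 14.0], [6.0, 12.0], [8.0, 16.0]]
centered, means = center_columns(expression)
```

After centering, each column of centered sums to zero, which is the property the slide labels "Mean = 0".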
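Dimension reduction with principal components analysis. Calculate the covariance matrix of the centered data. For n conditions this is an n x n matrix whose entry (i, j) is the covariance between conditions i and j. The slide gives a schematic illustration: for a linear expression ax + by + c with a = 25, b = 34, c = 8, the value 25x + 34y + 8 stands in for an eigenvalue. Formally, the eigenvalues of the covariance matrix C are the values lambda that solve det(C - lambda I) = 0. The slide shows the example data matrix twice as input to the covariance calculation: -4 -2 -1 -6 8 -6 -2 -4 -13 4 -2 2 -5 -5 -15 10 5 7 8 -5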
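A covariance matrix for mean-centered data can be computed as sketched below. The function name covariance_matrix and the input values are hypothetical, and the sample covariance (division by m - 1) is an assumption, since the slide does not say which normalization it uses.

```python
def covariance_matrix(centered):
    """Sample covariance between the columns of mean-centered data."""
    m = len(centered)       # number of genes (rows)
    n = len(centered[0])    # number of conditions (columns)
    return [
        [sum(row[i] * row[j] for row in centered) / (m - 1) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical centered data: 4 genes x 2 conditions, each column has mean zero.
centered = [[-3.0, -3.0], [-1.0, 1.0], [1.0, -1.0], [3.0, 3.0]]
cov = covariance_matrix(centered)
```

The result is an n x n symmetric matrix: the diagonal holds the variance of each condition and the off-diagonal entries hold the covariance between pairs of conditions.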
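Dimension reduction with principal components analysis. Example eigenvalue for conditions 1 and 2: 12. In the slide's schematic example, subtracting the eigenvalue gives 25x + 34y + 8 - 12 = 25x + 34y - 4, which stands in for the eigenvector equation; formally, each eigenvector v satisfies (C - lambda I) v = 0 for the covariance matrix C. Remember that there are n conditions, so the n x n covariance matrix has n eigenvalues and n eigenvectors, each of length n. Each eigenvector defines one principal component, and its eigenvalue is the variance captured by that component. Eigenvectors with large eigenvalues contain most of the information; eigenvectors with small eigenvalues are comparatively uninformative.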
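For two conditions the eigenvalues and eigenvectors have a closed form, which makes the step easy to check by hand. The sketch below assumes a symmetric 2 x 2 covariance matrix; the function name eigen_symmetric_2x2 and the example matrix are hypothetical, and real data sets with many conditions would use a numerical linear algebra library instead.

```python
import math


def eigen_symmetric_2x2(cov):
    """Return [(eigenvalue, unit eigenvector), ...] for a symmetric 2x2 matrix, largest first."""
    a, b, d = cov[0][0], cov[0][1], cov[1][1]
    if b == 0.0:
        # Already diagonal: the coordinate axes are the eigenvectors.
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    mid = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    pairs = []
    for lam in (mid + radius, mid - radius):
        # (b, lam - a) solves (a - lam) * x + b * y = 0.
        x, y = b, lam - a
        norm = math.hypot(x, y)
        pairs.append((lam, (x / norm, y / norm)))
    return pairs


# Hypothetical covariance matrix for two conditions.
cov = [[5.0, 3.0], [3.0, 5.0]]
(lam1, pc1), (lam2, pc2) = eigen_symmetric_2x2(cov)
explained = lam1 / (lam1 + lam2)  # share of total variance on the first component
```

Here the first eigenvector points along the diagonal where both conditions rise together, and explained gives the fraction of the total variance that a one-dimensional summary would retain.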
Dimension reduction with principal components analysis. Figure: visualization of 148-dimensional lymphoma data in two dimensions using principal component analysis. Legend: germinal cases; activated subtype cases.
Combining expression data with external information: supervised machine learning. Supervised methods incorporate outside knowledge, e.g. a set of known cases. They require a set of example expression profiles that are labeled with some phenotype or categorization, and they predict properties of unseen expression profiles. Figure: a known cancer profile and an unknown profile compared across genes 1 to 10. Result: the unknown profile matches condition profile 1.
Application of classification algorithms. Examples: predicting the function of a gene by comparing its expression profile to those of well-studied genes; disease diagnosis based on the gene expression profile of a pathologic specimen taken from a patient's biopsy. Classification approaches require the selection of a positive and a negative training set. Training sets contain the known cases. The positive set contains examples that belong to the class, such as genes with a particular function. The negative set contains examples that do not belong to the class, such as genes that are specifically confirmed not to have that function.
Nearest neighbor classification. Given a test case x and training examples x_1, x_2, ..., x_n, compute the distance from x to each training example, d_i = D(x, x_i), giving d_1, d_2, ..., d_n. Identify the k closest training examples. If more than k/2 of them are positive examples, predict that the test case belongs to the class represented by the positive training set.
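The voting rule above can be sketched as follows. The slide leaves the distance function D unspecified, so Euclidean distance is an assumption here; the function name knn_predict and the toy profiles are hypothetical.

```python
import math


def knn_predict(test_case, training, k):
    """Return True if more than k/2 of the k nearest training examples are positive.

    training is a list of (profile, is_positive) pairs.
    """
    # Sort training examples by Euclidean distance to the test case (assumed D).
    by_distance = sorted(training, key=lambda ex: math.dist(test_case, ex[0]))
    positives = sum(1 for _, is_positive in by_distance[:k] if is_positive)
    return positives > k / 2


# Hypothetical two-condition expression profiles with class labels.
training = [
    ((1.0, 1.0), True),
    ((1.2, 0.8), True),
    ((0.9, 1.1), True),
    ((5.0, 5.0), False),
    ((5.2, 4.9), False),
]
near_positive = knn_predict((1.1, 1.0), training, 3)
near_negative = knn_predict((5.1, 5.0), training, 3)
```

Choosing an odd k avoids ties in the vote; with an even k, the strict "more than k/2" rule sends tied cases to the negative class.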
Linear discriminant analysis. The basic assumption of LDA is that the positive and negative training examples can each be modeled with a normal distribution. Figure: a normal distribution with mean = 0. The model uses the probability density function of a multivariate normal distribution. Critical parameters: the mean on which the distribution is centered, mu, and the covariance matrix, S, which determines the shape of the distribution.
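The slide's equation image is not part of the transcript. For reference, the standard density of a multivariate normal distribution with mean mu and covariance matrix S, written for a profile x with n conditions (taking n as the dimension is an assumption consistent with the earlier slides), is:

```latex
f(x \mid \mu, S) =
  \frac{1}{(2\pi)^{n/2} \, \lvert S \rvert^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} S^{-1} (x - \mu) \right)
```

The mean shifts the center of the distribution, while S controls its spread and orientation through the quadratic form in the exponent.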
Linear discriminant analysis. Assume that the training examples are in a matrix, X, where each row represents a gene's expression profile. Calculate the mean of the positive training set and of the negative training set separately (mu+ and mu-). Construct the covariance matrix for the positive training set (S+) and for the negative training set (S-). Average the two covariance matrices: S = ½ (S+ + S-).
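The parameter estimates described above can be sketched in plain Python. The function names and the two small training sets are hypothetical, and the sample covariance (division by m - 1) is an assumption; the averaging step follows the slide's S = ½ (S+ + S-).

```python
def mean_vector(rows):
    """Mean expression profile of a training set (one value per condition)."""
    n = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(n)]


def covariance(rows):
    """Sample covariance matrix of a training set."""
    mu = mean_vector(rows)
    m, n = len(rows), len(mu)
    return [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (m - 1) for j in range(n)]
        for i in range(n)
    ]


def pooled_covariance(positive, negative):
    """Average of the two class covariance matrices: S = 1/2 (S+ + S-)."""
    s_pos, s_neg = covariance(positive), covariance(negative)
    n = len(s_pos)
    return [[0.5 * (s_pos[i][j] + s_neg[i][j]) for j in range(n)] for i in range(n)]


# Hypothetical training sets: each row is one expression profile (2 conditions).
positive = [[1.0, 2.0], [3.0, 2.0], [2.0, 4.0], [2.0, 0.0]]
negative = [[6.0, 5.0], [8.0, 7.0], [6.0, 7.0], [8.0, 5.0]]
mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
S = pooled_covariance(positive, negative)
```

Using one shared S for both classes is what makes the resulting decision boundary linear.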
Linear discriminant analysis. For an unseen test case x, calculate the log of the ratio of the probability of x assuming that it was generated by the positive model to the probability of x assuming that it was generated by the negative model. F+ is the normal distribution characterizing the positive training examples, and F- is the normal distribution characterizing the negative training examples.
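The slide's formula is not part of the transcript. Assuming both classes share the averaged covariance matrix S, so that F+ and F- are normal distributions with means mu+ and mu- and covariance S, the quadratic terms in x cancel and the log ratio reduces to a linear function of x:

```latex
\log \frac{F_{+}(x)}{F_{-}(x)}
  = (\mu_{+} - \mu_{-})^{\top} S^{-1} x
  - \tfrac{1}{2} \left( \mu_{+}^{\top} S^{-1} \mu_{+} - \mu_{-}^{\top} S^{-1} \mu_{-} \right)
```

This expression compares densities only; any difference in prior class frequencies would add a constant term, which is not considered here.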
Linear discriminant analysis. If the value of the log-likelihood ratio is greater than zero, assume that the test case x is consistent with the positive set. Yes: the unknown test profile x is classified as matching the known profile. No: the unknown test profile x does not match any known profile and is treated as a new condition profile.
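The decision rule can be sketched for two conditions, where the covariance matrix can be inverted by hand. This is a minimal illustration assuming a shared covariance matrix S and equal weight on both classes; the function names and the parameter values are hypothetical.

```python
def invert_2x2(s):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]


def bilinear(u, m, v):
    """Compute u^T m v for 2-vectors u, v and a 2x2 matrix m."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def log_likelihood_ratio(x, mu_pos, mu_neg, s):
    """log(F+(x) / F-(x)) for two normal models that share covariance s."""
    s_inv = invert_2x2(s)
    diff = [mu_pos[i] - mu_neg[i] for i in range(2)]
    offset = 0.5 * (bilinear(mu_pos, s_inv, mu_pos) - bilinear(mu_neg, s_inv, mu_neg))
    return bilinear(diff, s_inv, x) - offset


# Hypothetical model parameters for two conditions.
mu_pos, mu_neg = [2.0, 2.0], [7.0, 6.0]
S = [[1.0, 0.0], [0.0, 2.0]]

score_a = log_likelihood_ratio([2.0, 3.0], mu_pos, mu_neg, S)
score_b = log_likelihood_ratio([7.0, 5.0], mu_pos, mu_neg, S)
is_positive_a = score_a > 0  # consistent with the positive set
is_positive_b = score_b > 0  # not consistent with the positive set
```

A score of exactly zero marks the decision boundary, which for a shared covariance matrix is a straight line passing through the midpoint between the two class means.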
Linear discriminant analysis. Figure legend: positive training cases (white squares); negative training cases (dark squares); calculated mean for the positive training set (+ve TS) and for the negative training set (-ve TS); normal distribution (ND) of the +ve TS and of the -ve TS; values present in the +ve and -ve covariance matrices; and the collection of points where the density of the +ve distribution equals the density of the -ve distribution.
Microarray data analysis software. Commercial software packages by microarray manufacturers: Affymetrix currently offers MAS 5.0 software. Commercial: BioDiscovery, GeneSifter, MATLAB, Partek Genomics Suite, Spotfire. Spreadsheet programs: Microsoft Excel. Statistics packages: S-PLUS, STATA, SAS. Open source software: BioConductor. Simple analysis tools: GEO at NCBI, ArrayExpress at EBI, CIBEX at DDBJ.
Thank You For Your Attention !!!
