1. Gene Expression Studies: Gene Expression Profiling-II. Microarray Data Analysis: Supervised Learning Algorithms. S.Prasanth Kumar, Bioinformatician, Dept. of Bioinformatics, Applied Botany Centre (ABC), Gujarat University, Ahmedabad, INDIA. Resources on SlideShare: prasanthperceptron
2. Dimension reduction with principal components analysis. Like clustering algorithms, dimension-reduction algorithms reduce the complexity of the data. Dimension reduction involves removing or consolidating features in the data set. Why are features removed? Because they do not provide any significant incremental information, and they can confuse the analysis or make it unnecessarily complex. The goal is to choose a subset of conditions that contains "independent" information.
3. Dimension reduction with principal components analysis. Dimension reduction can be accomplished by principal components analysis (PCA), which automatically detects redundancies in the data and defines a new (smaller) set of hybrid features, or components, that are guaranteed not to be redundant.
4. Dimension reduction with principal components analysis. Start from an m x n matrix: m genes measured under n conditions. Center the data so that for each condition the mean expression is zero, i.e. subtract each condition's column mean (e.g. cond1 and cond2 each end up with mean = 0). [Slide shows an example matrix of centered expression values.]
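The centering step can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the slides; the matrix values are made up for demonstration.

```python
import numpy as np

# Hypothetical 5-gene x 2-condition expression matrix (rows = genes,
# columns = conditions); values are illustrative only.
X = np.array([[4.0, 1.0],
              [2.0, 3.0],
              [6.0, 5.0],
              [8.0, 2.0],
              [5.0, 4.0]])

# Center each condition (column) so its mean expression is zero.
X_centered = X - X.mean(axis=0)

print(X_centered.mean(axis=0))  # each column mean is now 0
```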
5. Dimension reduction with principal components analysis. From the centered matrix, calculate the n x n covariance matrix C of the conditions. The eigenvalues of C are the solutions lambda of the characteristic equation det(C - lambda*I) = 0.
6. Dimension reduction with principal components analysis. Each eigenvalue (e.g. lambda = 12) has an associated eigenvector, found by solving (C - lambda*I)v = 0; remember that with n conditions there are n eigenvalue/eigenvector pairs. Each eigenvector is a principal component. Eigenvectors with large eigenvalues contain most of the information; eigenvectors with small eigenvalues are uninformative and can be discarded.
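The covariance/eigendecomposition procedure above can be sketched end to end in NumPy. This is a hypothetical example (the redundant third condition and all numbers are invented for illustration), not the lymphoma data from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 genes under 3 conditions, where condition 3
# is nearly redundant with condition 1 (plus a little noise).
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0],
                     base[:, 1],
                     base[:, 0] + 0.05 * rng.normal(size=100)])
X = X - X.mean(axis=0)                     # center each condition

# n x n covariance matrix of the conditions.
C = np.cov(X, rowvar=False)

# Solve det(C - lambda*I) = 0: eigenvalues and eigenvectors.
eigvals, eigvecs = np.linalg.eigh(C)       # returned in ascending order
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep components with large eigenvalues; the small ones are uninformative.
k = 2
X_reduced = X @ eigvecs[:, :k]             # project onto first k components
print(eigvals)  # smallest eigenvalue is near zero (redundant condition)
```

Because condition 3 duplicates condition 1, one eigenvalue is close to zero and the data are effectively two-dimensional, which is exactly the redundancy PCA is meant to detect.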
7. Dimension reduction with principal components analysis. [Figure: visualization of 148-dimensional lymphoma data in two dimensions using principal component analysis; germinal-center cases and activated-subtype cases are labeled.]
8. Combining expression data with external information: supervised machine learning. Supervised methods incorporate outside knowledge, e.g. a set of known cases. They require a set of example expression profiles labeled with some phenotype or categorization, and then predict properties of unseen expression profiles. [Slide shows known cancer profiles 1-10 compared against an unknown profile; result: the unknown profile matches condition profile 1.]
9. Application of classification algorithms. Examples: predicting the function of a gene by comparing its expression profile to those of well-studied genes; disease diagnosis based on the gene expression profile of a pathologic specimen taken from a patient's biopsy. Classification approaches require the selection of a positive and a negative training set. Training sets contain the known cases: the positive set contains examples that belong to the class, such as genes with a particular function; the negative set contains examples that do not belong to the class, such as genes confirmed not to have that function.
10. Nearest neighbor classification. Given a test case x and training examples x_1, x_2, ..., x_n, compute distance values d_1, d_2, ..., d_n, where d_i = D(x, x_i). The closest k training examples are identified. If more than k/2 of them are positive examples, predict that the test case belongs to the class represented by the positive training set.
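The k-nearest-neighbor rule above can be sketched directly in NumPy. This is a minimal illustration with invented two-condition profiles, assuming Euclidean distance for D(x, x_i).

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Predict positive (True) if more than k/2 of the k nearest
    training examples are positive; d_i = D(x, x_i) is Euclidean."""
    d = np.linalg.norm(X_train - x, axis=1)  # distance to every example
    nearest = np.argsort(d)[:k]              # indices of the k closest
    return np.sum(y_train[nearest]) > k / 2  # majority vote

# Hypothetical 2-condition training profiles: positives (label 1)
# cluster near (1, 1), negatives (label 0) near (-1, -1).
X_train = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
                    [-1.0, -1.1], [-0.8, -1.0], [-1.2, -0.9]])
y_train = np.array([1, 1, 1, 0, 0, 0])

print(knn_classify(np.array([0.8, 0.9]), X_train, y_train, k=3))  # True
```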
11. Linear discriminant analysis. The basic assumption of LDA is that the positive and negative training examples can each be modeled with a multivariate normal distribution. The probability density function of a multivariate normal distribution has two critical parameters: the mean mu, on which the distribution is centered, and the covariance matrix S, which determines the shape of the distribution.
12. Linear discriminant analysis. Assume the training examples are in a matrix X, where each row represents a gene's expression profile. Calculate the mean of the positive training set and of the negative training set separately (mu+ and mu-). Construct the covariance matrix of the positive training set (S+) and of the negative training set (S-), then average them: S = 1/2 (S+ + S-).
13. Linear discriminant analysis. For an unseen test case x, calculate the log of the ratio of the probability of x assuming it was generated by the positive model to the probability of x assuming it was generated by the negative model: L(x) = log [ f+(x) / f-(x) ], where f+ is the normal distribution characterizing the positive training examples and f- is the normal distribution characterizing the negative training examples.
14. Linear discriminant analysis. If the value of the log likelihood ratio is greater than zero, assume that the test case x is consistent with the positive set: classify the unknown test profile x as matching the known profile. Otherwise, classify x as not matching any known profile (a NEW condition profile).
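The whole LDA recipe from the last three slides (class means, pooled covariance S = 1/2 (S+ + S-), and the log-likelihood-ratio decision rule) can be sketched as follows. This is a hypothetical example with invented two-condition profiles; with a shared covariance matrix the normal densities' normalizing constants cancel in the ratio, leaving only the Mahalanobis-distance terms.

```python
import numpy as np

def lda_fit(X_pos, X_neg):
    """Class means mu+ and mu-, and pooled covariance S = 0.5*(S+ + S-)."""
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S = 0.5 * (np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False))
    return mu_pos, mu_neg, S

def lda_log_likelihood_ratio(x, mu_pos, mu_neg, S):
    """log f+(x) - log f-(x). With a shared covariance S the constants
    cancel, leaving half the difference of Mahalanobis distances."""
    Sinv = np.linalg.inv(S)
    d_pos = (x - mu_pos) @ Sinv @ (x - mu_pos)
    d_neg = (x - mu_neg) @ Sinv @ (x - mu_neg)
    return 0.5 * (d_neg - d_pos)

# Hypothetical training profiles (rows = expression profiles).
rng = np.random.default_rng(1)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
mu_pos, mu_neg, S = lda_fit(X_pos, X_neg)

x = np.array([1.5, 1.8])                       # unseen test case
score = lda_log_likelihood_ratio(x, mu_pos, mu_neg, S)
print(score > 0)   # True: x is classified as consistent with the positive set
```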
15. Linear discriminant analysis. [Figure: positive training cases (white squares) and negative training cases (dark squares), with the calculated mean and fitted normal distribution of each set; the contour shapes reflect the values in the positive and negative covariance matrices. The decision boundary is the collection of points where the density of the positive distribution equals the density of the negative distribution.]
16. Microarray Data Analysis Software. Commercial packages from microarray manufacturers and vendors: Affymetrix MAS 5.0, BioDiscovery, GeneSifter, MATLAB, Partek Genomics Suite, Spotfire. Spreadsheet programs: Microsoft Excel. Statistics packages: S-PLUS, STATA, SAS. Open source software: BioConductor. Simple analysis tools: GEO at NCBI, ArrayExpress at EBI, CIBEX at DDBJ.