Gene Selection via Significant Subset using Silhouette Index

Juan Ignacio Pastore (1,2), Guillermo Abras (1), Diego Sebastían Comas (1), Marcel Brun (1), Virginia Ballarin (1)
(1) Laboratorio de Procesos y Medición de Señales, Facultad de Ingeniería, UNMdP
(2) Comisión Nacional de Investigaciones Científicas y Técnicas CONICET
mbrun@fi.mdp.edu.ar

   Introduction
Gene selection is an important task in bioinformatics, in which significant genes are chosen according to some criterion of significance. For classification problems, such as disease vs. normal or tissue type, the criterion is the ability of the genes to provide good features for the classification task. In other cases it is of interest to select large groups of genes with similar behavior, regardless of the class. This task is usually carried out by a clustering algorithm, which groups the whole family of genes, or a subset of them, into significant clusters. These techniques provide insight into possible co-regulation between genes, but they usually produce large, sometimes enormous, sets, depending on the number of clusters requested. In this work we present a new algorithm that provides sets of genes with very similar expression. It does so by using the complete clustering tree produced by the hierarchical clustering algorithm, and the Silhouette index to rank the subsets.


   Algorithm


Microarray Data -> Hierarchical Clustering -> Silhouette Index -> Selected Sets



   Hierarchical Clustering
Hierarchical clustering is an agglomerative partitioning algorithm that identifies compact subsets of the data in an iterative procedure. The result of the algorithm is a dendrogram, a tree structure recording all the steps of the grouping process.

   Silhouette Index
The Silhouette index measures not only the compactness of the clusters, but also the distance between them. The higher the index, the more compact and separated from each other the clusters are.

[Figure: three scatter plots of RAS2 against RAS1, both axes from 0 to 1, with QI: 0.29, QI: 0.43 and QI: 0.69.]

$$S = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{n_k} \sum_{x \in C_k} S(x) \right]$$

$$S(x) = \frac{b(x) - a(x)}{\max\left[ a(x),\, b(x) \right]}$$

$$a(x) = \frac{1}{n_k - 1} \sum_{y \in C_k,\; y \neq x} d(x, y)$$

$$b(x) = \min_{h = 1, \dots, K,\; h \neq k} \left[ \frac{1}{n_h} \sum_{y \in C_h} d(x, y) \right]$$
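As an illustration only, the definitions of S, S(x), a(x) and b(x) can be sketched in Python with NumPy. The function name `silhouette_index`, the use of a precomputed distance matrix, and the convention S(x) = 0 for single-element clusters (where a(x) is undefined) are assumptions of this sketch, not details given in the poster.

```python
import numpy as np

def silhouette_index(dist, labels):
    """Silhouette index S of a partition, following the formulas for
    S, S(x), a(x) and b(x).

    dist   : (n, n) symmetric matrix of pairwise distances d(x, y)
    labels : length-n sequence with the cluster label of each point
    Returns (S, per_cluster_mean, per_point_S).
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("b(x) needs at least two clusters")

    s = np.zeros(labels.size)
    for i in range(labels.size):
        own = labels == labels[i]
        n_k = own.sum()
        if n_k == 1:
            continue  # assumed convention: S(x) = 0 for a singleton
        # a(x): mean distance to the other members of the same cluster
        # (d(x, x) = 0, so summing over the whole cluster is harmless)
        a = dist[i, own].sum() / (n_k - 1)
        # b(x): smallest mean distance to the members of another cluster
        b = min(dist[i, labels == h].mean()
                for h in clusters if h != labels[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0

    # Inner average over each cluster, outer average over the K clusters
    per_cluster = {k: float(s[labels == k].mean()) for k in clusters}
    S = float(np.mean(list(per_cluster.values())))
    return S, per_cluster, s
```

For four points on a line at 0, 1, 10 and 11, split into the clusters {0, 1} and {10, 11}, this gives an index of about 0.90, which reflects two compact and well-separated groups.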




   Experimental Data

Experiment: E-GEOD-15653    Submitter(s): Patti    Lab: Joslin Diabetes Center, Mary Elizabeth Patti.
(Generated description): Experiment with 18 hybridizations, using 18 samples of species [Homo sapiens], using 18 arrays of array design [Affymetrix GeneChip® Human Genome HG-U133A [HG-U133A]], producing 18 raw data files and 18 transformed and/or normalized data files.
(Submitter's description 1): Hepatic lipid accumulation is an important complication of obesity linked to risk for type 2 diabetes. To identify novel transcriptional changes in human liver which could
contribute to hepatic lipid accumulation and associated insulin resistance and type 2 diabetes (DM2), we evaluated gene expression and gene set enrichment in surgical liver biopsies from 13
obese (9 with DM2) and 5 control subjects, obtained in the fasting state at the time of elective abdominal surgery for obesity or cholecystectomy. RNA was isolated for cRNA preparation and
hybridized to Affymetrix U133A microarrays. Experiment Overall Design: Human liver samples were obtained from 5 lean control subjects undergoing elective cholecystectomy and 13 obese
subjects (with or without Type 2 diabetes) undergoing gastric bypass surgery. Subjects with diabetes were classified as either well-controlled or poorly-controlled.

     Experiments
Ideally, we would choose compact and well-separated clusters of genes by computing the Silhouette index of compactness [1,2,3] on every possible subset of the N genes. This approach may take an impractical amount of time, since there are 2^N such sets; therefore we propose a sub-optimal search that limits the computation of the index to the sets provided by the hierarchical clustering algorithm, not only at the final stage but at every intermediate step. If there are N genes, there are N such groupings, the first with N clusters (subsets) and the last with a single large cluster, for a total of N(N+1)/2 candidate subsets. Because the groupings overlap, only about 2N distinct subsets need to be processed (2N - 1: the N single genes plus one new cluster for each of the N - 1 merges), and because of the way the clustering algorithm works, most of them will be compact. The Silhouette index then ensures that the selected groups are also well separated from the others.
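The search over dendrogram subsets can be sketched as follows, using SciPy's `linkage` for the hierarchical clustering. The poster does not say how a single subset is scored, so this sketch assumes that each cluster is scored by the mean S(x) of its members, computed against the partition that exists right after the merge that creates it. Single genes and the final all-gene cluster are skipped because the index is undefined for them. The names `cluster_silhouette` and `rank_dendrogram_subsets`, the average linkage and the correlation distance are likewise illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

def cluster_silhouette(dist, labels, k):
    """Mean S(x) over the members of cluster k in the current partition."""
    members = np.flatnonzero(labels == k)
    others = [h for h in np.unique(labels) if h != k]
    if members.size < 2 or not others:
        return float("nan")  # undefined: singleton, or only one cluster left
    scores = []
    for x in members:
        a = dist[x, members].sum() / (members.size - 1)
        b = min(dist[x, labels == h].mean() for h in others)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def rank_dendrogram_subsets(expr, method="average", metric="correlation"):
    """Score every cluster created while building the dendrogram.

    expr : (N, samples) array, one row per gene (or probe set).
    Returns a list of (score, sorted gene indices), best score first.
    """
    n = expr.shape[0]
    condensed = pdist(expr, metric=metric)
    dist = squareform(condensed)
    Z = linkage(condensed, method=method)  # N - 1 merge steps

    members = {i: [i] for i in range(n)}   # leaves: the N single genes
    labels = np.arange(n)                  # current partition
    ranked = []
    for step, (left, right, _height, _size) in enumerate(Z):
        new_id = n + step                  # SciPy's id for the merged cluster
        members[new_id] = members[int(left)] + members[int(right)]
        labels[members[new_id]] = new_id
        score = cluster_silhouette(dist, labels, new_id)
        if not np.isnan(score):
            ranked.append((score, sorted(members[new_id])))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

Called as `rank_dendrogram_subsets(expr)[:10]`, the function would return the ten best-scoring subsets. Each merge re-evaluates distances against the whole current partition, so the cost grows roughly cubically with N; a practical tool would probably need a more incremental computation for a full microarray.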


   Software Implementation




[Figures: microarray data; program's interface for gene selection; performance results.]




   Results
In the sample data used for testing, the top selected sets were consistent, and many of the genes in each group were related by function. The table below shows one of the top sets, with a Silhouette index of 0.94. It consists of two probes for the same gene (GSTM1) and one probe for gene GSTM2; both genes are members of the mu class of enzymes, which functions in the detoxification of electrophilic compounds.

Probe Set ID   Gene Title                                Gene Symbol   Biological Process Term   Molecular Function Term                                  Cellular Component Term
204550_x_at    glutathione S-transferase mu 1            GSTM1         metabolic process         glutathione transferase activity; transferase activity   cytoplasm
204418_x_at    glutathione S-transferase mu 2 (muscle)   GSTM2         metabolic process         glutathione transferase activity; transferase activity   cytoplasm
215333_x_at    glutathione S-transferase mu 1            GSTM1         metabolic process         glutathione transferase activity; transferase activity   cytoplasm

   Conclusion
The proposed method may be a powerful tool for biologists and computational biology researchers interested in generating new hypotheses about co-expressed genes, which are not provided by more standard analysis tools.

   References
[1] Rousseeuw, Peter J., "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 20(1), pp. 53-65, 1987.
[2] Pearson, John V. et al., "Identification of the Genetic Basis for Complex Disorders by Use of Pooling-Based Genomewide Single-Nucleotide-Polymorphism Association Studies", The American Journal of Human Genetics, 80, pp. 126-139, 2007.
[3] Jianping Hua, David W. Craig, Marcel Brun, Jennifer Webster, Victoria Zismann, Waibhav Tembe, Keta Joshipura, Matthew J. Huentelman, Edward R. Dougherty, Dietrich A. Stephan, "SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays", Bioinformatics, 23(1), pp. 57-63, 2007.

More Related Content

Viewers also liked

Feature Detection in Ajax-enabled Web Applications
Feature Detection in Ajax-enabled Web ApplicationsFeature Detection in Ajax-enabled Web Applications
Feature Detection in Ajax-enabled Web Applications
Nikolaos Tsantalis
 
A Multidimensional Empirical Study on Refactoring Activity
A Multidimensional Empirical Study on Refactoring ActivityA Multidimensional Empirical Study on Refactoring Activity
A Multidimensional Empirical Study on Refactoring Activity
Nikolaos Tsantalis
 

Viewers also liked (10)

Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and I...
Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and I...Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and I...
Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and I...
 
La Unidad de Bioinformática del INTA
La Unidad de Bioinformática del INTALa Unidad de Bioinformática del INTA
La Unidad de Bioinformática del INTA
 
Cooperatividad en la Expresión Génica: Abordaje Estocástico
Cooperatividad en la Expresión Génica: Abordaje EstocásticoCooperatividad en la Expresión Génica: Abordaje Estocástico
Cooperatividad en la Expresión Génica: Abordaje Estocástico
 
Network Analysis with networkX : Fundamentals of network theory-1
Network Analysis with networkX : Fundamentals of network theory-1Network Analysis with networkX : Fundamentals of network theory-1
Network Analysis with networkX : Fundamentals of network theory-1
 
Structural Order and Disorder Dictate Sequence And Functional Evolution of th...
Structural Order and Disorder Dictate Sequence And Functional Evolution of th...Structural Order and Disorder Dictate Sequence And Functional Evolution of th...
Structural Order and Disorder Dictate Sequence And Functional Evolution of th...
 
Prediction of heparin binding sites on GAPDH
Prediction of heparin binding sites on GAPDHPrediction of heparin binding sites on GAPDH
Prediction of heparin binding sites on GAPDH
 
About using new descriptors for cheminformatics
About using new descriptors for cheminformaticsAbout using new descriptors for cheminformatics
About using new descriptors for cheminformatics
 
Feature Detection in Ajax-enabled Web Applications
Feature Detection in Ajax-enabled Web ApplicationsFeature Detection in Ajax-enabled Web Applications
Feature Detection in Ajax-enabled Web Applications
 
A Multidimensional Empirical Study on Refactoring Activity
A Multidimensional Empirical Study on Refactoring ActivityA Multidimensional Empirical Study on Refactoring Activity
A Multidimensional Empirical Study on Refactoring Activity
 
Simply shape
Simply shapeSimply shape
Simply shape
 

Similar to Gene selection via significant subset using silhouette index

Harvard_University_-_Linear_Al
Harvard_University_-_Linear_AlHarvard_University_-_Linear_Al
Harvard_University_-_Linear_Al
ramiljayureta
 
Harvard_University_-_Linear_Al
Harvard_University_-_Linear_AlHarvard_University_-_Linear_Al
Harvard_University_-_Linear_Al
Ramil Jay Ureta
 
TunUp final presentation
TunUp final presentationTunUp final presentation
TunUp final presentation
Gianmario Spacagna
 
11.performance evaluation of geometric active contour (gac) and enhanced geom...
11.performance evaluation of geometric active contour (gac) and enhanced geom...11.performance evaluation of geometric active contour (gac) and enhanced geom...
11.performance evaluation of geometric active contour (gac) and enhanced geom...
Alexander Decker
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 

Similar to Gene selection via significant subset using silhouette index (12)

Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 
Harvard_University_-_Linear_Al
Harvard_University_-_Linear_AlHarvard_University_-_Linear_Al
Harvard_University_-_Linear_Al
 
Harvard_University_-_Linear_Al
Harvard_University_-_Linear_AlHarvard_University_-_Linear_Al
Harvard_University_-_Linear_Al
 
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkOBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
 
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkkOBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
OBJECTRECOGNITION1.pptxjjjkkkkjjjjkkkkkkk
 
TunUp final presentation
TunUp final presentationTunUp final presentation
TunUp final presentation
 
11.performance evaluation of geometric active contour (gac) and enhanced geom...
11.performance evaluation of geometric active contour (gac) and enhanced geom...11.performance evaluation of geometric active contour (gac) and enhanced geom...
11.performance evaluation of geometric active contour (gac) and enhanced geom...
 
Performance evaluation of geometric active contour (gac) and enhanced geometr...
Performance evaluation of geometric active contour (gac) and enhanced geometr...Performance evaluation of geometric active contour (gac) and enhanced geometr...
Performance evaluation of geometric active contour (gac) and enhanced geometr...
 
Independent Component Analysis
Independent Component Analysis Independent Component Analysis
Independent Component Analysis
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Illustration Clamor Echelon Evaluation via Prime Piece Psychotherapy
Illustration Clamor Echelon Evaluation via Prime Piece PsychotherapyIllustration Clamor Echelon Evaluation via Prime Piece Psychotherapy
Illustration Clamor Echelon Evaluation via Prime Piece Psychotherapy
 

More from Asociación Argentina de Bioinformática y Biología Computacional

More from Asociación Argentina de Bioinformática y Biología Computacional (10)

Signals of Evolution: Conservation, Specificity Determining Positions and Coe...
Signals of Evolution: Conservation, Specificity Determining Positions and Coe...Signals of Evolution: Conservation, Specificity Determining Positions and Coe...
Signals of Evolution: Conservation, Specificity Determining Positions and Coe...
 
Predicting peptide/MHC interactions: Application to epitope identification an...
Predicting peptide/MHC interactions: Application to epitope identification an...Predicting peptide/MHC interactions: Application to epitope identification an...
Predicting peptide/MHC interactions: Application to epitope identification an...
 
Design of degenerated primers from bioinformatics online software for putativ...
Design of degenerated primers from bioinformatics online software for putativ...Design of degenerated primers from bioinformatics online software for putativ...
Design of degenerated primers from bioinformatics online software for putativ...
 
A structure-function analysis of s HSPs in plants
A structure-function analysis of s HSPs in plantsA structure-function analysis of s HSPs in plants
A structure-function analysis of s HSPs in plants
 
Modelado de la proteína p35 de toxoplasma gondii
Modelado de la proteína p35 de toxoplasma gondiiModelado de la proteína p35 de toxoplasma gondii
Modelado de la proteína p35 de toxoplasma gondii
 
Data balancing for phenotype classification based on SNPs
Data balancing for phenotype classification based on SNPsData balancing for phenotype classification based on SNPs
Data balancing for phenotype classification based on SNPs
 
Bolstered error estimation for discrete classifier applied to genomic signal ...
Bolstered error estimation for discrete classifier applied to genomic signal ...Bolstered error estimation for discrete classifier applied to genomic signal ...
Bolstered error estimation for discrete classifier applied to genomic signal ...
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
¿Cuál es la estabilidad relevante de las proteínas?
¿Cuál es la estabilidad relevante de las proteínas?¿Cuál es la estabilidad relevante de las proteínas?
¿Cuál es la estabilidad relevante de las proteínas?
 
Biogeografía histórica y Análisis de Vicarianza: Una perspectiva computacional
Biogeografía histórica y Análisis de Vicarianza: Una perspectiva computacionalBiogeografía histórica y Análisis de Vicarianza: Una perspectiva computacional
Biogeografía histórica y Análisis de Vicarianza: Una perspectiva computacional
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Gene selection via significant subset using silhouette index

  • 1. Gene Selection via Significant Subset using Silhouette Index

Juan Ignacio Pastore (1,2), Guillermo Abras (1), Diego Sebastían Comas (1), Marcel Brun (1), Virginia Ballarin (1)
(1) Laboratorio de Procesos y Medición de Señales, Facultad de Ingeniería, UNMdP
(2) Comisión Nacional de Investigaciones Científicas y Técnicas CONICET
mbrun@fi.mdp.edu.ar

Introduction
Gene selection is an important task in bioinformatics, where significant genes are chosen using some criterion of significance. In classification problems (disease vs. normal, tissue type, etc.), the criterion is the ability to provide good features for the classification task. In other cases it is of interest to select large groups of genes with similar behavior, regardless of the class. This task is usually carried out by clustering algorithms, where the whole family of genes, or a subset of them, is grouped into significant clusters. These techniques provide insight into possible co-regulation between genes, but they usually produce large, sometimes enormous, sets, depending on the number of clusters required. In this work we present a new algorithm that provides sets of genes with very similar expression. This is possible by using the complete clustering tree provided by the hierarchical clustering algorithm, and the Silhouette index to rank the subsets.

Algorithm
Microarray Data → Hierarchical Clustering → Silhouette Index → Selected Sets

Hierarchical Clustering
Hierarchical clustering is an agglomerative partitioning algorithm that identifies compact subsets of the data in an iterative procedure. The result of the algorithm is a dendrogram, a tree structure that records every step of the grouping process.

Silhouette Index
The Silhouette index measures not only the compactness of the clusters, but also the distance between them. The higher the index, the more compact and the more separated from each other the clusters are. For a partition into K clusters C_1, ..., C_K, where C_k is the cluster containing x, n_k is its size, and d is the distance between expression profiles:

$$S = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{n_k} \sum_{x \in C_k} S(x) \right]$$

$$S(x) = \frac{b(x) - a(x)}{\max\left[ a(x), b(x) \right]}$$

$$a(x) = \frac{1}{n_k - 1} \sum_{y \in C_k,\, y \neq x} d(x, y)$$

$$b(x) = \min_{h = 1, \ldots, K,\; h \neq k} \left[ \frac{1}{n_h} \sum_{y \in C_h} d(x, y) \right]$$

[Figure: three example scatter plots, RAS1 (horizontal axis) vs. RAS2 (vertical axis), labeled QI: 0.29, QI: 0.43, and QI: 0.69.]

Experimental Data
Experiment: E-GEOD-15653
Submitter(s): Patti; Lab: Joslin Diabetes Center; Mary Elizabeth Patti.
(Generated description): Experiment with 18 hybridizations, using 18 samples of species [Homo sapiens], using 18 arrays of array design [Affymetrix GeneChip® Human Genome HG-U133A [HG-U133A]], producing 18 raw data files and 18 transformed and/or normalized data files.
(Submitter's description 1): Hepatic lipid accumulation is an important complication of obesity linked to risk for type 2 diabetes. To identify novel transcriptional changes in human liver which could contribute to hepatic lipid accumulation and associated insulin resistance and type 2 diabetes (DM2), we evaluated gene expression and gene set enrichment in surgical liver biopsies from 13 obese (9 with DM2) and 5 control subjects, obtained in the fasting state at the time of elective abdominal surgery for obesity or cholecystectomy. RNA was isolated for cRNA preparation and hybridized to Affymetrix U133A microarrays.
Experiment Overall Design: Human liver samples were obtained from 5 lean control subjects undergoing elective cholecystectomy and 13 obese subjects (with or without Type 2 diabetes) undergoing gastric bypass surgery. Subjects with diabetes were classified as either well-controlled or poorly-controlled.

Experiments
Ideally, we would choose compact and separated clusters of genes by computing the Silhouette Index of Compactness [1,2,3] on every possible subset of the N genes. This approach may take an impractical amount of time, since there are 2^N such sets; therefore we propose a sub-optimal search that limits the computation of the index to the sets provided by the Hierarchical Clustering algorithm, not only at the final stage but at every intermediate step. If there are N genes, there will be N such groupings, the first one with N clusters (subsets) and the last one with only 1 large cluster, making a total of N(N+1)/2 candidate subsets. Because consecutive groupings overlap, only about 2N of these subsets are distinct (the N single genes plus the N - 1 merged clusters), and because of the way the clustering algorithm works, most of them will be compact. The Silhouette index then ensures that the selected groups are also well separated from the others.

Software Implementation
[Figures: microarray data; program's interface for gene selection; performance results.]

Results
In the sample data used for testing, the top selected sets showed consistency, and many of the genes within each group were related by function. The table below shows one of the top sets, with a Silhouette index of 0.94. It consists of two probes for the same gene (GSTM1) and one probe for gene GSTM2; both genes are members of the mu class of enzymes, which functions in the detoxification of electrophilic compounds.

| Probe Set ID | Gene Title | Gene Symbol | Biological Process Term | Molecular Function Term | Cellular Component Term |
| 204550_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm |
| 204418_x_at | glutathione S-transferase mu 2 (muscle) | GSTM2 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm |
| 215333_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm |

Conclusion
The proposed tool may be powerful for biologists and computational biology researchers interested in generating new hypotheses on co-expressed genes, which are not provided by more standard analysis tools.

References
[1] Rousseeuw, Peter J., "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 20 (1), pp. 53-65, 1987.
[2] Pearson, John V. et al., "Identification of the Genetic Basis for Complex Disorders by Use of Pooling-Based Genomewide Single-Nucleotide-Polymorphism Association Studies", The American Journal of Human Genetics, 80, pp. 126-139, 2007.
[3] Jianping Hua, David W. Craig, Marcel Brun, Jennifer Webster, Victoria Zismann, Waibhav Tembe, Keta Joshipura, Matthew J. Huentelman, Edward R. Dougherty, Dietrich A. Stephan, "SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays", Bioinformatics, 23 (1), pp. 57-63, 2007.
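
Illustrative sketch (not the authors' software): the search described under Experiments, restricted to the subsets produced at each merge step of an agglomerative clustering, might look roughly like the Python below. Several choices here are assumptions the poster does not state: Euclidean distance between expression profiles, average linkage, scoring each merged cluster against the partition that exists right after the merge that creates it, and assigning S(x) = 0 to points in single-element clusters. The function names and the toy `profiles` data are hypothetical.

```python
import math
from itertools import combinations


def average_linkage(c1, c2, data):
    """Mean pairwise Euclidean distance between two clusters (assumed linkage)."""
    return sum(math.dist(data[x], data[y]) for x in c1 for y in c2) / (len(c1) * len(c2))


def cluster_silhouette(k, partition, data):
    """Mean of S(x) over cluster k, with b(x) taken over the other clusters."""
    ck = partition[k]
    if len(ck) == 1 or len(partition) == 1:
        return 0.0  # assumed convention: the index is set to 0 when a(x) or b(x) is undefined
    total = 0.0
    for x in ck:
        # a(x): mean distance to the other members of the same cluster
        a = sum(math.dist(data[x], data[y]) for y in ck if y != x) / (len(ck) - 1)
        # b(x): smallest mean distance to any other cluster
        b = min(
            sum(math.dist(data[x], data[y]) for y in ch) / len(ch)
            for h, ch in enumerate(partition) if h != k
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(ck)


def rank_dendrogram_subsets(data):
    """Score every merged cluster of the dendrogram and sort by silhouette, best first."""
    partition = [[i] for i in range(len(data))]
    scored = []
    while len(partition) > 2:  # stop while at least two clusters remain, so b(x) exists
        i, j = min(
            combinations(range(len(partition)), 2),
            key=lambda p: average_linkage(partition[p[0]], partition[p[1]], data),
        )
        merged = partition[i] + partition[j]
        partition = [c for t, c in enumerate(partition) if t not in (i, j)] + [merged]
        score = cluster_silhouette(len(partition) - 1, partition, data)
        scored.append((score, sorted(merged)))
    return sorted(scored, reverse=True)


# Toy expression profiles (hypothetical): genes 0-2 behave alike, genes 3 and 4 do not.
profiles = [
    (1.0, 2.0, 3.0),
    (1.1, 2.0, 3.1),
    (1.0, 2.1, 3.0),
    (8.0, 1.0, 0.0),
    (0.0, 9.0, 5.0),
]

for score, genes in rank_dendrogram_subsets(profiles):
    print(f"{score:.2f}  {genes}")
```

The naive pair search is only meant to make the idea concrete; on real microarray data with thousands of probes, a library implementation of hierarchical clustering would be the practical choice.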