Data driven life sciences  
The Pyramids meet the Tower of Babel 
                           Rajarshi Guha 
            NIH Chemical Genomics Center 

               2010 ACS National Meeting, Boston, MA 
Characteristics 

•  Large sizes (but this is relative) 
   –  Chemistry datasets are not really that big 
•  Multi-dimensional 
•  Multiple sources (and hence, types) 
•  Challenges 
   –  Handling and processing large datasets 
   –  Integrating multiple data types / sources 
   –  Get a coherent story out of it all 
How Useful is More Data? 

      •  Alternatively, can we stop doing science and 
         just do pattern recognition on increasingly 
         large datasets? 
      •  According to Chris Anderson, yes. 
                     There is now a better way. Petabytes allow us to say:
                     "Correlation is enough." We can stop looking for models. We
                     can analyze the data without hypotheses about what it might
                     show. We can throw the numbers into the biggest computing
                     clusters the world has ever seen and let statistical algorithms
                     find patterns where science cannot.



http://www.wired.com/science/discoveries/magazine/16-07/pb_theory 
How Useful is More Data? 

•  The utility of more data is obvious in many 
   scenarios 
  –  Fitting a statistical model to 10 observations is not a 
     good idea 
•  But can there be such a thing as too much 
   data? 
  –  Fitting a statistical model to 10^6 observations may not 
     be a good idea either 
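
The slide does not spell out why a very large sample can be a problem. One common reason, offered here as an assumption rather than as the speaker's argument, is that with enough observations even a negligible correlation becomes statistically significant. The sketch below computes the standard t statistic for a Pearson correlation at two sample sizes; the value r = 0.01 is purely illustrative.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation r differs from zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Illustrative value only: r = 0.01 explains 0.01% of the variance.
r = 0.01
for n in (10, 10**6):
    t = correlation_t_statistic(r, n)
    print(f"n = {n:>9,d}   r = {r}   t = {t:.3f}")
```

At n = 10 the statistic is far below any usual significance threshold, while at n = 10^6 the same tiny correlation gives a statistic well above it, so significance alone says little about whether the model is useful.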
Big Data for Some Problems 

       •  Halevy et al. discuss the effectiveness of 
          extremely large datasets 
       •  Their application focuses on machine 
          translation – see the Google n-gram corpus 
       •  They suggest that such extremely large datasets 
          are useful because they effectively encompass 
          all n-grams (phrases) commonly used 
       •  Domain is relatively constrained 

Halevy et al., IEEE Intelligent Systems, 2009, 24, 8-12 
Google Scale in Chemistry? 

       •  What would be the equivalent of an n‐gram 
          corpus in chemistry? 
              –  Fragments 
              –  A more direct analogy can be made by using LINGOs 
       •  It is possible to generate arbitrarily large (virtual) 
          compound and fragment collections 
       •  But would such a collection span all of 
          “commonly used” chemistry? 
              –  Depending on the initial compound set, yes 
              –  But we’re also interested in going beyond such a 
                 “commonly used” set 

Fink T, Reymond JL, J Chem Inf Model, 2007, 47, 342 
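
To make the n-gram analogy concrete: LINGOs are overlapping fixed-length substrings of a SMILES string, much like character n-grams of a sentence. The following is a minimal sketch, assuming a substring length of q = 4 and skipping the SMILES canonicalization and normalization steps a real implementation would apply. The similarity function is a plain multiset Tanimoto, used here for illustration; it is not claimed to be the published LINGO similarity formula.

```python
from collections import Counter

def lingos(smiles: str, q: int = 4) -> Counter:
    """Count all overlapping substrings of length q in a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))

def lingo_tanimoto(a: str, b: str, q: int = 4) -> float:
    """Multiset Tanimoto: shared substring counts divided by the union of counts."""
    la, lb = lingos(a, q), lingos(b, q)
    shared = sum((la & lb).values())
    total = sum((la | lb).values())
    return shared / total if total else 0.0

# Benzene versus toluene, written as SMILES.
print(lingos("c1ccccc1"))
print(lingo_tanimoto("c1ccccc1", "Cc1ccccc1"))
```

Counting such substrings over a whole compound collection would give the chemical counterpart of an n-gram frequency table.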
Fragment Diversity 

•  Consider a set of bioactives such as the LOPAC 
   collection, 1280 compounds 
•  Using exhaustive fragmentation we get 
   2,460 unique fragments 
•  On the MLSMR (~ 400K compounds), 
   we get 164,583 fragments 

[Figure: histogram of log Fragment Frequency (x-axis) against Percent of Total (y-axis)]
Fragment Diversity 

[Figure: two scatter plots of PC 1 against PC 2, one for all fragments and one for fragments occurring in 5 to 50 molecules]

            •  Distribution of MLSMR fragments in BCUT 
               space 
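
The frequency histogram and the "5 to 50 molecules" subset can be sketched as follows. This assumes the fragmentation step has already been run and produced, for each molecule, a list of fragment SMILES; the input dictionary, the function names and the toy data are all hypothetical.

```python
from collections import Counter

def fragment_frequencies(fragments_by_molecule):
    """Number of molecules each fragment occurs in (counted once per molecule)."""
    counts = Counter()
    for frags in fragments_by_molecule.values():
        counts.update(set(frags))
    return counts

def in_frequency_band(counts, low=5, high=50):
    """Fragments that occur in at least `low` and at most `high` molecules."""
    return {frag for frag, n in counts.items() if low <= n <= high}

def log_frequency_histogram(counts):
    """Percent of fragments in each integer bin of floor(log10(frequency))."""
    # len(str(n)) - 1 equals floor(log10(n)) for positive integers.
    bins = Counter(len(str(n)) - 1 for n in counts.values())
    total = sum(bins.values())
    return {b: 100.0 * k / total for b, k in sorted(bins.items())}

# Toy input: molecule identifier -> fragments found in it (made-up data).
toy = {
    "mol1": ["c1ccccc1", "C(=O)O"],
    "mol2": ["c1ccccc1", "c1ccccc1"],
    "mol3": ["C1CCNCC1"],
}
counts = fragment_frequencies(toy)
print(counts)
print(in_frequency_band(counts, low=2, high=2))
print(log_frequency_histogram(counts))
```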
What Do We Do with Fragments? 

  •  Assuming we obtain fragments from a large 
     enough collection, what do we do? 
         –  Learning from fragments – QSARs, generative 
            models 
         –  Use fragments as filters, alternative 
            to clustering 
         –  Explore chemotypes and activity 


White, D and Wilson, RC, J Chem Inf Model, 2010, ASAP 
Scaffold Activity Diagrams 

•  Network oriented view of fragment (scaffold) 
   collections 
  –  Similar in idea to 
     Scaffold Hunter etc. 
  –  Not purely hierarchical 
•  Color by arbitrary 
   properties 
•  Quickly assess utility 
   of a scaffold 
•  Try it online 
What Makes a Good Scaffold? 
•  What makes a good 
   scaffold? 
  –  Size, complexity, … 
  –  Do the members 
     represent an SAR or not? 
  –  Intuition and experience 
     also play a role 
Scaffold QSAR 

•  Evaluate topological and physicochemical 
   descriptors for the R-groups 
•  Fit PLS or ridge regression model 
•  Characterize the SAR landscape 

[Figure: scatter plot of Observed (x-axis) against Predicted (y-axis) values, both axes running from -8 to 0]
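
As an illustration of the ridge regression option, here is a minimal pure-Python sketch that solves the penalized normal equations (X^T X + lambda I) w = X^T y. The descriptor matrix, the activities and the penalty value are made-up placeholders, and no intercept is fitted; this is not the speaker's implementation.

```python
def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam*I) w = X^T y by Gaussian elimination.

    X holds one row of R-group descriptors per scaffold member and y the
    activities. Centre X and y beforehand if an intercept is needed.
    """
    n, p = len(X), len(X[0])
    # Augmented system [X^T X + lam*I | X^T y].
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
          for j in range(p)] + [sum(X[k][i] * y[k] for k in range(n))]
         for i in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    w = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, p))
        w[i] = (A[i][p] - tail) / A[i][i]
    return w

def predict(X, w):
    """Predicted activity for each descriptor row."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]

# Hypothetical example with more descriptors (3) than members (2).
X = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
y = [-4.0, -6.0]
w = ridge_fit(X, y, lam=0.5)
print(w, predict(X, w))
```

With a positive penalty the system stays solvable even when there are more descriptors than scaffold members, which is the usual reason for preferring ridge (or PLS) over ordinary least squares in this setting.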
Scaffold QSAR ‐ Drawbacks 

•  Many scaffolds have few (5 to 10) members 
•  Invariably, more features than observations 
•  If the number of R-groups is large, the feature 
   matrix can be very sparse 
  –  Less of a problem for combinatorial libraries 
•  A linear fit may not be the best approach to 
   correlating R-groups to the activities 
  –  Difficult to choose a model type a priori 
•  Still working on it … 
Fragments for Automation 
•  What is the motivation for scaffold QSAR? 
•  Automate a high throughput screen 
•  Try and develop heuristics 
   to automatically push 
   chemotypes into secondary 
   screening 
Big Data and Chemistry  

       •  But in the end, the fundamental problem with 
          big data is the issue of domain applicability 
       •  Traditional models are developed on small 
          datasets and perform well within the training 
          domain 
       •  But models trained on very large datasets will 
          not necessarily perform well, even though the 
          training domain is now much larger 


Helgee et al, J Chem Inf Model, 2010, 50, 677‐689 
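
One simple way to make "domain applicability" operational is to ask how similar a query compound is to its nearest neighbour in the training set. The sketch below does this with Tanimoto similarity over feature sets. It is an illustrative assumption, not the method of the cited paper: the feature sets, the threshold of 0.5 and the function names are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of structural features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_training_similarity(query, training):
    """Similarity of the query to its most similar training compound."""
    return max((tanimoto(query, t) for t in training), default=0.0)

def in_domain(query, training, threshold=0.5):
    """Treat a prediction as in-domain only if a close training neighbour exists."""
    return nearest_training_similarity(query, training) >= threshold

# Made-up feature sets standing in for fingerprints.
training = [{"f1", "f2", "f3"}, {"f2", "f4"}]
print(in_domain({"f1", "f2", "f3", "f5"}, training))
print(in_domain({"f8", "f9"}, training))
```

A large training set widens the region covered by such a check, but it does not guarantee that the model is accurate everywhere inside that region.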
Processing Large Datasets 

•  Most cheminformatics tasks are not 
   algorithmically parallel 
•  Rather, they are applied to large numbers of 
   inputs and hence embarrassingly parallel 
   –  Start up lots of jobs 
•  Hadoop is a useful technology for those problems 
   that follow the map/reduce paradigm 
   –  Not aware of cheminformatics methods that work in 
      this manner 
   –  But can also be used like a job submission system 
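
The "start up lots of jobs" pattern can be sketched with the Python standard library: split the inputs into chunks, run the same function on every chunk independently, and concatenate the results. The per-molecule function below is a hypothetical stand-in for a real calculation, and a thread pool is used only to keep the example self-contained; for CPU-bound work one would more likely use a process pool or a cluster scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_letters(smiles):
    """Hypothetical per-molecule calculation: count the letters in a SMILES string."""
    return sum(ch.isalpha() for ch in smiles)

def process_chunk(chunk):
    """One independent job: apply the calculation to every molecule in the chunk."""
    return [count_letters(s) for s in chunk]

def run_jobs(inputs, chunk_size=2, max_workers=4):
    """Run one job per chunk and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(process_chunk, chunks(inputs, chunk_size)))
    return [value for result in per_chunk for value in result]

print(run_jobs(["CCO", "c1ccccc1", "Cl", "CC(=O)O"]))
```

No job depends on the output of another, which is what makes the workload embarrassingly parallel rather than algorithmically parallel.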
Common HTS Analysis Tasks 
•  Analysis of Activity 
  –    Concentration response across multiple phenotypes, multiple assays 
  –    Assay interference (differentiating activity from artifacts) 
  –    Assay ontology (biological relationships, assay platforms) 
  –    Compound annotations, known ligand-target network, prior art assessment 
  –    Profile data (PubChem, BindingDB, ChEMBL, PDSP, etc., physical properties) 

•  Identification of Series and Singletons 
  –  Clustering of actives, identification of top scaffolds 
  –  Profiling of series across all assays 
  –  Series and singleton prioritization 

•  Compound Selection for Followup 
  –  Assessment of structure activity relationships 
  –  Rapid identification of key compounds to confirm, new compounds to test 
  –  Mining of commercially available chemical libraries 

How do we better automate such tasks? 
A Smorgasbord of Data 
Data Integration 

•  It’s nice to simplify data, but we can still be faced 
   with a multitude of data types 
•  We want to explore these data in a linked fashion 
•  How we explore and what we explore is generally 
   influenced by the task at hand 
•  At some point, make inferences over all the data 
Data Integration 
User’s Network 
Network of Public Data 

Content: 
    - Drugs 
    - Compounds 
    - Scaffolds 
    - Assays 
    - Genes 
    - Targets 
    - Pathways 
    - Diseases 
    - Clinical Trials 
    - Documents 

Links: 
    - Manually curated 
    - Derived from algorithms 
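
A linked view of this kind can be modelled as a graph whose nodes carry a content type and whose links record where they came from. The sketch below is a minimal assumption-laden illustration, not the actual system shown in the talk: the class, the method names, the source labels and the entity names are all hypothetical.

```python
from collections import defaultdict, deque

class EntityNetwork:
    """Typed nodes (drug, assay, target, ...) joined by links with a provenance label."""

    def __init__(self):
        self.node_type = {}
        self.links = defaultdict(list)  # node -> [(neighbour, source)]

    def add_node(self, name, kind):
        self.node_type[name] = kind

    def add_link(self, a, b, source):
        """source is 'curated' or 'algorithm' in this sketch."""
        self.links[a].append((b, source))
        self.links[b].append((a, source))

    def reachable(self, start, kind, curated_only=False):
        """Breadth-first search for all nodes of a given type linked to `start`."""
        seen, queue, found = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for neighbour, source in self.links[node]:
                if curated_only and source != "curated":
                    continue
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
                    if self.node_type.get(neighbour) == kind:
                        found.append(neighbour)
        return found

# Placeholder entities, not real data.
net = EntityNetwork()
for name, kind in [("compound_A", "Compound"), ("target_X", "Target"),
                   ("disease_Y", "Disease")]:
    net.add_node(name, kind)
net.add_link("compound_A", "target_X", "curated")
net.add_link("target_X", "disease_Y", "algorithm")
print(net.reachable("compound_A", "Disease"))
print(net.reachable("compound_A", "Disease", curated_only=True))
```

Keeping the provenance on each link lets a user restrict exploration to manually curated relationships when an inferred link would be too speculative.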
Record View of an Assay 
Access Disease Hierarchy & Network 
Articles, Patents, Drug Labels, … 
Going Beyond Exploration? 

       •  Simply being able to explore data in an 
          integrated manner is useful as an idea 
          generator 
       •  Can we integrate heterogeneous data types & 
          sources to get a systems level view? 
               –  Current research problem in genomics and 
                  systems biology 
               –  Some attempts have been made to merge 
                  chemical data with other data types 

Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59‐68 
RNAi & Compound Screens 

•  Reuse pre-existing MLI data 
•  Develop new annotated libraries 
•  Run parallel RNAi screen 

•  What targets mediate activity of siRNA and compound 
•  Pathway elucidation, identification of interactions 
•  Target ID and validation 
•  Link RNAi generated pathway perturbations to small molecule 
   activities. Could provide insight into polypharmacology 

[Figure: screening workflow, with example siRNA sequences CAGCATGAGTACTACAGGCCA and TACGGGAACTACCATAATTTA]

Goal: Develop systems level view of small molecule activity 
Small Molecule HTS Summary 

         •  2,899 FDA-approved 
            compounds screened 
         •  55 compounds retested active 
         •  Which components of the NF-κB 
            pathway do they hit? 
                  –  17 molecules have target/pathway 
                     information in GeneGO 
                  –  Literature searches list a few 
                     more 

[Figure: Most Potent Actives. Concentration-response curves (Activity against log Concentration (uM)) for Proscillaridin A, Trabectedin and Digoxin]

Miller, S.C. et al, Biochem. Pharmacol., 2010, ASAP 
RNAi HTS Summary 

•  Qiagen HDG library – 6886 genes, 4 siRNAs 
   per gene 
•  A total of 567 genes were knocked 
   down by 1 or more siRNAs 
  –  We consider a gene knocked down by >= 2 siRNAs a “reliable” hit 
  –  16 reliable hits 
  –  Added in 66 genes for 
     follow up via triage procedure 
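
The ">= 2 siRNAs" rule can be sketched as a simple count per gene. The input format (one pair of gene and siRNA identifier for every siRNA scored as active) and the gene names are hypothetical; how a single siRNA is scored as active is outside this sketch.

```python
from collections import Counter

def reliable_hits(active_sirnas, min_sirnas=2):
    """Genes knocked down by at least `min_sirnas` distinct siRNAs.

    active_sirnas is an iterable of (gene, sirna_id) pairs, one per active siRNA.
    """
    # set() removes duplicate records of the same siRNA before counting.
    per_gene = Counter(gene for gene, _ in set(active_sirnas))
    return sorted(gene for gene, n in per_gene.items() if n >= min_sirnas)

# Placeholder records, not screening data.
records = [
    ("GENE_A", "si1"), ("GENE_A", "si2"), ("GENE_A", "si2"),
    ("GENE_B", "si1"),
    ("GENE_C", "si3"), ("GENE_C", "si4"), ("GENE_C", "si1"),
]
print(reliable_hits(records))
```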
RNAi & Small Molecule 

•  Based on reporter assays, the only conclusions 
   one can draw are the obvious ones 
•  Limited by 1‐D signal 
•  Going to high content gives us much richer 
   data, but more complexity 
  –  Shown to be useful for compounds 
  –  Much more difficult when the phenotypic 
     parameters come from different systems 
Summary 

•  Multiple data types are probably the most 
   challenging aspect of data driven discovery 
•  Size issues can be addressed with more 
   hardware or waiting (a bit) longer 
•  Integration issues require new approaches 
   both at the presentation & algorithmic levels 
Acknowledgements 

•    Ruili Huang 
•    Ajit Jadhav 
•    Trung Ngyuen 
•    Noel Southall 
Job Openings at NCGC/NCTT 

•  Software development (focusing on Tripod) 
     –  Java, Swing UI, algorithms 
•  Research Informatics Scientist   
     –  Generalist, cheminformatics, comp chem, med 
        chem 
•    Collaborate with chemists, biologists 
•    Cutting edge problems 
•    Lots of fresh data 
•    Fun! 

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 

Data driven life sciences: The Pyramids meet the Tower of Babel

  • 1. Data driven life sciences: The Pyramids meet the Tower of Babel. Rajarshi Guha, NIH Chemical Genomics Center. 2010 ACS National Meeting, Boston, MA
  • 2. Characteristics
      •  Large sizes (but this is relative)
         –  Chemistry datasets are not really that big
      •  Multi-dimensional
      •  Multiple sources (and hence, types)
      •  Challenges
         –  Handling and processing large datasets
         –  Integrating multiple data types / sources
         –  Get a coherent story out of it all
  • 3. How Useful is More Data?
      •  Alternatively, can we stop doing science and just do pattern recognition on increasingly large datasets?
      •  According to Chris Anderson, yes.
         There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
         http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
  • 4. How Useful is More Data?
      •  The utility of more data is obvious in many scenarios
         –  Fitting statistical models on 10 observations is not a good idea
      •  But can there be such a thing as too much data?
         –  Fitting statistical models on 10^6 observations may not be a good idea
  • 5. Big Data for Some Problems
      •  Halevy et al discuss the effectiveness of extremely large datasets
      •  Their application focuses on machine translation (see the Google n-gram corpus)
      •  They suggest that such extremely large datasets are useful because they effectively encompass all n-grams (phrases) commonly used
      •  Domain is relatively constrained
      Halevy et al, IEEE Intelligent Systems, 2009, 24, 8-12
  • 6. Google Scale in Chemistry?
      •  What would be the equivalent of an n-gram corpus in chemistry?
         –  Fragments
         –  A more direct analogy can be made by using LINGOs
      •  It is possible to generate arbitrarily large (virtual) compound and fragment collections
      •  But would such a collection span all of "commonly used" chemistry?
         –  Depending on the initial compound set, yes
         –  But we're also interested in going beyond such a "commonly used" set
      Fink T, Reymond JL, J Chem Inf Model, 2007, 47, 342
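
The n-gram analogy on this slide can be made concrete. The following is a minimal sketch, assuming LINGOs are taken as the overlapping q-character substrings of a SMILES string (q = 4 is the commonly used value). The function names are illustrative, the ring-closure digit normalization of the published method is left out, and the similarity shown is a generic multiset Tanimoto swapped in for illustration, not necessarily the measure used in the talk.

```python
from collections import Counter


def lingos(smiles: str, q: int = 4) -> Counter:
    """Count the overlapping q-character substrings of a SMILES string.

    Simplified: no canonicalization or ring-digit normalization is done here.
    """
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def tanimoto(a: Counter, b: Counter) -> float:
    """Multiset Tanimoto: sum of minimum counts over sum of maximum counts."""
    shared = sum((a & b).values())  # Counter '&' keeps the minimum count
    total = sum((a | b).values())   # Counter '|' keeps the maximum count
    return shared / total if total else 0.0


if __name__ == "__main__":
    phenol = lingos("c1ccccc1O")
    toluene = lingos("c1ccccc1C")
    print(sorted(phenol))
    print(tanimoto(phenol, toluene))
```

Treating a SMILES string as text in this way is what makes the comparison with an n-gram corpus direct: a compound collection becomes a table of substring counts.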
  • 7. Fragment Diversity
      •  Consider a set of bioactives such as the LOPAC collection, 1280 compounds
      •  Using exhaustive fragmentation we get 2,460 unique fragments
      •  On the MLSMR (~400K compounds), we get 164,583 fragments
      [Figure: histogram of percent of total vs. log fragment frequency]
  • 8. Fragment Diversity
      •  Distribution of MLSMR fragments in BCUT space
      [Figure: two scatter plots of PC 1 vs. PC 2; panels: "All fragments" and "Fragments occurring in 5 to 50 molecules"]
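
The frequency view on these two slides can be sketched as follows. This is only an illustration under stated assumptions: fragmentation itself is not shown, the input is assumed to be a mapping from a molecule identifier to its set of fragment strings, and all function names are hypothetical.

```python
from collections import Counter


def fragment_frequency(mol_fragments):
    """Return a Counter mapping each fragment to the number of molecules containing it."""
    freq = Counter()
    for frags in mol_fragments.values():
        freq.update(set(frags))  # count each fragment once per molecule
    return freq


def in_frequency_range(freq, lo=5, hi=50):
    """Fragments occurring in between lo and hi molecules (inclusive)."""
    return {frag for frag, n in freq.items() if lo <= n <= hi}


def log_frequency_histogram(freq):
    """Percent of fragments in each integer bin of log10(frequency)."""
    # len(str(n)) - 1 equals floor(log10(n)) for positive integers
    bins = Counter(len(str(n)) - 1 for n in freq.values())
    total = sum(bins.values())
    return {b: 100.0 * c / total for b, c in sorted(bins.items())}
```

The 5 to 50 defaults mirror the range named in the second panel; any other cutoffs can be passed in.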
  • 9. What Do We Do with Fragments?
      •  Assuming we obtain fragments from a large enough collection, what do we do?
         –  Learning from fragments: QSARs, generative models
         –  Use fragments as filters, alternative to clustering
         –  Explore chemotypes and activity
      White, D and Wilson, RC, J Chem Inf Model, 2010, ASAP
  • 10. Scaffold Activity Diagrams
      •  Network oriented view of fragment (scaffold) collections
         –  Similar in idea to Scaffold Hunter etc
         –  Not purely hierarchical
      •  Color by arbitrary properties
      •  Quickly assess utility of a scaffold
      •  Try it online
  • 11. What Makes a Good Scaffold?
      •  What makes a good scaffold?
         –  Size, complexity, …
         –  Do the members represent an SAR or not?
         –  Intuition and experience also play a role
  • 12. Scaffold QSAR
      •  Evaluate topological and physicochemical descriptors for the R-groups
      •  Fit PLS or ridge regression model
      •  Characterize the SAR landscape
      [Figure: scatter plot of predicted vs. observed activity]
  • 13. Scaffold QSAR - Drawbacks
      •  Many scaffolds have few (5 to 10) members
      •  Invariably, more features than observations
      •  If the number of R-groups is large, the feature matrix can be very sparse
         –  Less of a problem for combinatorial libraries
      •  A linear fit may not be the best approach to correlating R-groups to the activities
         –  Difficult to choose a model type a priori
      •  Still working on it …
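
To illustrate the "more features than observations" situation, here is a minimal sketch of ridge regression in its dual form, which only needs to solve a system the size of the number of scaffold members. It rests on several assumptions not taken from the slides: R-groups are encoded as one-hot indicator features, no intercept is fitted, and the solver is a plain Gaussian elimination written out so the example has no dependencies. A numerical library would be the practical choice.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, lam=1.0):
    """Ridge weights via the dual form w = X^T (X X^T + lam I)^-1 y."""
    n, p = len(X), len(X[0])
    # Gram matrix is n x n, so it stays small when a scaffold has few members
    K = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += lam
    alpha = solve(K, y)
    return [sum(alpha[i] * X[i][k] for i in range(n)) for k in range(p)]


def predict(w, x):
    """Predicted activity for one feature vector."""
    return sum(wk * xk for wk, xk in zip(w, x))


if __name__ == "__main__":
    # Three scaffold members, four hypothetical R-group indicator features
    X = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
    y = [1.0, 2.0, 3.0]
    print(ridge_fit(X, y, lam=1.0))
```

With three members and four features an ordinary least-squares fit is underdetermined; the penalty `lam` is what makes the system solvable, at the cost of shrinking the weights.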
  • 14. Fragments for Automation
      •  What is the motivation for scaffold QSAR?
      •  Automate a high throughput screen
      •  Try and develop heuristics to automatically push chemotypes into secondary screening
  • 15. Big Data and Chemistry
      •  But in the end, the fundamental problem with big data is the issue of domain applicability
      •  Traditional models are developed on small datasets and perform well within the training domain
      •  But models trained on very large datasets will not necessarily perform well, even though the training domain is now much larger
      Helgee et al, J Chem Inf Model, 2010, 50, 677-689
  • 16. Processing Large Datasets
      •  Most cheminformatics tasks are not algorithmically parallel
      •  Rather, they are applied to large numbers of inputs and hence embarrassingly parallel
         –  Start up lots of jobs
      •  Hadoop is useful technology for those problems that follow the map/reduce paradigm
         –  Not aware of cheminformatics methods that work in this manner
         –  But can also be used like a job submission system
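
The "embarrassingly parallel" pattern amounts to applying one independent per-molecule function to many inputs. A minimal sketch using Python's standard `concurrent.futures` follows; the per-molecule task is a deliberately crude placeholder (it counts uppercase letters in a SMILES string) standing in for a real descriptor calculation, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def crude_atom_count(smiles: str) -> int:
    """Placeholder per-molecule task: count uppercase letters in a SMILES string."""
    return sum(ch.isupper() for ch in smiles)


def run_all(smiles_list, task=crude_atom_count, workers=4):
    """Apply task to every input independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, smiles_list))


if __name__ == "__main__":
    print(run_all(["CCO", "CC", "c1ccccc1O"]))
```

Threads are used here only to keep the sketch simple. For CPU-bound work, `ProcessPoolExecutor` offers the same `map` interface, and a cluster job system plays the same role at larger scale.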
  • 17. Common HTS Analysis Tasks
      •  Analysis of Activity
         –  Concentration response across multiple phenotypes, multiple assays
         –  Assay interference (differentiating activity from artifacts)
         –  Assay ontology (biological relationships, assay platforms)
         –  Compound annotations, known ligand-target network, prior art assessment
         –  Profile data (PubChem, BindingDB, ChEMBL, PDSP, etc, physical properties)
      •  Identification of Series and Singletons
         –  Clustering of actives, identification of top scaffolds
         –  Profiling of series across all assays
         –  Series and singleton prioritization
      •  Compound Selection for Followup
         –  Assessment of structure activity relationships
         –  Rapid identification of key compounds to confirm, new compounds to test
         –  Mining of commercially available chemical libraries
      How do we better automate such tasks?
  • 19. Data Integration
      •  It's nice to simplify data, but we can still be faced with a multitude of data types
      •  We want to explore these data in a linked fashion
      •  How we explore and what we explore is generally influenced by the task at hand
      •  At some point, make inferences over all the data
  • 20. Data Integration
      •  Content: Drugs, Compounds, Scaffolds, Assays, Genes, Targets, Pathways, Diseases, Clinical Trials, Documents
      •  Links:
         –  Manually curated
         –  Derived from algorithms
      [Diagram labels: User's Network, Network of Public Data]
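
One way to picture such a network is as a graph of typed entities whose links carry their provenance. The sketch below is an assumption-laden illustration, not the system described in the talk: nodes are `(type, id)` tuples, every identifier is a made-up placeholder, and the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class EntityNetwork:
    """Undirected graph of typed entities; each link records its provenance."""

    SOURCES = ("curated", "derived")

    def __init__(self):
        self._adj = defaultdict(list)

    def add_link(self, a, b, source):
        """Link two (type, id) nodes; source is 'curated' or 'derived'."""
        if source not in self.SOURCES:
            raise ValueError(f"unknown link source: {source}")
        self._adj[a].append((b, source))
        self._adj[b].append((a, source))

    def reachable(self, start, kind, allowed=SOURCES):
        """Breadth-first search for nodes of type `kind` reachable from `start`
        using only links whose provenance is in `allowed`."""
        seen, queue, found = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for nxt, source in self._adj.get(node, ()):
                if source in allowed and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    if nxt[0] == kind:
                        found.append(nxt)
        return found


if __name__ == "__main__":
    net = EntityNetwork()
    net.add_link(("compound", "C1"), ("target", "T1"), "curated")
    net.add_link(("target", "T1"), ("pathway", "P1"), "derived")
    print(net.reachable(("compound", "C1"), "pathway"))
```

Keeping the provenance on each link lets a query be restricted to manually curated links only, or widened to include algorithmically derived ones.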
  • 24. Going Beyond Exploration?
      •  Simply being able to explore data in an integrated manner is useful as an idea generator
      •  Can we integrate heterogeneous data types & sources to get a systems level view?
         –  Current research problem in genomics and systems biology
         –  Some attempts have been made to merge chemical data with other data types
      Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68
  • 25. RNAi & Compound Screens
      •  What targets mediate activity of siRNA and compound
         –  Reuse pre-existing MLI data
         –  Develop new annotated libraries
      •  Pathway elucidation, identification of interactions
      •  Target ID and validation
         –  Run parallel RNAi screen
      •  Link RNAi generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology
      Goal: Develop systems level view of small molecule activity
      [Figure text: CAGCATGAGTACTACAGGCCA, TACGGGAACTACCATAATTTA]
  • 26. Small Molecule HTS Summary
      •  2,899 FDA-approved compounds screened
      •  55 compounds retested active
      •  Which components of the NF-κB pathway do they hit?
         –  17 molecules have target/pathway information in GeneGO
         –  Literature searches list a few more
      •  Most Potent Actives: Proscillaridin A, Trabectedin, Digoxin
      [Figure: concentration-response curves for the three compounds; Activity vs. log Concentration (uM)]
      Miller, S.C. et al, Biochem. Pharmacol., 2010, ASAP
  • 27. RNAi HTS Summary
      •  Qiagen HDG library: 6886 genes, 4 siRNAs per gene
      •  A total of 567 genes were knocked down by 1 or more siRNAs
         –  We consider >= 2 as a "reliable" hit
         –  16 reliable hits
         –  Added in 66 genes for follow up via triage procedure
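
The ">= 2 siRNAs" rule for a reliable hit can be written down directly. This is a minimal sketch, assuming the screen results are already reduced to (gene, siRNA id) pairs that scored as active; the function name and the identifiers in the example are placeholders, and the triage procedure mentioned on the slide is not modeled.

```python
from collections import Counter


def reliable_hits(active_sirnas, min_sirnas=2):
    """Genes knocked down by at least min_sirnas distinct siRNAs.

    active_sirnas: iterable of (gene, sirna_id) pairs that scored as active.
    """
    # set() removes repeated measurements of the same siRNA
    per_gene = Counter(gene for gene, _ in set(active_sirnas))
    return sorted(g for g, n in per_gene.items() if n >= min_sirnas)


if __name__ == "__main__":
    pairs = [("G1", "s1"), ("G1", "s2"), ("G2", "s1"), ("G3", "s1"), ("G3", "s1")]
    print(reliable_hits(pairs))
```

Requiring agreement between independent siRNAs for the same gene is a common guard against off-target effects of any single siRNA.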
  • 28. RNAi & Small Molecule
      •  Based on reporter assays, the only conclusions one can draw are the obvious ones
      •  Limited by 1-D signal
      •  Going to high content gives us much richer data, but more complexity
         –  Shown to be useful for compounds
         –  Much more difficult when the phenotypic parameters come from different systems
  • 29. Summary
      •  Multiple data types are probably the most challenging aspect of data driven discovery
      •  Size issues can be addressed with more hardware or waiting (a bit) longer
      •  Integration issues require new approaches both at the presentation & algorithmic levels
  • 30. Acknowledgements
      •  Ruili Huang
      •  Ajit Jadhav
      •  Trung Nguyen
      •  Noel Southall
  • 31. Job Openings at NCGC/NCTT
      •  Software development (focusing on Tripod)
         –  Java, Swing UI, algorithms
      •  Research Informatics Scientist
         –  Generalist, cheminformatics, comp chem, med chem
      •  Collaborate with chemists, biologists
      •  Cutting edge problems
      •  Lots of fresh data
      •  Fun!