This document discusses challenges and opportunities in analyzing large and diverse datasets in life sciences. It notes that while life sciences datasets are large, they are still relatively small compared to other domains. Integrating multiple data types and sources from different studies presents challenges in obtaining a coherent understanding. Large datasets can be useful for statistical modeling and pattern recognition, but may not provide insights into underlying mechanisms. The document also discusses using fragment-based approaches and scaffold analysis to explore structure-activity relationships in large compound collections. Overall, the key point is that while large datasets enable new analyses, traditional hypothesis-driven science is still needed to understand biological systems.
2. Characteristics
• Large sizes (but this is relative)
– Chemistry datasets are not really that big
• Multi-dimensional
• Multiple sources (and hence, types)
• Challenges
– Handling and processing large datasets
– Integrating multiple data types / sources
– Getting a coherent story out of it all
3. How Useful is More Data?
• Alternatively, can we stop doing science and just do pattern recognition on increasingly large datasets?
• According to Chris Anderson, yes.
There is now a better way. Petabytes allow us to say:
"Correlation is enough." We can stop looking for models. We
can analyze the data without hypotheses about what it might
show. We can throw the numbers into the biggest computing
clusters the world has ever seen and let statistical algorithms
find patterns where science cannot.
http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
5. Big Data for Some Problems
• Halevy et al discuss the effectiveness of extremely large datasets
• Their application focuses on machine translation – see the Google n-gram corpus
• They suggest that such extremely large datasets are useful because they effectively encompass all n-grams (phrases) commonly used
• Domain is relatively constrained
Halevy et al, IEEE Intelligent Systems, 2009, 24, 8-12
6. Google Scale in Chemistry?
• What would be the equivalent of an n-gram corpus in chemistry?
– Fragments
– A more direct analogy can be made by using LINGOs
• It is possible to generate arbitrarily large (virtual) compound and fragment collections
• But would such a collection span all of "commonly used" chemistry?
– Depending on the initial compound set, yes
– But we're also interested in going beyond such a "commonly used" set
Fink T, Reymond JL, J Chem Inf Model, 2007, 47, 342
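To make the n-gram analogy concrete: a LINGO is an overlapping fixed-length substring of a SMILES string. A minimal sketch, assuming length-4 substrings with ring-closure digits normalized to 0 (the function names are ours; the published LINGO method uses multiset counts and further normalizations):

```python
import re

def lingos(smiles, q=4):
    """Extract the set of length-q LINGOs (overlapping SMILES substrings).
    Ring-closure digits are normalized to '0' before substring extraction."""
    s = re.sub(r"\d", "0", smiles)  # normalize ring-closure numbers
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def lingo_similarity(a, b):
    """Tanimoto-style similarity over the two LINGO sets,
    a common way such string fragments are put to use."""
    la, lb = lingos(a), lingos(b)
    return len(la & lb) / len(la | lb) if la | lb else 0.0
```

Because the "corpus" is just substrings, arbitrarily large virtual collections can be profiled this way without any 2D structure processing.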
7. Fragment Diversity
• Consider a set of bioactives such as the LOPAC collection, 1280 compounds
• Using exhaustive fragmentation we get 2,460 unique fragments
• On the MLSMR (~ 400K compounds), we get 164,583 fragments
[Figure: histogram of percent of total fragments vs. log fragment frequency]
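The frequency profile behind numbers like these is a simple tally. A minimal sketch, assuming the per-compound fragment lists come from an exhaustive fragmentation routine in a cheminformatics toolkit (e.g. RDKit's BRICS decomposition); all function names here are ours:

```python
from collections import Counter
import math

def fragment_frequencies(fragment_lists):
    """Tally how many compounds each fragment occurs in,
    given one iterable of fragments per compound."""
    counts = Counter()
    for frags in fragment_lists:
        counts.update(set(frags))  # count each fragment once per compound
    return counts

def log_frequency_histogram(counts):
    """Bin fragments by the integer part of log10(frequency) and
    report percent of total, mirroring the plot on this slide."""
    hist = Counter(int(math.log10(c)) for c in counts.values())
    total = sum(hist.values())
    return {b: 100.0 * n / total for b, n in sorted(hist.items())}
```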
8. Fragment Diversity
[Figure: two PCA scatter plots (PC 1 vs. PC 2) – all fragments, and fragments occurring in 5 to 50 molecules]
• Distribution of MLSMR fragments in BCUT space
9. What Do We Do with Fragments?
• Assuming we obtain fragments from a large enough collection, what do we do?
– Learning from fragments – QSARs, generative models
– Use fragments as filters, an alternative to clustering
– Explore chemotypes and activity
White, D and Wilson, RC, J Chem Inf Model, 2010, ASAP
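Using fragments as filters can be as simple as bucketing compounds by the fragments they contain, which gives chemotype-like groups without running a clustering algorithm. A minimal sketch with illustrative identifiers:

```python
from collections import defaultdict

def group_by_fragment(compound_fragments):
    """Bucket compound ids by each fragment they contain.
    `compound_fragments` maps compound id -> set of fragment strings."""
    buckets = defaultdict(set)
    for cid, frags in compound_fragments.items():
        for f in frags:
            buckets[f].add(cid)
    return buckets
```

Each bucket can then be profiled for activity, giving a crude fragment-level view of chemotypes.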
15. Big Data and Chemistry
• But in the end, the fundamental problem with big data is the issue of domain applicability
• Traditional models are developed on small datasets and perform well within the training domain
• But models trained on very large datasets will not necessarily perform well, even though the training domain is now much larger
Helgee et al, J Chem Inf Model, 2010, 50, 677-689
16. Processing Large Datasets
• Most cheminformatics tasks are not algorithmically parallel
• Rather, they are applied to large numbers of inputs and hence are embarrassingly parallel
– Start up lots of jobs
• Hadoop is a useful technology for those problems that follow the map/reduce paradigm
– We are not aware of cheminformatics methods that work in this manner
– But it can also be used as a job submission system
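The "lots of jobs" pattern needs no framework at all on a single machine. A minimal sketch of the embarrassingly parallel case, with a toy per-molecule task standing in for a real descriptor calculation (the function names are ours):

```python
from multiprocessing import Pool

def heavy_atom_count(smiles):
    """Toy per-molecule task: count heavy-atom symbols in a SMILES string.
    A real workload would be a descriptor or fingerprint calculation
    from a cheminformatics toolkit."""
    return sum(1 for ch in smiles if ch.upper() in "BCNOPSFI")

def process_collection(smiles_list, workers=2):
    """Apply an independent per-input task across a collection; since the
    jobs need no coordination, a plain Pool.map is enough."""
    with Pool(workers) as pool:
        return pool.map(heavy_atom_count, smiles_list)
```

For collections too large for one machine, the same map step is what a Hadoop-style system distributes; the reduce step is often just concatenation.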
17. Common HTS Analysis Tasks
• Analysis of Activity
– Concentration response across multiple phenotypes, multiple assays
– Assay interference (differentiating activity from artifacts)
– Assay ontology (biological relationships, assay platforms)
– Compound annotations, known ligand-target network, prior art assessment
– Profile data (PubChem, BindingDB, ChEMBL, PDSP, etc, physical properties)
• Identification of Series and Singletons
– Clustering of actives, identification of top scaffolds
– Profiling of series across all assays
– Series and singleton prioritization
• Compound Selection for Followup
– Assessment of structure-activity relationships
– Rapid identification of key compounds to confirm, new compounds to test
– Mining of commercially available chemical libraries
How do we better automate such tasks?
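Some of these tasks automate readily. A minimal sketch of "identification of top scaffolds", assuming the scaffold strings are precomputed (e.g. Bemis-Murcko frameworks from a cheminformatics toolkit; here they are plain labels of our own invention):

```python
from collections import Counter

def top_scaffolds(scaffold_per_active, n=3):
    """Rank scaffolds by how many confirmed actives contain them,
    a first pass at series identification; count-1 scaffolds are
    the singleton candidates."""
    return Counter(scaffold_per_active).most_common(n)
```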
19. Data Integration
• It's nice to simplify data, but we can still be faced with a multitude of data types
• We want to explore these data in a linked fashion
• How we explore and what we explore is generally influenced by the task at hand
• At some point, we want to make inferences over all the data
20. Data Integra9on
User’s Network
Content:
‐ Drugs
‐ Compounds
‐ Scaffolds
‐ Assays
‐ Genes
‐ Targets
‐ Pathways
‐ Diseases
‐ Clinical Trials
‐ Documents
Links:
Network of Public Data ‐Manually curated
‐Derived from algorithms
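Linked exploration over such a network reduces to walking typed edges. A minimal sketch, with illustrative entity names and a provenance label per link to mirror the curated/derived split (the class and example data are ours):

```python
from collections import defaultdict

class EntityNetwork:
    """Tiny linked-entity store: nodes are (type, id) pairs,
    edges carry a provenance label."""
    def __init__(self):
        self.edges = defaultdict(list)

    def link(self, a, b, provenance):
        self.edges[a].append((b, provenance))
        self.edges[b].append((a, provenance))

    def neighbors(self, node, entity_type=None):
        """Follow links from a node, optionally restricted to one
        entity type (e.g. only 'pathway' nodes)."""
        return [n for n, _ in self.edges[node]
                if entity_type is None or n[0] == entity_type]

net = EntityNetwork()
net.link(("compound", "CHEMBL25"), ("target", "COX-1"), "manually curated")
net.link(("target", "COX-1"),
         ("pathway", "prostaglandin synthesis"), "derived from algorithms")
```

Task-specific exploration then becomes choosing which edge types and provenances to follow.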
24. Going Beyond Exploration?
• Simply being able to explore data in an integrated manner is useful as an idea generator
• Can we integrate heterogeneous data types & sources to get a systems-level view?
– Current research problem in genomics and systems biology
– Some attempts have been made to merge chemical data with other data types
Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68
25. RNAi & Compound Screens
What targets mediate ac;vity of
siRNA and compound
Pathway elucida;on, iden;fica;on
• Reuse pre‐exis;ng MLI data of interac;ons
• Develop new annotated libraries
CAGCATGAGTACTACAGGCCA
TACGGGAACTACCATAATTTA
Target ID and valida;on
Link RNAi generated pathway
peturba;ons to small molecule
ac;vi;es. Could provide insight into
polypharmacology
• Run parallel RNAi screen
Goal: Develop systems level view of small molecule acDvity
26. Small Molecule HTS Summary
• 2,899 FDA-approved compounds screened
• 55 compounds retested active
• Which components of the NF-κB pathway do they hit?
– 17 molecules have target/pathway information in GeneGO
– Literature searches list a few more
[Figure: dose-response curves (activity vs. log concentration, uM) for the most potent actives – Proscillaridin A, Trabectedin, Digoxin]
Miller, S.C. et al, Biochem. Pharmacol., 2010, ASAP
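Summarizing dose-response curves like these usually means extracting a potency value. A minimal, quick-look sketch using log-linear interpolation between bracketing dose points; a Hill-equation fit would be the rigorous choice, and the function name, threshold, and data here are ours, not from the screen:

```python
def potency_crossing(log_conc, activity, threshold=-50.0):
    """Estimate the log concentration at which an inhibition curve crosses
    a response threshold, by linear interpolation between the two
    bracketing dose points."""
    points = list(zip(log_conc, activity))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (y0 - threshold) * (y1 - threshold) <= 0 and y0 != y1:
            # linear interpolation in log-concentration space
            return x0 + (threshold - y0) * (x1 - x0) / (y1 - y0)
    return None  # curve never crosses the threshold
```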
27. RNAi HTS Summary
• Qiagen HDG library – 6886 genes, 4 siRNAs per gene
• A total of 567 genes were knocked down by 1 or more siRNAs
– We consider >= 2 as a "reliable" hit
– 16 reliable hits
– Added in 66 genes for follow-up via a triage procedure
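The >= 2 criterion is a per-gene tally over active siRNAs. A minimal sketch with illustrative identifiers:

```python
from collections import Counter

def reliable_hits(sirna_hits, min_sirnas=2):
    """Call a gene a 'reliable' hit when at least `min_sirnas` of its
    independent siRNAs score active. `sirna_hits` maps each active
    siRNA id to its target gene."""
    per_gene = Counter(sirna_hits.values())
    return sorted(g for g, n in per_gene.items() if n >= min_sirnas)
```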
28. RNAi & Small Molecule
• Based on reporter assays, the only conclusions one can draw are the obvious ones
• Limited by a 1-D signal
• Going to high content gives us much richer data, but more complexity
– Shown to be useful for compounds
– Much more difficult when the phenotypic parameters come from different systems
29. Summary
• Multiple data types are probably the most challenging aspect of data-driven discovery
• Size issues can be addressed with more hardware or by waiting (a bit) longer
• Integration issues require new approaches at both the presentation & algorithmic levels