Big Data in Pharma - Overview and Use Cases
1. Big Data Analyses in Pharma
An Overview
Josef Scheiber, PhD
Managing Director
July 2015
2. Geography
Startup Center in Waldsassen
Main site
Data Analyses and Software
Development
Westpark Center
Garmischer Str. in Munich
Scientific Activities
Since Jan 1, 2015
Basel/Switzerland
Data Curation and customer-
related activities
Prague
150 km
Munich
200 km
Berlin
300 km
Frankfurt
250 km
3. BioVariance at a Glance –
Get the most out of your complex data
Curate.Integrate
Analyze.Model
Visualize.Explore
DECIDE
7. What do we need out of Big Data?
1. What are the inhibitors of kinase X and the five most similar
kinases with IC50 < 1 μM and with MW < 500 from all internal and
external data sources?
2. What assay technologies have been used against my kinase?
Which cell lines?
3. What other proteins are in the same kinase branch as target X,
where there were validated chemical hits from external or
internal sources?
4. If I hit a particular kinase, what would the potential side-effect
profile look like? Which known inhibitor of this kinase has the
best safety profile and the fewest known IC50s?
5. Have I identified other compounds with a bioactivity profile
similar to compound X and with the same core substructure?
6. Can we create a phylochemical tree of kinases and for a new
kinase target place it into the tree on the basis of activity against a
reference panel of compounds?
7. Have I identified all kinases with an x-ray structure (in-house or
external) that are in pathway X?
Bridging Chemical and Biological Data: Implications for Pharmaceutical Drug Discovery
JL Jenkins, J Scheiber, D Mikhailov, A Bender, A Schuffenhauer, B Cornett, V Chan, J
Kondracki, B Rohde, JW Davies (2012). In: Computational Approaches in Cheminformatics and
Bioinformatics. Edited by: A Bender, R Guha. 25-56. John Wiley & Sons, Inc.
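Question 1 above is essentially a filter over integrated activity records. The following is a minimal sketch, assuming the records from internal and external sources have already been merged into one list and that the set of similar kinases is supplied by the caller. The `Activity` type, the function name, and all records are hypothetical placeholders, not real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    compound: str      # compound identifier (placeholder)
    target: str        # kinase name (placeholder)
    ic50_um: float     # IC50 in micromolar
    mol_weight: float  # molecular weight in g/mol
    source: str        # "internal" or "external"


def potent_small_inhibitors(records, targets, max_ic50_um=1.0, max_mw=500.0):
    """Return records against `targets` with IC50 < max_ic50_um and MW < max_mw,
    most potent first."""
    wanted = set(targets)
    hits = [
        r for r in records
        if r.target in wanted and r.ic50_um < max_ic50_um and r.mol_weight < max_mw
    ]
    return sorted(hits, key=lambda r: r.ic50_um)


# Made-up illustrative records; none of these values are measurements.
RECORDS = [
    Activity("cmpd-1", "kinase-X", 0.05, 420.0, "internal"),
    Activity("cmpd-2", "kinase-X", 2.50, 380.0, "external"),  # IC50 too high
    Activity("cmpd-3", "kinase-Y", 0.30, 610.0, "external"),  # MW too high
    Activity("cmpd-4", "kinase-Y", 0.80, 455.0, "external"),
    Activity("cmpd-5", "kinase-Z", 0.10, 300.0, "internal"),  # kinase not requested
]

hits = potent_small_inhibitors(RECORDS, ["kinase-X", "kinase-Y"])
print([h.compound for h in hits])
```

The hard part in practice is not the filter but getting all sources into one consistent record format, which is why the remaining questions depend on curated, integrated data.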
ANSWERS
10. Descriptive:
What happened?
Diagnostic:
Why did it happen?
Predictive:
What will happen?
Prescriptive:
How can we make it
happen?
Better data for better analytics
Hindsight Insight Foresight
11. Need for interpretation
Chart values (%), by era:
Era | Data Analysis | Experiment | Experimental Design
Before molecular biology | 33.3 | 33.3 | 33.3
Molecular biology golden age | 10 | 80 | 10
Genomics age | 20 | 70 | 10
Deep sequencing age | 30 | 60 | 10
Very soon | 70 | 10 | 20
21. Velocity
• Mutations in tumor
• Resistance mechanisms in patients
• Long-term/short-term adverse events (AE)
• Compliance
• Nutrition and microbiome
• Data from wearables relevant for drugs
26. A simplified overview –
Molecules in Man
Adapted from Gohlke JM, Portier CJ.
Environ. Health Perspect. 115:1261-1263 (2007)
27. A question of complexity – They all interact …
Biology
Chemistry
Physics
28. Dealing with a very complex environment –
i.e. many opportunities
DNA
RNA
Protein
Interactions
Clinical parameters
Treatment History
Tissue anatomy
Surgical History
Epigenetic Profiles from many
patients at different
time points
Target
Off-targets
Metabolites
Additional indications
Unspecific effects
Similar drugs
Adapted from: J. Scheiber; How can we enable drug discovery informatics for personalized healthcare?
Expert Opinion on Drug Discovery, 1-6; 2/2011
39. Data integration strategy
a) A central vocabulary/pointer server (it stores
preferred names and synonyms plus pointers to the
data servers, i.e. where to find what)
b) A semantic integration layer with domain-specific
terminology and referential data
c) A database for each datatype collected, storing only
preferred names along with raw measurements
d) Clearly defined APIs for further integration with
public data sources and to enable large-scale
analyses
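Points a) and c) can be illustrated with a small sketch: a vocabulary server that maps synonyms to preferred names and points to the data server holding that entity, plus one store per data type keyed by preferred names only. The class, method names, server key, and measurement values are assumptions made for illustration; they are not the actual implementation.

```python
class VocabularyServer:
    """Maps synonyms to preferred names and preferred names to a data server."""

    def __init__(self):
        self._preferred = {}  # lower-cased name or synonym -> preferred name
        self._location = {}   # preferred name -> key of the data server

    def register(self, preferred, synonyms, data_server):
        self._location[preferred] = data_server
        for name in (preferred, *synonyms):
            self._preferred[name.lower()] = preferred

    def resolve(self, term):
        """Return (preferred name, data server key) for any known synonym."""
        preferred = self._preferred[term.lower()]
        return preferred, self._location[preferred]


# One store per data type; entries are keyed by preferred name only and hold
# raw measurements. The values below are placeholders, not real results.
DATA_SERVERS = {
    "compound_db": {
        "erlotinib": [{"assay": "example-assay", "value": 1.0}],
    },
}

vocab = VocabularyServer()
vocab.register("erlotinib", ["Tarceva"], "compound_db")


def fetch(term):
    """Translate the term, locate its data server, and return the raw data."""
    preferred, server = vocab.resolve(term)
    return DATA_SERVERS[server][preferred]


print(vocab.resolve("TARCEVA"))
print(fetch("Tarceva"))
```

Keeping synonyms only in the vocabulary server means the per-type databases never need to be updated when a new synonym appears.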
41. Answering workflow
Vocabulary
Vocabulary server acts as
translator, aggregator and
locator, i.e. knows where
the respective facts can be
found
Firmicutes produce alpha-Linolein and thereby cause gut irritation
species
metabolite
Further
Data of each type is
stored in a specific
database to
enhance
performance of
large-scale analyses
Expert tools talk to
data directly or via
webservices
API
API
API
API
End-user interface and
visualization
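The translator and locator roles in this workflow can be sketched on the example sentence from the slide: each known term is tagged with its entity type and with the database that holds facts about it. The lookup table, the database names, and the simple tokenizer are hypothetical; a real system would need proper entity recognition.

```python
import re

# Hypothetical vocabulary: lower-cased term -> (preferred name, type, database)
VOCABULARY = {
    "firmicutes": ("Firmicutes", "species", "species_db"),
    "alpha-linolein": ("alpha-Linolein", "metabolite", "metabolite_db"),
}


def annotate(sentence):
    """Tag every known term in the sentence with its type and data location."""
    found = []
    # Words may contain hyphens, so "alpha-Linolein" stays one token.
    for token in re.findall(r"[A-Za-z][A-Za-z\-]*", sentence):
        entry = VOCABULARY.get(token.lower())
        if entry is not None:
            preferred, entity_type, database = entry
            found.append({
                "text": token,
                "preferred": preferred,
                "type": entity_type,
                "database": database,
            })
    return found


annotations = annotate(
    "Firmicutes produce alpha-Linolein and thereby cause gut irritation"
)
for a in annotations:
    print(a["text"], "->", a["type"], "in", a["database"])
```

Each annotation tells an expert tool which API to call next, so the tool itself does not need to know where the data lives.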
49. Not very successful
Alignment of the 3D structures of mutant number 52 (yellow) and the AChE protein from PDB 4EY7 (green).
The only changed residue is Y150 (magenta), mutated to H150 (red).
The white surface represents the molecular surface of donepezil.
50. Why is this a bad example?
AChE is a key enzyme in human biology; such enzymes are
among the most highly conserved, even across species
Lesson learned: check conservation before investing
time
57. Top 200 drugs
- The cutoff is at 1500 tweets, which a few drugs easily surpass (although it is mostly just pharmacies advertising).
- Others are not mentioned once (probably a synonym issue, as I restricted the search to English).
- Top drugs are tweeted about more often, but e.g. Tarceva (in 2006), at the very bottom, also reaches the top number of tweets (109 on list).
58. Questions?
josef.scheiber@biovariance.com
Garmischer Str. 4/V
80339 München
089 – 189 6582 – 80
Konnersreuther Str. 6g
95652 Waldsassen
09632 – 9248 325