Perspective



When Pharmaceutical Companies Publish Large Datasets: An Abundance Of Riches Or Fool’s Gold?




                         Sean Ekins1, 2, 3, 4 and Antony J. Williams5



1 Collaborations in Chemistry, 601 Runnymede Avenue, Jenkintown, PA 19046, U.S.A.

2 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94403.

3 Department of Pharmaceutical Sciences, University of Maryland, MD 21201, U.S.A.

4 Department of Pharmacology, University of Medicine & Dentistry of New Jersey (UMDNJ)-Robert Wood Johnson Medical School, 675 Hoes Lane, Piscataway, NJ 08854.

5 Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC-27587.




Running Head: Pharma Screening Dataset analysis




Corresponding Authors: Sean Ekins, Collaborations in Chemistry, 601 Runnymede

Ave, Jenkintown, PA 19046, Email ekinssean@yahoo.com, Tel 215-687-1320. Antony J.

Williams, Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC-27587.

antony.williams@chemspider.com
Abstract

The recent announcement that GlaxoSmithKline have released a huge tranche of

whole cell malaria screening data to the public domain, accompanied by a corresponding

publication, raises some issues for consideration before this exemplar instance

becomes a trend. We have examined the data from a high level, by studying the

molecular properties, and consider the various alerts presently in use by major pharma

companies. We acknowledge the potential value of such data but also raise the issue of

the actual value of such datasets released into the public domain. We also suggest

approaches that could enhance the value of such datasets to the community and

theoretically offer more immediate benefit to the search for leads for other neglected

diseases.
Introduction



We are currently witnessing significant shifts in the ways that pharmaceutical research

can be accelerated. These approaches include decentralizing research as well as

engagement with the external research communities through crowdsourcing. There is a

marked trend towards collaboration of all kinds [1-5]. In parallel there is a renewed

interest in neglected disease research (on malaria, tuberculosis (TB), kinetoplastids etc.

[6]) due to the significant influence of the US National Institutes of Health (NIH),

foundations such as the Bill and Melinda Gates Foundation, The European Commission

and increasing investment from pharmaceutical companies and others [6,7]. In the past,

drug companies generally published only small amounts of data, in the form of molecular

structures and biological (pharmacology) data, when it was convenient or needed to

influence investors, the public or FDA approval, or after a compound or project was

terminated. With the explosive growth of combinatorial chemistry and high throughput

screening over the past decade, each large pharmaceutical company has

created massive proprietary data bases and invested enormous resources in the purchase

or development of complex informatics platforms. These software systems probably

contain more data than can be realistically mined by companies pursuing their quest for

blockbuster targeted therapies. The quality of the data and the applicability of the assays

can also add a lot of noise into the system. While it is not unusual for academics (and

occasionally pharmaceutical companies) to publish and collate relatively large datasets

(several hundred to thousands of compounds) primarily for quantitative structure activity

analysis (http://www.cheminformatics.org/datasets/index.shtml,
http://www.qsarworld.com/qsar-datasets.php) or compile datasets from NIH funded

screening programs (http://pdsp.med.unc.edu/indexR.html) [8-12], pharmaceutical

companies have been less willing to make larger screening datasets available to the

public and even with such huge efforts there has been little productivity in screening for

antibiotics [13]. This has now changed with the unprecedented release by

GlaxoSmithKline (GSK) of >13,500 in vitro screening hits against malaria using

Plasmodium falciparum along with their associated cytotoxicity (in HepG2 cells) data

from an initial screen of over 2 million compounds ([14] which follows an earlier press

release from GSK http://www.nature.com/news/2010/100120/full/news.2010.20.html).



Hosted GSK malaria data

Three data bases initially hosted the data. These are the European Bioinformatics

Institute-European Molecular Biology Laboratory (EBI-EMBL, ChEMBL

http://www.ebi.ac.uk/chembl/), PubChem (http://pubchem.ncbi.nlm.nih.gov/) and

Collaborative Drug Discovery (CDD, www.collaborativedrug.com). What happens now that

the data are hosted and announced to the community is open to prediction: (1) the malaria

community may ignore them, which is highly unlikely, or (2) the malaria community will be

excited by the availability of the data, will use them in their research and,

potentially, may find a new drug. The second outcome is also perhaps statistically of low probability

and will realistically take many years unless we find ways to accelerate the process.

The question is therefore whether such data depositions are an abundance of riches or

fool’s gold? They may actually be neither.
This massive contribution of data to the community creates a precedent, and

it is expected that other companies will follow suit

(http://www.nature.com/news/2010/100120/full/news.2010.20.html). It also raises many

other questions to which we can only begin to propose responses. Who should get to host

such public data? If the data are truly “open” then anyone can download the data and

reuse, repurpose and host the data. It will be interesting to see what “licensing” is applied

to the different types of data when it is exposed by various companies. The GSK malaria

screening hits data were added to the ChEMBL dataset and the chemical structures are

available for download. In the CDD data base, data associated with public datasets are

generally made available for similarity, substructure and Boolean searching

(http://www.collaborativedrug.com/register). The SDF file of structures and data is also

available for download. In the United States PubChem has established itself as the

de facto repository for screening data, so deposition here represents a drop in the ocean of

over 27 million unique molecule structures, although admittedly just a fraction are

associated with bioactivity data. The deposition and hosting of the data in the three

repositories (ChEMBL, PubChem and CDD) does leverage what each data base has to

offer in terms of integration to existing information on the compounds. We also believe

that deposition into other data bases with links out to others, including the original

hosting organizations, also offers great benefits. We believe that a great model for this

approach is also the ChemSpider data base from the Royal Society of Chemistry

(www.chemspider.com) which has already demonstrated the feasibility of such an

approach and now also hosts the GSK data. We suggest there should be connectivity

between all the data bases hosting this data and others released in the future.
Questions and opportunities upon opening up data

       We can predict that there may be increased competition to woo pharmaceutical

companies to exclusively deposit data in various data bases. There may however be more

utility if the various data bases collaborated to convince companies that releasing data

could be a powerful force for change, with each group contributing their technologies and

own expertise to the task. Whereas microarray data have deposition standards such as

MIAME [15], there are no such standards for chemical structures and biological data.

Who will ensure the quality of the data and act as a host for future annotation and

potential structure validation? It is unlikely to be the data bases clamoring to host the

data. Would any organization be interested in funding future data uploads and data

cleansing? Will companies only deposit compounds into the public domain that are of

less interest to them and have been demonstrated to be inactive against screens while

retaining drug-like hits and negative data which can be valuable for SAR creation? While

we applaud the pharmaceutical companies for donating data to the community, we pose

the question as to whether these contributions may create more noise and less signal in

the public compound data bases. Will people really want to mine the data if it is

perceived as cast-off data, or just represents commercially available screening libraries

with little novel in the way of chemical entities? In many cases we judge that most

researchers will treat these data as of minor importance since in this case it is malaria-

related. However, this may be a mistake as GSK suggest that many of these compounds

are acting on malaria kinases or proteases that could be important for the treatment of

other diseases and which could lead to a drug for other blockbuster diseases. Will there

be any incentives to pursue such neglected disease data? For the time being,
pharmaceutical companies are investing more in such neglected diseases [6], although

much less than on other major diseases.

       This brings to mind the United States Environmental Protection Agency’s (EPA) ToxCast

project [16] which created a data base, at major public expense, and then invited

academics to build computational models or mine the data. Ultimately what is derived

from the data will be dependent on the quality of the data populated into the data base.

The only incentive is that, if something of interest is found, there may be some funding to

investigate further. If pharmaceutical companies are to put data into the public domain

then it may be of value for these or other organizations to consider incentivizing

researchers to mine it, or offer challenges and awards.


GSK malaria screening data analysis

       Driven purely by curiosity regarding the nature of the malaria data recently

deposited into the public domain by GSK [14], we have undertaken a preliminary

evaluation using a simple descriptor analysis as was performed previously for some very

large tuberculosis datasets originally funded by the NIH [17]. In addition we have used

some readily available substructure alerts or “filters” to identify potentially reactive

molecules. Pharmaceutical companies use such computational filters to clean up

screening sets and remove undesirable molecules from vendor libraries [18]. Examples

include filters from GSK [19], and Abbott [20-22]. An academic group also developed an

extensive series of over 400 substructural features for the removal of pan assay

interference compounds from screening libraries [23], which have yet to be integrated

into a public resource. Our simplistic analysis compares the GSK dataset [14] to widely
available drug-like molecules from the MicroSource US drugs dataset

(http://www.msdiscovery.com/usdrugs.html).

       As we would expect, the percentage of GSK malaria screening hit molecules

failing the published “GSK filters” [19] is close to zero while the percentage failing the

Pfizer and Abbott filters [20] is considerably higher (~57% and ~76%, respectively) as these

appear to be more conservative (Table 1). This could be interpreted as representing

different business rule decisions instituted according to their own criteria which we are

not in a position to critically judge. The percentage of failures for the set of US FDA

drugs is lower for both Pfizer and Abbott filters (Table 1), suggesting that these

compounds are by no means perfect but also perhaps setting a threshold. A recently

published study filtered a set of > 1000 marketed drugs and at least 26% failed filters for

molecular features undesirable for high throughput screening [24]. We have also recently

used the same rules to filter sets of compounds with activity against tuberculosis [11,12],

with 81-92% failing the Abbott filters [25] which may be related to mechanism of action.

In the GSK paper they suggest that they did not find any non-specific inhibitors of lactate

dehydrogenase, while cytotoxicity was seen in 1,982 compounds [14]. A detailed analysis

of our calculated molecular descriptors for the GSK malaria hits [14] shows that most are

normally distributed apart from the skewed Lipinski violations data and the bimodal

molecular weight. Table 2 shows the means and standard deviations for each descriptor.

Interestingly 3,269 (24.3%) of the compounds fail more than one of the Lipinski rules of

5 (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10) [26] using the descriptors calculated in the

CDD data base. It should be noted that the performance of logP calculators differs

between various software vendors and this should be taken into account when
considering the calculated values. The GSK screening hits are generally large and very

hydrophobic as is also suggested in their publication [14], and although they suggested

this may be important to reach intracellular targets, there is no discussion of the

limitations of such compounds. One could imagine that these molecular properties may

also present significant solubility challenges [27]. These molecular properties can also be

compared with the published lead-like rules (MW < 350, LogP < 3, Affinity ~0.1 µM)

[28,29], the natural product lead-like rules (MW < 460, LogP < 4.2, Log Sol -5, RBN ≤

10, Rings ≤ 4, HBD ≤ 5, HBA ≤ 9) [30] and the mean molecular properties for the

respiratory drugs which are either delivered by inhalation or intranasal routes (MW ~370,

PSA ~89.2, LogP ~1.7) [31]. In summary, the malaria screening hits in total [14] may not

be ‘lead-like’ and are closest to ‘natural product lead-like’. Although the malaria paper

[14] suggests that the compounds are “drug-like” the evidence for this is weak (and is a

statement open to wide interpretation) in the absence of comparison with drugs or even in

vivo data for any of their hits. These antimalarial hits as a group are also vastly different

to the mean molecular properties of compounds that have shown activity against TB,

which are generally of lower molecular weight, less hydrophobic and with lower pKa and

fewer RBN [17]. The GSK compounds may be used with computational models to find

compounds that are also active against TB in vitro [17]. We have taken Bayesian models

generated from previously published very large screening datasets for TB, and used them

to filter the GSK malaria hits. The two separate models retrieved 4,841 and 6,030

compounds for the single point and dose response models, respectively (Supplemental

data, also available at www.collaborativedrug.com) that would be predicted as active in

the Mycobacterium tuberculosis (Mtb) phenotypic screen from which the data were derived [11,12]. This may
provide a starting point for cherry picking compounds for follow up in vitro. It remains to

be seen whether GSK will publish Mtb screening data for the same compounds but we

would encourage this.
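
The rule-of-5 count used in the descriptor analysis above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Descriptors` record, the function names and the example values are hypothetical, and the descriptors are assumed to have been calculated already by some other tool (in the analysis above they came from the CDD data base). A compound is counted as failing when it breaks more than one of the four limits, matching the criterion quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical record layout)."""
    mw: float    # molecular weight
    logp: float  # calculated logP (values vary between calculators)
    hbd: int     # hydrogen bond donors
    hba: int     # hydrogen bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many rule-of-5 limits are exceeded
    (MW <= 500, logP <= 5, HBD <= 5, HBA <= 10)."""
    checks = (d.mw > 500, d.logp > 5, d.hbd > 5, d.hba > 10)
    return sum(checks)


def fraction_failing(compounds, max_allowed=1):
    """Fraction of compounds with more than `max_allowed` violations."""
    if not compounds:
        return 0.0
    failing = sum(1 for c in compounds if lipinski_violations(c) > max_allowed)
    return failing / len(compounds)


# Illustrative values only; these are NOT compounds from the GSK dataset.
examples = [
    Descriptors(mw=478.2, logp=4.5, hbd=2, hba=6),  # no violations
    Descriptors(mw=612.7, logp=6.1, hbd=1, hba=7),  # MW and logP exceeded
    Descriptors(mw=530.0, logp=3.9, hbd=3, hba=8),  # MW exceeded only
]

if __name__ == "__main__":
    for d in examples:
        print(d, "->", lipinski_violations(d), "violation(s)")
    print("fraction failing more than one rule:", fraction_failing(examples))
```

Because the count depends on the calculated logP, the same compound may pass with one descriptor package and fail with another, which is the caveat raised above about logP calculators.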



Awareness of breaking the rules

       Some companies try to avoid compounds that have reactive groups and fail

related alerts or simply fail ‘the Rule of Five’, as a starting point before screening,

although many chemists believe even this is too stringent and there are several successful

drugs and development candidates that fall outside these requirements and ‘break the

rules’. Depending on the stringency of computational filtering, our analysis suggests that

24-76% of the GSK malaria screening hits would be filtered out, or at the very least flagged as

problematic, using these simple methods. Structural alerts can be difficult

to assess because the presence of a potentially problematic functionality must be judged

in the context of the molecule under consideration, e.g. whether it is required for the activity

pharmacophore. Our results simply indicate that the user should be aware of the

properties and features of the compounds in this dataset [14] before embarking upon lead

optimization. It remains to be seen whether datasets deposited by other research groups will also

follow this trend; this is something we are currently assessing. Certainly, such

computational approaches could be readily used without too much effort to assess large

datasets deposited in the future.
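
The failure rates behind this range can be recomputed directly from the counts reported in the text and in Table 1. The short Python sketch below does only that arithmetic. The dictionary layout and function names are illustrative choices rather than part of any tool used in the analysis, and the dataset sizes are taken as reported (13,471 compounds for the rule-of-5 count, 13,355 for the SMARTS filters).

```python
# (number failing, dataset size) as reported in the text and in Table 1.
REPORTED = {
    "Lipinski (>1 violation)": (3269, 13471),
    "Pfizer LINT": (7683, 13355),
    "Abbott ALARM": (10124, 13355),
    "Glaxo": (129, 13355),
}


def failure_percent(failed, total):
    """Percentage of compounds failing a filter, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * failed / total, 1)


def failure_table(counts):
    """Map each filter name to its failure percentage."""
    return {name: failure_percent(f, n) for name, (f, n) in counts.items()}


if __name__ == "__main__":
    for name, pct in failure_table(REPORTED).items():
        print(f"{name}: {pct}%")
```

Run on the reported counts, this gives 24.3% for the rule-of-5 criterion, 57.5% for the Pfizer filters, 75.8% for the Abbott filters and 1.0% for the Glaxo filters.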



Concluding thoughts
While attitudes on the quality of a compound vary between medicinal chemists, as

a community we need to be vigilant of the data in any public or commercial data base and

this is ultimately the responsibility of the user. At one level the structure fidelity should

be the responsibility of the depositor, but as this study suggests some readily available

tools can be used to evaluate such datasets for chemical properties and can point to issues

that need to be considered before a great deal of time is spent further mining the data.

However, how do the many researchers who may be unaware of such technologies or their limitations

start processing such freely available data? There may be needles in this set of

malaria actives, and computational approaches may be required to pull them out of the

haystack, but at the same time researchers need to be mindful of the compound quality at the

start to avoid blind alleys due to aggregation [32], false positive issues [33-39] or artifacts

[40]. Whoever funds such high throughput screening, whether industry, government, not

for profit or academia, needs to seriously consider which compounds are screened in vitro (and

use appropriate filtering criteria beforehand) and then ultimately where and how

the data is deposited. We sincerely hope that this provides a starting point for discussion

of the many important issues raised above as pharmaceutical companies start depositing

their datasets (whether compound screening, solubility, metabolism, toxicity etc) in data

bases funded by non-profits or private concerns. The GSK release may then have accomplished

something more, with considerable positive implications, than was originally envisaged

by the company. We have waited a long time for this type of landmark data contribution

to the community by GSK and would like those pharmaceutical companies that follow

this exemplary lead to ensure they deliver even higher value and actionable data to the

scientific community.
Acknowledgements

The authors kindly thank Dr. Jeremy Yang and colleagues (University of New Mexico)

for providing access to the Smartsfilter web application and our colleagues for inspiration

and support.



Conflicts of Interest

SE consults for Collaborative Drug Discovery, Inc on a Bill and Melinda Gates

Foundation Grant#49852 “Collaborative drug discovery for TB through a novel data base

of SAR data optimized to promote data archiving and sharing”. He is also on the advisory

board for ChemSpider. AJW is employed by the Royal Society of Chemistry which owns

ChemSpider and associated technologies.



References

1      Bingham, A. and Ekins, S. (2009) Competitive Collaboration in the

       Pharmaceutical and Biotechnology Industry. Drug Discov Today 14, 1079-1081


2      Hunter, A.J. (2008) The Innovative Medicines Initiative: a pre-competitive

       initiative to enhance the biomedical science base of Europe to expedite the

       development of new medicines for patients. Drug Discov Today 13 (9-10), 371-

       373


3      Barnes, M.R. et al. (2009) Lowering industry firewalls: pre-competitive

       informatics initiatives in drug discovery. Nat Rev Drug Discov 8 (9), 701-708
4    Bailey, D.S. and Zanders, E.D. (2008) Drug discovery in the era of Facebook--

     new tools for scientific networking. Drug Discov Today 13 (19-20), 863-868


5    Ekins, S. and Williams, A.J. (2010) Reaching out to collaborators: crowdsourcing

     for pharmaceutical research. Pharm Res 27 (3), 393-395


6    Moran, M. et al. (2009) Neglected disease research and development: how much

     are we really spending? PLoS Med 6 (2), e30


7    Morel, C.M. et al. (2005) Health innovation networks to help developing

     countries address neglected diseases. Science 309 (5733), 401-404


8    Keiser, M.J. et al. (2009) Predicting new molecular targets for known drugs.

     Nature 462 (7270), 175-181


9    O'Connor, K.A. and Roth, B.L. (2005) Finding new tricks for old drugs: an

     efficient route for public-sector drug discovery. Nat Rev Drug Discov 4 (12),

     1005-1014


10   Roth, B.L. et al. (2004) Screening the receptorome to discover the molecular

     targets for plant-derived psychoactive compounds: a novel approach for CNS

     drug discovery. Pharmacol Ther 102 (2), 99-110


11   Maddry, J.A. et al. (2009) Antituberculosis activity of the molecular libraries

     screening center network library. Tuberculosis (Edinb) 89, 354-363


12   Ananthan, S. et al. (2009) High-throughput screening for inhibitors of

     Mycobacterium tuberculosis H37Rv. Tuberculosis (Edinb) 89, 334-353
13   Payne, D.J. et al. (2007) Drugs for bad bugs: confronting the challenges of

     antibacterial discovery. Nat Rev Drug Discov 6, 29-40


14   Gamo, F.-J. et al. (2010) Thousands of chemical starting points for antimalarial

     lead identification. Nature 465, 305-310


15   Brazma, A. et al. (2001) Minimum information about a microarray experiment

     (MIAME)-toward standards for microarray data. Nat Genet 29 (4), 365-371


16   Dix, D.J. et al. (2007) The ToxCast program for prioritizing toxicity testing of

     environmental chemicals. Toxicol Sci 95 (1), 5-12


17   Ekins, S. et al. (2010) A Collaborative Database And Computational Models For

     Tuberculosis Drug Discovery. Mol BioSystems 6, 840-851


18   Williams, A.J. et al. (2009) Free Online Resources Enabling Crowdsourced Drug

     Discovery. Drug Discovery World Winter


19   Hann, M. et al. (1999) Strategic pooling of compounds for high-throughput

     screening. J Chem Inf Comput Sci 39 (5), 897-902


20   Huth, J.R. et al. (2005) ALARM NMR: a rapid and robust experimental method

     to detect reactive false positives in biochemical screens. J Am Chem Soc 127 (1),

     217-224


21   Huth, J.R. et al. (2007) Toxicological evaluation of thiol-reactive compounds

     identified using a La assay to detect reactive molecules by nuclear magnetic

     resonance. Chem Res Toxicol 20 (12), 1752-1759
22   Metz, J.T. et al. (2007) Enhancement of chemical rules for predicting compound

     reactivity towards protein thiol groups. J Comput Aided Mol Des 21 (1-3), 139-

     144


23   Baell, J.B. and Holloway, G.A. (2010) New Substructure Filters for Removal of

     Pan Assay Interference Compounds (PAINS) from Screening Libraries and for

     Their Exclusion in Bioassays. J Med Chem 53, 2719-2740


24   Axerio-Cilies, P. et al. (2009) Investigation of the incidence of "undesirable"

     molecular moieties for high-throughput screening compound libraries in marketed

     drug compounds. Eur J Med Chem 44 (3), 1128-1134


25   Ekins, S. et al. (2010) Analysis and hit filtering of a very large library of

     compounds screened against Mycobacterium tuberculosis Submitted


26   Lipinski, C.A. et al. (1997) Experimental and computational approaches to

     estimate solubility and permeability in drug discovery and development settings.

     Adv Drug Del Rev 23, 3-25


27   Lipinski, C.A. (2000) Drug-like properties and the causes of poor solubility and

     poor permeability. J Pharm Toxicol Methods 44, 235-249


28   Oprea, T.I. (2002) Current trends in lead discovery: are we looking for the

     appropriate properties? J Comput Aided Mol Des 16, 325-334


29   Oprea, T.I. et al. (2001) Is there a difference between leads and drugs? A

     historical perspective. J Chem Inf Comput Sci 41, 1308-1315
30   Rosen, J. et al. (2009) Novel Chemical Space Exploration via Natural Products. J

     Med Chem 52, 1953-1962


31   Ritchie, T.J. et al. (2009) Analysis of the Calculated Physicochemical Properties

     of Respiratory Drugs: Can We Design for Inhaled Drugs Yet? J Chem Inf Model

     49, 1025-1032


32   Seidler, J. et al. (2003) Identification and prediction of promiscuous aggregating

     inhibitors among known drugs. J Med Chem 46, 4477-4486


33   Rishton, G.M. (2008) Molecular diversity in the context of leadlikeness:

     compound properties that enable effective biochemical screening. Curr Opin

     Chem Biol 12 (3), 340-351


34   Rishton, G.M. (2005) Failure and success in modern drug discovery: guiding

     principles in the establishment of high probability of success drug discovery

     organizations. Med Chem 1 (5), 519-527


35   Oprea, T.I. et al. (2009) A crowdsourcing evaluation of the NIH chemical probes.

     Nat Chem Biol 5 (7), 441-447


36   Coan, K.E. and Shoichet, B.K. (2008) Stoichiometry and physical chemistry of

     promiscuous aggregate-based inhibitors. J Am Chem Soc 130 (29), 9606-9612


37   Feng, B.Y. et al. (2007) A high-throughput screen for aggregation-based

     inhibition in a large compound library. J Med Chem 50 (10), 2385-2390
38   Jadhav, A. et al. (2010) Quantitative analyses of aggregation, autofluorescence, and

     reactivity artifacts in a screen for inhibitors of a thiol protease. J Med Chem 53

     (1), 37-51


39   Doak, A.K. et al. Colloid Formation by Drugs in Simulated Intestinal Fluid. J

     Med Chem


40   Schmidt, C. (2010) GSK/Sirtris compounds dogged by assay artifacts. Nat

     Biotechnol 28 (3), 185-186
Table 1. Summary of SMARTS filter failures for various datasets. The Abbott ALARM

[20], Glaxo [19] and Blake SMARTS filter calculation were performed through the

Smartsfilter web application, Division of Biocomputing, Dept. of Biochem & Mol

Biology, University of New Mexico, Albuquerque, NM,

(http://pangolin.health.unm.edu/tomcat/biocomp/smartsfilter). The GSK malaria

screening data (N = 13,471) was obtained [14] from the CDD data base. We also used

the MicroSource US Drugs dataset (N = 1039) as a reference set of “drug-like”

molecules. Large datasets > 1000 molecules were fragmented into smaller SDF files

before running through this website.


Dataset (N)                        Number failing the     Number failing       Number failing
                                   Abbott ALARM           Pfizer LINT          Glaxo filters
                                   filters [20] (%)       filters * (%)        [19] (%)

GSK malaria hits (paper in         10124 (75.8)           7683 (57.5)          129 (1.0)
press) (13,355)

Microsource US FDA drugs           688 (66.1)             516 (49.6)           143 (13.7)
(1041)

*Originally provided as a Sybyl script to Tripos by Dr. James Blake (Array Biopharma)

while at Pfizer.
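
The fragmentation of large SDF files into smaller ones, mentioned in the Table 1 legend, can be done with a few lines of Python. The sketch below is a hypothetical helper, not the procedure actually used for the table. It relies only on the SDF convention that each record ends with a line containing `$$$$`, and the limit of 1000 records per chunk is an assumption based on the legend.

```python
def split_sdf(sdf_text, max_records=1000):
    """Split SDF text into chunks holding at most `max_records` molecules.

    Each SDF record is terminated by a line containing only '$$$$'.
    """
    if max_records <= 0:
        raise ValueError("max_records must be positive")

    records = []
    current = []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []

    # Keep a trailing record that lacks its terminator, adding one.
    if any(line.strip() for line in current):
        records.append("\n".join(current) + "\n$$$$\n")

    return [
        "".join(records[i:i + max_records])
        for i in range(0, len(records), max_records)
    ]


if __name__ == "__main__":
    # Tiny made-up record, just to show the chunking behaviour.
    record = "mol\n  hypothetical\n\nM  END\n$$$$\n"
    chunks = split_sdf(record * 5, max_records=2)
    print([chunk.count("$$$$") for chunk in chunks])
```

Each returned chunk is itself valid SDF text and can be written to its own file before submission to a web application with an upload limit.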
Table 2. Mean (SD) of molecular descriptors from the CDD data base for the GSK dataset. MW = molecular weight, HBD =

hydrogen bond donor, HBA = Hydrogen bond acceptor, Lipinski = rule of 5 score, PSA = polar surface area, RBN = rotatable bond

number. Molecular properties were calculated using the Marvin plug-in (ChemAxon, Budapest, Hungary) within the CDD data base.




Dataset                  MW                logP          HBD           HBA           Lipinski rule    pKa           PSA              RBN
                                                                                     of 5 alerts

GSK data (N = 13,471)    478.16 (114.34)   4.53 (1.58)   1.83 (1.04)   5.60 (1.99)   0.82 (0.83)      6.67 (3.72)   76.85 (30.05)    7.17 (3.37)

More Related Content

What's hot

Slides for rare disorders meeting
Slides for rare disorders meetingSlides for rare disorders meeting
Slides for rare disorders meetingSean Ekins
 
Data Mining and Big Data Analytics in Pharma
Data Mining and Big Data Analytics in Pharma Data Mining and Big Data Analytics in Pharma
Data Mining and Big Data Analytics in Pharma Ankur Khanna
 
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
2016 ACS Semantic Approaches for Biochemical Knowledge Discovery2016 ACS Semantic Approaches for Biochemical Knowledge Discovery
2016 ACS Semantic Approaches for Biochemical Knowledge DiscoveryMichel Dumontier
 
Mashing Up Drug Discovery
Mashing Up Drug DiscoveryMashing Up Drug Discovery
Mashing Up Drug DiscoverySciBite Limited
 
Pharma data analytics
Pharma data analyticsPharma data analytics
Pharma data analyticsAxon Lawyers
 
Breaking Down Information Silos
Breaking Down Information SilosBreaking Down Information Silos
Breaking Down Information SilosChris Waller
 
Big Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and ChallengesBig Data in Genomics: Opportunities and Challenges
Big Data in Genomics: Opportunities and ChallengesMatthieu Schapranow
 
Current advances to bridge the usability-expressivity gap in biomedical seman...
Current advances to bridge the usability-expressivity gap in biomedical seman...Current advances to bridge the usability-expressivity gap in biomedical seman...
Current advances to bridge the usability-expressivity gap in biomedical seman...Maulik Kamdar
 
Semantic approaches for biomedical knowledge discovery - Discovery Science 20...
Semantic approaches for biomedical knowledge discovery - Discovery Science 20...Semantic approaches for biomedical knowledge discovery - Discovery Science 20...
Semantic approaches for biomedical knowledge discovery - Discovery Science 20...Michel Dumontier
 
Big Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesBig Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesJosef Scheiber
 
Building a Culture of Model-driven Drug Discovery at Merck
Building a Culture of Model-driven Drug Discovery at MerckBuilding a Culture of Model-driven Drug Discovery at Merck
Building a Culture of Model-driven Drug Discovery at MerckChris Waller
 
A Federated In-Memory Database System for Life Sciences
A Federated In-Memory Database System for Life SciencesA Federated In-Memory Database System for Life Sciences
A Federated In-Memory Database System for Life SciencesMatthieu Schapranow
 
Desperately seeking DARCP
Desperately seeking DARCPDesperately seeking DARCP
Desperately seeking DARCPChris Southan
 
2009 09 Lod London
2009 09 Lod London2009 09 Lod London
2009 09 Lod LondonJun Zhao
 
Big Data and its Impact on Industry (Example of the Pharmaceutical Industry)
Big Data and its Impact on Industry (Example of the Pharmaceutical Industry)Big Data and its Impact on Industry (Example of the Pharmaceutical Industry)
Big Data and its Impact on Industry (Example of the Pharmaceutical Industry)Hellmuth Broda
 
2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europe2011-10-11 Open PHACTS at BioIT World Europe
2011-10-11 Open PHACTS at BioIT World Europeopen_phacts
 
Web-scale Discovery Tools and the Backgrounding of Government Information
Web-scale Discovery Tools and the Backgrounding of Government InformationWeb-scale Discovery Tools and the Backgrounding of Government Information
Web-scale Discovery Tools and the Backgrounding of Government InformationChristopher Brown
 

When pharmaceutical companies publish large datasets an abundance of riches or fools gold

  • 1. Perspective When Pharmaceutical Companies Publish Large Datasets: An Abundance Of Riches Or Fool’s Gold Sean Ekins1, 2, 3, 4 and Antony J. Williams5 1 Collaborations in Chemistry, 601 Runnymede Avenue, Jenkintown, PA 19046, U.S.A. 2 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94403. 3 Department of Pharmaceutical Sciences, University of Maryland, MD 21201, U.S.A. 4 Department of Pharmacology, University of Medicine & Dentistry of New Jersey (UMDNJ)-Robert Wood Johnson Medical School, 675 Hoes lane, Piscataway, NJ 08854. 5 Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC-27587. Running Head: Pharma Screening Dataset analysis Corresponding Authors: Sean Ekins, Collaborations in Chemistry, 601 Runnymede Ave, Jenkintown, PA 19046, Email ekinssean@yahoo.com, Tel 215-687-1320. Antony J. Williams, Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC-27587. antony.williams@chemspider.com
  • 2. Abstract The recent announcement that GlaxoSmithKline has released a huge tranche of whole cell malaria screening data to the public domain, accompanied by a corresponding publication, raises some issues for consideration before this exemplar instance becomes a trend. We have examined the data from a high level by studying the molecular properties and considering the various alerts presently in use by major pharma companies. We acknowledge the potential value of such data but also question the actual value of such datasets released into the public domain. We also suggest approaches that could enhance the value of such datasets to the community and theoretically offer more immediate benefit to the search for leads for other neglected diseases.
  • 3. Introduction We are currently witnessing significant shifts in the ways that pharmaceutical research can be accelerated. These approaches include decentralizing research as well as engaging with the external research communities through crowdsourcing. There is a marked trend towards collaboration of all kinds [1-5]. In parallel there is a renewed interest in neglected disease research (on malaria, tuberculosis (TB), kinetoplastids etc. [6]) due to the significant influence of the US National Institutes of Health (NIH), foundations such as the Bill and Melinda Gates Foundation, the European Commission and increasing investment from pharmaceutical companies and others [6,7]. In the past drug companies generally published only small chunks of data, in the form of molecular structures and biological (pharmacology) data, when it was convenient: as needed to influence investors, the public or FDA approval, or after a compound or project was terminated. With the explosive growth of combinatorial chemistry and high throughput screening over the past decade, each large pharmaceutical company has created massive proprietary data bases and invested enormous resources in the purchase or development of complex informatics platforms. These software systems probably contain more data than can be realistically mined, accumulated as the companies have pursued their quest for blockbuster targeted therapies. The quality of the data and the applicability of the assays can also add a lot of noise to the system. While it is not unusual for academics (and occasionally pharmaceutical companies) to publish and collate relatively large datasets (several hundred to thousands of compounds) primarily for quantitative structure-activity analysis (http://www.cheminformatics.org/datasets/index.shtml,
  • 4. http://www.qsarworld.com/qsar-datasets.php) or compile datasets from NIH funded screening programs (http://pdsp.med.unc.edu/indexR.html) [8-12], pharmaceutical companies have been less willing to make larger screening datasets available to the public, and even with such huge efforts there has been little productivity in screening for antibiotics [13]. This has now changed with the unprecedented release by GlaxoSmithKline (GSK) of >13,500 in vitro screening hits against malaria using Plasmodium falciparum, along with their associated cytotoxicity (in HepG2 cells) data, from an initial screen of over 2 million compounds ([14], which follows an earlier press release from GSK, http://www.nature.com/news/2010/100120/full/news.2010.20.html). Hosted GSK malaria data Three data bases initially hosted the data. These are the European Bioinformatics Institute-European Molecular Biology Laboratory (EBI-EMBL, ChEMBL http://www.ebi.ac.uk/chembl/), PubChem (http://pubchem.ncbi.nlm.nih.gov/) and Collaborative Drug Discovery (CDD, www.collaborativedrug.com). What happens now that the data are hosted and announced to the community is open to prediction: 1. the malaria community may ignore the data, which is highly unlikely, or 2. the malaria community will be excited by the availability of the data, will use the data in their research and, potentially, may find a new drug. The latter outcome is also perhaps statistically of low probability and will realistically take many years unless we find ways to accelerate the process. The question is therefore whether such data depositions are an abundance of riches or fool’s gold. They may actually be neither.
  • 5. This massive contribution of data to the community sets a precedent, and it is expected that other companies will follow suit (http://www.nature.com/news/2010/100120/full/news.2010.20.html). It also raises many other questions which we can only begin to answer. Who should get to host such public data? If the data are truly “open” then anyone can download, reuse, repurpose and host the data. It will be interesting to see what “licensing” is applied to the different types of data when they are exposed by various companies. The GSK malaria screening hits data were added to the ChEMBL dataset and the chemical structures are available for download. In the CDD data base, data associated with public datasets are generally made available for similarity, substructure and Boolean searching (http://www.collaborativedrug.com/register). The sdf file of structures and data is also available for download. In the United States PubChem has established itself as the de facto repository for screening data, so deposition here represents a drop in the ocean of over 27 million unique molecule structures, although admittedly just a fraction are associated with bioactivity data. The deposition and hosting of the data in the three repositories (ChEMBL, PubChem and CDD) does leverage what each data base has to offer in terms of integration with existing information on the compounds. We also believe that deposition into other data bases with links out to others, including the original hosting organizations, offers great benefits. We believe that a great model for this approach is the ChemSpider data base from the Royal Society of Chemistry (www.chemspider.com), which has already demonstrated the feasibility of such an approach and now also hosts the GSK data. We suggest there should be connectivity between all the data bases hosting these data and others released in the future.
  • 6. Questions and opportunities upon opening up data We can predict that there may be increased competition to woo pharmaceutical companies to exclusively deposit data in various data bases. There may however be more utility if the various data bases collaborated to convince companies that releasing data could be a powerful force for change, with each group contributing their technologies and own expertise to the task. Unlike microarray data, for which deposition standards such as MIAME [15] exist, there are no such standards for chemical structures and biological data. Who will ensure the quality of the data and act as a host for future annotation and potential structure validation? It is unlikely to be the data bases clamoring to host the data. Would any organization be interested in funding future data uploads and data cleansing? Will companies only deposit compounds into the public domain that are of less interest to them and have been demonstrated to be inactive against screens, while retaining drug-like hits and negative data which can be valuable for SAR creation? While we applaud the pharmaceutical companies for donating data to the community, we pose the question as to whether these contributions may create more noise and less signal in the public compound data bases. Will people really want to mine the data if they are perceived as cast-off data, or just represent commercially available screening libraries with little in the way of novel chemical entities? We judge that most researchers will treat these data as of minor importance since in this case they are malaria-related. However, this may be a mistake, as GSK suggest that many of these compounds are acting on malaria kinases or proteases that could be important for the treatment of other diseases and which could lead to a drug for other blockbuster diseases. Will there be any incentives to pursue such neglected disease data? For the time being
  • 7. pharmaceutical companies are investing more in such neglected diseases [6], although much less than on other major diseases. This brings to mind the United States Environmental Protection Agency’s (EPA) ToxCast project [16], which created a data base at major public expense and then invited academics to build computational models or mine the data. Ultimately what is derived from the data will depend on the quality of the data populated into the data base. The only incentive is that, if something of interest is found, there may be some funding to investigate further. If pharmaceutical companies are to put data into the public domain then it may be of value for these or other organizations to consider incentivizing researchers to mine the data, or to offer challenges and awards.

GSK malaria screening data analysis

Driven purely by curiosity regarding the nature of the malaria data recently deposited into the public domain by GSK [14], we have undertaken a preliminary evaluation using a simple descriptor analysis, as was performed previously for some very large tuberculosis datasets originally funded by the NIH [17]. In addition we have used some readily available substructure alerts or “filters” to identify potentially reactive molecules. Pharmaceutical companies use such computational filters to clean up screening sets and remove undesirable molecules from vendor libraries [18]. Examples include filters from GSK [19] and Abbott [20-22]. An academic group also developed an extensive series of over 400 substructural features for the removal of pan assay interference compounds from screening libraries [23], which have yet to be integrated into a public resource. Our simplistic analysis compares the GSK dataset [14] to widely
  • 8. available drug-like molecules from the MicroSource US drugs dataset (http://www.msdiscovery.com/usdrugs.html). As we would expect, the percentage of GSK malaria screening hit molecules failing the published “GSK filters” [19] is close to zero, while the percentages failing the Pfizer and Abbott filters [20] are considerably higher (~57% and ~76%, respectively) as these appear to be more conservative (Table 1). This could be interpreted as representing different business rule decisions instituted by each company according to its own criteria, which we are not in a position to critically judge. The percentage of failures for the set of US FDA drugs is lower for both the Pfizer and Abbott filters (Table 1), suggesting that these compounds are by no means perfect but perhaps also setting a threshold. A recently published study filtered a set of > 1000 marketed drugs and at least 26% failed filters for molecular features undesirable for high throughput screening [24]. We have also recently used the same rules to filter sets of compounds with activity against tuberculosis [11,12], with 81-92% failing the Abbott filters [25], which may be related to mechanism of action. In the GSK paper the authors suggest that they did not find any non-specific inhibitors of lactate dehydrogenase, while cytotoxicity was seen in 1,982 compounds [14].

A detailed analysis of our calculated molecular descriptors for the GSK malaria hits [14] shows that most are normally distributed, apart from the skewed Lipinski violations data and the bimodal molecular weight. Table 2 shows the means and standard deviations for each descriptor. Interestingly, 3,269 (24.3%) of the compounds fail more than one of the Lipinski rule of 5 criteria (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10) [26] using the descriptors calculated in the CDD data base. It should be noted that the performance of logP calculators differs between software vendors and this should be taken into account when
  • 9. considering the calculated values. The GSK screening hits are generally large and very hydrophobic, as is also suggested in their publication [14], and although the authors suggested this may be important to reach intracellular targets, there is no discussion of the limitations of such compounds. One could imagine that these molecular properties may also present significant solubility challenges [27]. These molecular properties can also be compared with the published lead-like rules (MW < 350, LogP < 3, Affinity ~0.1 µM) [28,29], the natural product lead-like rules (MW < 460, Log P < 4.2, Log Sol -5, RBN ≤ 10, Rings ≤ 4, HBD ≤ 5, HBA ≤ 9) [30] and the mean molecular properties for the respiratory drugs which are delivered by either inhalation or intranasal routes (MW ~370, PSA ~89.2, LogP ~1.7) [31]. In summary, the malaria screening hits in total [14] may not be ‘lead-like’ and are closest to ‘natural product lead-like’. Although the malaria paper [14] suggests that the compounds are “drug-like”, the evidence for this is weak (and the statement is open to wide interpretation) in the absence of comparison with drugs or even in vivo data for any of the hits. These antimalarial hits as a group are also vastly different from compounds that have shown activity against TB, which are generally of lower molecular weight, less hydrophobic and with lower pKa and fewer RBN [17].

The GSK compounds may also be used with computational models to find compounds that are active against TB in vitro [17]. We have taken Bayesian models generated from previously published very large screening datasets for TB, and used them to filter the GSK malaria hits. The two separate models retrieved 4,841 and 6,030 compounds for the single point and dose response models, respectively (Supplemental data, also available at www.collaborativedrug.com) that would be predicted as active in the Mtb phenotypic screen from which the data were derived [11,12]. This may
  • 10. provide a starting point for cherry picking compounds for follow up in vitro. It remains to be seen whether GSK will publish Mtb screening data for the same compounds but we would encourage this.

Awareness of breaking the rules

Some companies try to avoid compounds that have reactive groups and fail related alerts, or that simply fail ‘the Rule of Five’, as a starting point before screening, although many chemists believe even this is too stringent and there are several successful drugs and development candidates that fall outside these requirements and ‘break the rules’. Depending on the stringency of computational filtering, our analysis suggests that at a minimum 24-76% of the GSK malaria screening hits would be filtered out, or at the very least flagged as problematic, using these simple methods. Structural alerts can be difficult to assess because the presence of a potentially problematic functionality must be judged in the context of the molecule under consideration (e.g. is it required for the activity pharmacophore?). Our results simply indicate that the user should be aware of the properties and features of the compounds in this dataset [14] before embarking upon lead optimization. It remains to be seen if datasets deposited by other research groups will also follow this trend, which is something we are currently assessing. Certainly, such computational approaches could be readily used without too much effort to assess large datasets deposited in the future.

Concluding thoughts
  • 11. While attitudes on the quality of a compound vary between medicinal chemists, as a community we need to be vigilant about the data in any public or commercial data base, and this is ultimately the responsibility of the user. At one level the structure fidelity should be the responsibility of the depositor, but as this study suggests, some readily available tools can be used to evaluate such datasets for chemical properties and can point to issues that need to be considered before a great deal of time is spent further mining the data. However, how do the many users who may be unaware of such technologies or their limitations start processing such freely available data? There may be needles in this set of malaria actives and computational approaches may be required to pull them out of the haystack, but at the same time users need to be mindful of the compound quality at the start to avoid blind alleys due to aggregation [32], false positive issues [33-39] or artifacts [40]. Whoever funds such high throughput screening, whether industry, government, not for profit or academia, needs to seriously consider which compounds are screened in vitro (using appropriate filtering criteria beforehand) and then ultimately where and how the data are deposited. We sincerely hope that this provides a starting point for discussion of the many important issues raised above as pharmaceutical companies start depositing their datasets (whether compound screening, solubility, metabolism, toxicity etc.) in data bases funded by non-profits or private concerns. Such a release may then accomplish something more, with considerable positive implications, than was originally envisaged by the company. We have waited a long time for this type of landmark data contribution to the community by GSK and would like those pharmaceutical companies that follow this exemplary lead to ensure they deliver even higher value and actionable data to the scientific community.
  • 12. Acknowledgements

The authors kindly thank Dr. Jeremy Yang and colleagues (University of New Mexico) for providing access to the Smartsfilter web application and our colleagues for inspiration and support.

Conflicts of Interest

SE consults for Collaborative Drug Discovery, Inc. on a Bill and Melinda Gates Foundation Grant #49852 “Collaborative drug discovery for TB through a novel data base of SAR data optimized to promote data archiving and sharing”. He is also on the advisory board for ChemSpider. AJW is employed by the Royal Society of Chemistry which owns ChemSpider and associated technologies.

References

1 Bingham, A. and Ekins, S. (2009) Competitive Collaboration in the Pharmaceutical and Biotechnology Industry. Drug Discov Today 14, 1079-1081
2 Hunter, A.J. (2008) The Innovative Medicines Initiative: a pre-competitive initiative to enhance the biomedical science base of Europe to expedite the development of new medicines for patients. Drug Discov Today 13 (9-10), 371-373
3 Barnes, M.R. et al. (2009) Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery. Nat Rev Drug Discov 8 (9), 701-708
  • 13. 4 Bailey, D.S. and Zanders, E.D. (2008) Drug discovery in the era of Facebook--new tools for scientific networking. Drug Discov Today 13 (19-20), 863-868
5 Ekins, S. and Williams, A.J. (2010) Reaching out to collaborators: crowdsourcing for pharmaceutical research. Pharm Res 27 (3), 393-395
6 Moran, M. et al. (2009) Neglected disease research and development: how much are we really spending? PLoS Med 6 (2), e30
7 Morel, C.M. et al. (2005) Health innovation networks to help developing countries address neglected diseases. Science 309 (5733), 401-404
8 Keiser, M.J. et al. (2009) Predicting new molecular targets for known drugs. Nature 462 (7270), 175-181
9 O'Connor, K.A. and Roth, B.L. (2005) Finding new tricks for old drugs: an efficient route for public-sector drug discovery. Nat Rev Drug Discov 4 (12), 1005-1014
10 Roth, B.L. et al. (2004) Screening the receptorome to discover the molecular targets for plant-derived psychoactive compounds: a novel approach for CNS drug discovery. Pharmacol Ther 102 (2), 99-110
11 Maddry, J.A. et al. (2009) Antituberculosis activity of the molecular libraries screening center network library. Tuberculosis (Edinb) 89, 354-363
12 Ananthan, S. et al. (2009) High-throughput screening for inhibitors of Mycobacterium tuberculosis H37Rv. Tuberculosis (Edinb) 89, 334-353
  • 14. 13 Payne, D.A. et al. (2007) Drugs for bad bugs: confronting the challenges of antibacterial discovery. Nat Rev Drug Discov 6, 29-40
14 Gamo, F.-J. et al. (2010) Thousands of chemical starting points for antimalarial lead identification. Nature 465, 305-310
15 Brazma, A. et al. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29 (4), 365-371
16 Dix, D.J. et al. (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95 (1), 5-12
17 Ekins, S. et al. (2010) A Collaborative Database And Computational Models For Tuberculosis Drug Discovery. Mol BioSystems 6, 840-851
18 Williams, A.J. et al. (2009) Free Online Resources Enabling Crowdsourced Drug Discovery. Drug Discovery World Winter
19 Hann, M. et al. (1999) Strategic pooling of compounds for high-throughput screening. J Chem Inf Comput Sci 39 (5), 897-902
20 Huth, J.R. et al. (2005) ALARM NMR: a rapid and robust experimental method to detect reactive false positives in biochemical screens. J Am Chem Soc 127 (1), 217-224
21 Huth, J.R. et al. (2007) Toxicological evaluation of thiol-reactive compounds identified using a la assay to detect reactive molecules by nuclear magnetic resonance. Chem Res Toxicol 20 (12), 1752-1759
  • 15. 22 Metz, J.T. et al. (2007) Enhancement of chemical rules for predicting compound reactivity towards protein thiol groups. J Comput Aided Mol Des 21 (1-3), 139-144
23 Baell, J.B. and Holloway, G.A. (2010) New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays. J Med Chem 53, 2719-2740
24 Axerio-Cilies, P. et al. (2009) Investigation of the incidence of "undesirable" molecular moieties for high-throughput screening compound libraries in marketed drug compounds. Eur J Med Chem 44 (3), 1128-1134
25 Ekins, S. et al. (2010) Analysis and hit filtering of a very large library of compounds screened against Mycobacterium tuberculosis. Submitted
26 Lipinski, C.A. et al. (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Del Rev 23, 3-25
27 Lipinski, C.A. (2000) Drug-like properties and the causes of poor solubility and poor permeability. J Pharm Toxicol Methods 44, 235-249
28 Oprea, T.I. (2002) Current trends in lead discovery: are we looking for the appropriate properties? J Comput Aided Mol Des 16, 325-334
29 Oprea, T.I. et al. (2001) Is there a difference between leads and drugs? A historical perspective. J Chem Inf Comput Sci 41, 1308-1315
  • 16. 30 Rosen, J. et al. (2009) Novel Chemical Space Exploration via Natural Products. J Med Chem 52, 1953-1962
31 Ritchie, T.J. et al. (2009) Analysis of the Calculated Physicochemical Properties of Respiratory Drugs: Can We Design for Inhaled Drugs Yet? J Chem Inf Model 49, 1025-1032
32 Seidler, J. et al. (2003) Identification and prediction of promiscuous aggregating inhibitors among known drugs. J Med Chem 46, 4477-4486
33 Rishton, G.M. (2008) Molecular diversity in the context of leadlikeness: compound properties that enable effective biochemical screening. Curr Opin Chem Biol 12 (3), 340-351
34 Rishton, G.M. (2005) Failure and success in modern drug discovery: guiding principles in the establishment of high probability of success drug discovery organizations. Med Chem 1 (5), 519-527
35 Oprea, T.I. et al. (2009) A crowdsourcing evaluation of the NIH chemical probes. Nat Chem Biol 5 (7), 441-447
36 Coan, K.E. and Shoichet, B.K. (2008) Stoichiometry and physical chemistry of promiscuous aggregate-based inhibitors. J Am Chem Soc 130 (29), 9606-9612
37 Feng, B.Y. et al. (2007) A high-throughput screen for aggregation-based inhibition in a large compound library. J Med Chem 50 (10), 2385-2390
  • 17. 38 Jadhav, A. et al. Quantitative analyses of aggregation, autofluorescence, and reactivity artifacts in a screen for inhibitors of a thiol protease. J Med Chem 53 (1), 37-51
39 Doak, A.K. et al. Colloid Formation by Drugs in Simulated Intestinal Fluid. J Med Chem
40 Schmidt, C. (2010) GSK/Sirtris compounds dogged by assay artifacts. Nat Biotechnol 28 (3), 185-186
  • 18. Table 1. Summary of SMARTS filter failures for various datasets. The Abbott ALARM [20], Glaxo [19] and Blake SMARTS filter calculations were performed through the Smartsfilter web application, Division of Biocomputing, Dept. of Biochem & Mol Biology, University of New Mexico, Albuquerque, NM (http://pangolin.health.unm.edu/tomcat/biocomp/smartsfilter). The GSK malaria screening data (N = 13,471) was obtained [14] from the CDD data base. We also used the MicroSource US Drugs dataset (N = 1039) as a reference set of “drug-like” molecules. Large datasets (> 1000 molecules) were fragmented into smaller SDF files before running through this website.

Dataset (N)                                    Number failing the Abbott    Number failing Pfizer    Number that failed Glaxo
                                               ALARM filters [20] (%)       LINT filters * (%)       filters [19] (%)
GSK Malaria hits (paper in press) (13,355)     10124 (75.8)                 7683 (57.5)              129 (0.01)
Microsource US FDA drugs (1041)                688 (66.1)                   516 (49.6)               143 (13.7)
  • 19. *Originally provided as a Sybyl script to Tripos by Dr. James Blake (Array Biopharma) while at Pfizer.
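The Table 1 caption notes that datasets of more than 1000 molecules were fragmented into smaller SDF files before submission to the Smartsfilter web application. A minimal sketch of one way such a split could be done is given here, assuming only the standard SD file convention that each record ends with a `$$$$` line. The function names `iter_sdf_records` and `chunk_sdf` and the 1000-record default are illustrative assumptions, not the procedure the authors actually used.

```python
from typing import Iterator, List

SDF_DELIMITER = "$$$$"  # line that terminates each record in an SD file


def iter_sdf_records(lines: Iterator[str]) -> Iterator[str]:
    """Yield one SDF record at a time, including its terminating $$$$ line."""
    buffer: List[str] = []
    for line in lines:
        buffer.append(line)
        if line.strip() == SDF_DELIMITER:
            yield "".join(buffer)
            buffer = []
    # Any trailing text without a delimiter is an incomplete record and is
    # dropped rather than guessed at.


def chunk_sdf(sdf_text: str, max_records: int = 1000) -> List[str]:
    """Split SDF text into chunks holding at most max_records records each."""
    if max_records < 1:
        raise ValueError("max_records must be at least 1")
    chunks: List[str] = []
    current: List[str] = []
    for record in iter_sdf_records(iter(sdf_text.splitlines(keepends=True))):
        current.append(record)
        if len(current) == max_records:
            chunks.append("".join(current))
            current = []
    if current:  # final, partially filled chunk
        chunks.append("".join(current))
    return chunks


# Hypothetical placeholder record; a real record would hold a full molfile block.
fake_record = "compound\n  placeholder\n\nM  END\n$$$$\n"
chunks = chunk_sdf(fake_record * 2500, max_records=1000)
print([chunk.count(SDF_DELIMITER) for chunk in chunks])
```

Each chunk could then be written to its own file and submitted separately; concatenating the chunks in order reproduces the original records.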
  • 20. Table 2. Mean (SD) of molecular descriptors from the CDD data base for the GSK dataset. MW = molecular weight, HBD = hydrogen bond donor, HBA = hydrogen bond acceptor, Lipinski = rule of 5 score, PSA = polar surface area, RBN = rotatable bond number. Molecular properties were calculated using the Marvin plug-in (ChemAxon, Budapest, Hungary) within the CDD data base.

Dataset                  MW                logP          HBD           HBA           Lipinski rule of 5 alerts   pKa           PSA              RBN
GSK data (N = 13,471)    478.16 (114.34)   4.53 (1.58)   1.83 (1.04)   5.60 (1.99)   0.82 (0.83)                 6.67 (3.72)   76.85 (30.05)    7.17 (3.37)
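The rule of 5 alert count summarized in Table 2 can be illustrated with a short sketch. It assumes the four thresholds quoted in the text (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10) and takes pre-calculated descriptors as input; in the paper these came from the Marvin plug-in within the CDD data base, and values from other logP calculators may differ. The function names and the example descriptor tuples are hypothetical and are not taken from the GSK dataset.

```python
from typing import Iterable, Tuple

# Thresholds as quoted in the text: MW <= 500, logP <= 5, HBD <= 5, HBA <= 10.
MAX_MW, MAX_LOGP, MAX_HBD, MAX_HBA = 500.0, 5.0, 5, 10


def lipinski_violations(mw: float, logp: float, hbd: float, hba: float) -> int:
    """Count how many of the four rule of 5 criteria a compound breaks."""
    return sum((mw > MAX_MW, logp > MAX_LOGP, hbd > MAX_HBD, hba > MAX_HBA))


def percent_failing_more_than_one(
    compounds: Iterable[Tuple[float, float, float, float]]
) -> float:
    """Percentage of compounds that break more than one criterion."""
    counts = [lipinski_violations(*descriptors) for descriptors in compounds]
    if not counts:
        raise ValueError("no compounds supplied")
    return 100.0 * sum(n > 1 for n in counts) / len(counts)


# Hypothetical descriptor tuples (MW, logP, HBD, HBA) for illustration only.
example = [
    (478.2, 4.5, 2, 6),   # breaks no criterion
    (612.7, 6.1, 1, 7),   # breaks MW and logP
    (530.4, 3.9, 3, 8),   # breaks MW only
    (705.9, 5.8, 6, 12),  # breaks all four
]
print(percent_failing_more_than_one(example))
```

Applied to the Table 2 means (MW 478.16, logP 4.53, HBD 1.83, HBA 5.60), the count is zero, which is a reminder that a mean is not a molecule: the per-compound tally (24.3% failing more than one criterion, as reported in the text) is the more informative figure.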