An introduction to the use of AI in the pharmaceutical industry. Machine learning algorithms and use cases for the pharma industry are discussed in this presentation, delivered during the FDP for Andhra University.
1. AI FOR PHARMACEUTICAL INDUSTRY – A PRIMER
GOPI KRISHNA NUTI
VICE PRESIDENT, MUST RESEARCH
LEAD DATA SCIENTIST, AUTODESK
HTTPS://WWW.LINKEDIN.COM/IN/NGOPIKRISHNA/
2. ABOUT ME
• Gopi Krishna Nuti
• Education
• Proud alumnus of Andhra University College of Engineering, Computer Science Department
• MS in Data Science from State University of New York at Buffalo
• MBA from Amrita University
• Career
• Working in the IT industry for the past 20 years and in AI/ML/Data Science for nearly a decade
• Lead Data Scientist at Autodesk, Bangalore
• Vice President of MUST Research, a technology NGO working to bridge academia and industry in the field of AI. Working closely with multiple governments, NASSCOM, NITI Aayog, BIS etc.
• Author of Amazon Best Seller “Machine Learning for Engineers”
• Research publications and patents
3. WHAT IS AI
A wide-ranging branch of computer science concerned with building smart machines
capable of performing tasks that typically require human intelligence.
Multiple names with subtle differences: AI/ML/DL/Data Science. These can be loosely
considered to be the same.
Relies heavily on mathematics (Statistics and Linear Algebra)
Statistics? We already use them, don’t we?
Samples, populations, Standard deviations, p-values, paired t-tests,
normal distribution are very commonly used in Pharmaceutical
industry
4. SO, WHAT’S NEW WITH AI?
• AI draws heavily from the same basic principles.
• However, Statistics is used for cases with data paucity.
• Example:
• Consider a clinical trial of a drug for high-risk, advanced-stage women who are pregnant for the first time after the age of 45.
• Clinical samples are, understandably, small.
• Statistics comes to the rescue here
• ML/AI is useful when data is abundant.
5. DO I HAVE TO LEARN MATHS?
• Fortunately, NO.
• Familiarity with math is helpful. Unfamiliarity is not a deal breaker.
• Asking Pharmacologists to master the math is like expecting a film star in Celebrity Cricket League to go play in the ICC World Cup.
• Neither wrong nor impossible.
• However, such a person is likely to be a sportsman; not an actor ☺
6. WHAT CAN AI DO?
• Analyse massive amounts of data, mathematically.
• Identify hidden patterns in data and extract insights which would be humanly impractical to find.
• Predict how changes to one variable impact another
• Analyse data collected over time and identify trends
• Group similar pieces of data into a single cluster along multiple
dimensions
• Identify factors which are related to one another
• Learn from images, free form text to detect specific pieces of information
• For the most part, backed by verifiable algebraic/statistical algorithms.
7. WHAT CAN I DO WITH AI?
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7577280/
9. WHAT AI CAN (NOT) DO?
• Create a race of killer robot machines which will enslave humans???
• Yeah, AI cannot do that yet.
10. WHAT SHOULD WE LEARN IN AI
• Supervised Learning
  • Regression
  • Classification
  • Timeseries forecasting
• Unsupervised learning
  • Clustering
  • Apriori
• Deep Learning vs Statistics-based approaches
11. SUPERVISED LEARNING - REGRESSION
• Predicting a parameter based on other parameters
• y=mx+c anyone? That’s Linear Regression.
• Simplest ML algorithm
• m and c are still the slope and y-intercept, but they are estimated from many data points (least squares) rather than with the two-point formula we studied in 8th grade
• Parameter to predict = f(remaining parameters)
• Other algorithms
• Decision Trees, Random Forests, Support Vector Regression, Polynomial Regression etc.
• Use cases:
• Predict
• Solubility, Partition Coefficient (logP), degree of ionization, intrinsic
permeability
• Using
• Molecular descriptor, SMILES strings, electron density of the molecule
in the chemical space etc.
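The y=mx+c idea above can be sketched in a few lines. This is a minimal illustration, assuming the ordinary least-squares estimates for m and c; the function name fit_line and the toy numbers are made up for this example and are not from the talk.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for y = m*x + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means
    c = y_bar - m * x_bar
    return m, c

# Hypothetical toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
print(f"m = {m:.3f}, c = {c:.3f}")
```

With real measurements the points do not fall exactly on a line, and the same formulas return the line with the smallest total squared error.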
12. EXAMPLE FOR REGRESSION
• ESOL Dataset
• Properties of 1128 compounds have been tabulated
Question : Can you predict the solubility of CC(C)C(C(=O)OC(C#N)c1cccc(Oc2ccccc2)c1)c3ccc(OC(F)F)cc3 (Flucythrinate)
without an actual experiment?
Compound ID | smiles | Minimum Degree | Molecular Weight | Number of H-Bond Donors | Number of Rings | Number of Rotatable Bonds | Polar Surface Area | measured log solubility in mols per litre
Amigdalin | OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O | 1 | 457.432 | 7 | 3 | 7 | 202.32 | -0.77
Fenfuram | Cc1occc1C(=O)Nc2ccccc2 | 1 | 201.225 | 1 | 2 | 2 | 42.24 | -3.3
citral | CC(C)=CCCC(C)=CC(=O) | 1 | 152.237 | 0 | 0 | 4 | 17.07 | -2.06
Picene | c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 | 2 | 278.354 | 0 | 5 | 0 | 0 | -7.87
Thiophene | c1ccsc1 | 2 | 84.143 | 0 | 1 | 0 | 0 | -1.33
benzothiazole | c2ccc1scnc1c2 | 2 | 135.191 | 0 | 2 | 0 | 12.89 | -1.5
2,2,4,6,6'-PCB | Clc1cc(Cl)c(c(Cl)c1)c2c(Cl)cccc2Cl | 1 | 326.437 | 0 | 2 | 1 | 0 | -7.32
Estradiol | CC12CCC3C(CCc4cc(O)ccc34)C2CCC1O | 1 | 272.388 | 2 | 4 | 0 | 40.46 | -5.03
Dieldrin | ClC4=C(Cl)C5(Cl)C3C1CC(C2OC12)C3C4(Cl)C5(Cl)Cl | 1 | 380.913 | 0 | 5 | 0 | 12.53 | -6.29
Delaney, John S. "ESOL: estimating aqueous solubility directly from molecular structure." Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.
13. REGRESSION PROCEDURE
• Mathematical formulation
Solubility = f(Minimum Degree, Molecular Weight, Number of H-Bond Donors, Number of Rings, Number of Rotatable Bonds, Polar Surface Area)
• By using the Regression ML formulae,
solubility for Flucythrinate = -6.878
• Actual/Measured solubility = -6.876
• Error of just 0.002
• Time taken for predicting this value: less than 5 minutes.
• Cost – negligible
• Historical data is all that’s needed
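A rough sketch of what "Solubility = f(descriptors)" looks like in code. It is not the model behind the -6.878 figure: it assumes plain multiple linear regression, uses only the nine rows shown on the earlier slide, and keeps just four of the descriptors so that the tiny fit stays well determined. The helper names solve, fit_linear and predict are made up for this illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_linear(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * t for x, t in zip(X, targets)) for i in range(p)]
    return solve(XtX, Xty)

def predict(coef, row):
    """Apply the fitted coefficients to one row of descriptors."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

# (Molecular Weight, H-Bond Donors, Rings, Polar Surface Area) from the slide's table
descriptors = [
    (457.432, 7, 3, 202.32), (201.225, 1, 2, 42.24), (152.237, 0, 0, 17.07),
    (278.354, 0, 5, 0.0), (84.143, 0, 1, 0.0), (135.191, 0, 2, 12.89),
    (326.437, 0, 2, 0.0), (272.388, 2, 4, 40.46), (380.913, 0, 5, 12.53),
]
# Measured log solubility for the same nine compounds
solubility = [-0.77, -3.3, -2.06, -7.87, -1.33, -1.5, -7.32, -5.03, -6.29]

coef = fit_linear(descriptors, solubility)
for row, measured in zip(descriptors, solubility):
    print(f"predicted {predict(coef, row):7.3f}   measured {measured:7.3f}")
```

Nine rows are far too few for a trustworthy model; the full dataset of 1128 compounds is what makes a prediction for an unseen molecule meaningful.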
14. SOME TERMINOLOGY
• Parameter to be predicted (Solubility in the example) – Dependent Variable
• Input parameters (Minimum Degree, Molecular Weight, Number of H-Bond Donors, Number of Rings, Number of Rotatable Bonds, Polar Surface Area) – Independent variables
• Difference between predicted value and actual value – Error
• R² – The proportion of the variation in the data that the model explains. 1 is the highest possible value; 0 means the model does no better than always predicting the average.
• Error is a statistical inevitability. It can be minimized but can never be eliminated.
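The two quantities defined above, error and R², are simple arithmetic. A minimal sketch follows: the Flucythrinate numbers are the ones quoted on the previous slide, while the list of predictions is hypothetical and only there to exercise the formula.

```python
def r_squared(actual, predicted):
    """R² = 1 - (sum of squared errors) / (total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Error for the single prediction quoted in the talk
predicted_flucythrinate = -6.878
measured_flucythrinate = -6.876
error = predicted_flucythrinate - measured_flucythrinate  # about -0.002

# Measured log solubility of the first five compounds in the table
measured = [-0.77, -3.3, -2.06, -7.87, -1.33]
# Hypothetical model outputs for the same compounds (made up for illustration)
hypothetical_predictions = [-1.0, -3.0, -2.5, -7.5, -1.0]

r2 = r_squared(measured, hypothetical_predictions)
print(f"error = {error:.3f}, R2 = {r2:.3f}")
```

A perfect model gives R² = 1, and a model that always answers with the average of the measured values gives R² = 0.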
15. SUPERVISED LEARNING - CLASSIFICATION
• Used when “y” is NOT a number but a class or category.
• Examples: Yes/No, Mild/Moderate/Severe, etc.
• Algorithms are Logistic Regression, k-Nearest Neighbours, Decision Trees, Support Vector Classification, Naïve Bayes Classification etc.
• Use cases
• Predict molecules that might respond to a given biochemical assay
• Molecular properties
• Properties of novel molecules using properties of old molecules, structure of old molecules, structure of new molecule
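Of the algorithms listed above, k-Nearest Neighbours is the easiest to sketch: a new molecule gets the label that is most common among the k known molecules closest to it. The two-number descriptors and the "active"/"inactive" labels below are hypothetical toy data, and knn_predict is an illustrative name, not a library function.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical descriptors (two numbers per molecule) and assay outcomes
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]

label_a = knn_predict(train_X, train_y, (3.0, 3.0))
label_b = knn_predict(train_X, train_y, (1.0, 0.9))
print(label_a, label_b)
```

The output is a category, not a number, which is exactly what separates classification from regression.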
16. EXAMPLE FOR CLASSIFICATION
Assay measurements for 12 different toxic effects
1 – Toxic, 0 – Non-toxic, NA – Information unavailable/unknown
Compound ID | SR.HSE
NCGC00178831-03 | 0
NCGC00166114-03 | 0
NCGC00263563-01 | 0
NCGC00013058-02 | 1
NCGC00167516-01 | NA
NCGC00018301-05 | 1
NCGC00249897-01 | 1
NCGC00016000-18 | 1
Compound ID | AW | AWeight | Arto | BertzCT | Chi0 | Chi1 | Chi10 | Chi2
NCGC00178831-03 | 54367203 | 13.053 | 2.176 | 3.194 | 23.112 | 15.868 | 1.496 | 15.127
NCGC00166114-03 | 12688176.07 | 22.123 | 2.065 | 3.137 | 21.033 | 13.718 | 1.937 | 13.187
NCGC00263563-01 | 3076932.336 | 13.085 | 2.154 | 3.207 | 46.896 | 29.958 | 3.806 | 30.105
NCGC00013058-02 | 71685690.57 | 12.832 | 2.029 | 3.38 | 51.086 | 32.045 | 1.806 | 29.09
NCGC00167516-01 | 7989702.276 | 12.936 | 2.124 | 3.573 | 70.295 | 46.402 | 3.604 | 42.132
NCGC00018301-05 | 6.213 | 13.143 | 2 | 2.607 | 17.079 | 11.041 | 0.286 | 9.157
NCGC00249897-01 | 2.773 | 28.889 | 2.167 | 2.561 | 8.715 | 5.698 | 0.037 | 5.368
NCGC00016000-18 | 4.183 | 17.275 | 1.875 | 2.303 | 12.552 | 7.599 | 0 | 5.685
Chemical structures of molecules
800 properties of the molecule are available
If a new molecule’s properties are provided, can you predict whether the molecule is toxic for SR.HSE?
Example is from Tox21 Dataset
https://www.bioinf.jku.at/research/DeepTox/tox21.html
[Mayr2016] Mayr, A., Klambauer, G., Unterthiner, T., & Hochreiter, S. (2016). DeepTox: Toxicity Prediction using Deep Learning. Frontiers in Environmental Science, 3:80.
[Huang2016] Huang, R., Xia, M., Nguyen, D. T., Zhao, T., Sakamuru, S., Zhao, J., Shahane, S., Rossoshek, A., & Simeonov, A. (2016). Tox21 Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Frontiers in Environmental Science, 3:85.
17. SOME TERMINOLOGY
• Parameter to be predicted (SR.HSE) – Dependent Variable
• Input parameters (AW, AWeight, Arto, BertzCT, Chi0, Chi1, Chi10, Chi2) – Independent variables
• Error is a statistical inevitability. It can be minimized but can never be eliminated.
• A numeric error cannot be calculated for categories, because “Severe – Mild = Moderate” is meaningless
• Instead, counts of True Positives, True Negatives, False Positives and False Negatives are calculated.
• Metrics used are Specificity, Sensitivity, Accuracy, Precision, Recall, Area Under Curve. All these are derived from the above 4.
• The standard example used to explain false positives to newbies actually comes from biological sciences.
Best example of False Positive: A pregnancy kit confirms the pregnancy of a male human.
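The four counts and the metrics built on them can be worked out by hand or with a short script. In this sketch the actual labels are the SR.HSE column from the example slide (NA stored as None and skipped); the predicted labels are hypothetical, invented only to make the arithmetic visible.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN, skipping rows whose actual label is unknown (None)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a is None:
            continue  # NA in the dataset: nothing to compare against
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1  # predicted toxic, actually non-toxic
        else:
            fn += 1  # predicted non-toxic, actually toxic
    return tp, tn, fp, fn

actual = [0, 0, 0, 1, None, 1, 1, 1]   # SR.HSE labels from the slide
predicted = [0, 1, 0, 1, 1, 1, 0, 1]   # hypothetical model output

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
print(tp, tn, fp, fn, accuracy, sensitivity, specificity, precision)
```

In the pregnancy kit example, the false positive is the single case counted in fp.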
18. SUPERVISED LEARNING – FORECASTING
• Also called Time series analysis
• Useful when behaviour patterns need to be identified over a period of time.
• Classic example: Stock market price prediction. Today’s opening price is dependent on yesterday’s closing price.
• Algorithms
• ARIMA, SARIMA, Holt-Winters etc.
• Use cases
• Patient Churn rate during experimentation stages. Based
on historical data, can we predict when a volunteer will
drop off the sample?
• Can we identify a seasonality and/or trend in the
resurgence of Dengue, Covid etc?
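One simple way to check for seasonality, before reaching for ARIMA or SARIMA, is autocorrelation: compare the series with a copy of itself shifted by k time steps and see which shift matches best. The sketch below assumes monthly case counts and uses a synthetic series with a built-in 12-month cycle; the numbers are not real Dengue or Covid data, and dominant_period is an illustrative helper name.

```python
import math

def autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

def dominant_period(series, min_lag=2, max_lag=None):
    """Return the lag with the strongest autocorrelation in [min_lag, max_lag]."""
    if max_lag is None:
        max_lag = len(series) // 2
    return max(range(min_lag, max_lag + 1), key=lambda k: autocorrelation(series, k))

# Synthetic monthly case counts: 10 years with a yearly cycle (illustration only)
monthly_cases = [100 + 40 * math.sin(2 * math.pi * t / 12) for t in range(120)]

period = dominant_period(monthly_cases)
print("Strongest repeat every", period, "months")
```

Real surveillance data adds trend and noise on top of the cycle, which is where the forecasting algorithms named above come in.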
19. IS THERE A SEASONALITY HERE?
• If so, can we predict when the
next season will begin?
• Immense Benefits to healthcare
21. UNSUPERVISED LEARNING
• We are not predicting anything here. We are simply trying to understand the data better, i.e. there is no y=mx+c like relationship in the data
• Identify hidden patterns in the data
• Use cases
• Drug repurposing
• Therapeutic efficacy of drugs and target proteins of known
and unknown pharmaceuticals
• Interpreting the molecular mechanism of chemicals
• Which molecules are similar to one another? To which cluster can I assign this molecule?
• Assumes that similar molecules react in similar ways
with a protein
• Patient screening/selection
• Group patients together based on previously unknown
similarities
• Precision Medicine
22. SOME ALGORITHM DETAILS
• Clustering is a data mining technique
• used to group data based on similarities or patterns in data
• All observations in a cluster are similar to one another.
• Clusters are different from each other
• Association rule
• used to find the relationship among items in data sets
• Misleading if used on small datasets
23. EXAMPLE OF CLUSTERING
After applying the k-means clustering algorithm to identify 4 clusters, the centroids of these clusters were found, and the clusters can be described as follows:
• Cluster 1 is dominated by elderly patients aged 71 to 97 years, who came from Lens Poly, General Poly, and Refraction Poly.
• Cluster 2 is also dominated by elderly patients, but in a different age range of 52 to 68 years and from more diverse polys such as Lens Poly, General, EED, Glaucoma and Refraction Poly.
• Cluster 3 is dominated by patients aged 31 to 49 years, most of them coming from EED Poly.
• Cluster 4 is dominated by children up to adolescents aged 1 to 30 years, who come from diverse poly areas such as Pediatric Poly and EED.
https://iopscience.iop.org/article/10.1088/1742-6596/1196/1/012051/pdf
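To give a feel for how k-means arrives at such groups, here is a minimal one-dimensional sketch that clusters patients by age alone. The ages are hypothetical (not the data from the cited paper), and the evenly spread starting centroids are a simplification chosen so the result is repeatable; common implementations start from random points.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster numbers into k groups by repeatedly moving centroids to group means."""
    ordered = sorted(values)
    # Deterministic start: k values spread evenly across the sorted data
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its members
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical patient ages
ages = [5, 12, 18, 25, 33, 38, 45, 49, 52, 58, 63, 68, 71, 80, 88, 97]
centroids, clusters = kmeans_1d(ages, k=4)
for centre, members in zip(centroids, clusters):
    print(f"centroid {centre:5.1f}: {members}")
```

Nobody tells the algorithm where the age bands are; the groups emerge from the data, which is why this is called unsupervised learning.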
24. HYPOTHETICAL EXAMPLE OF CLUSTERING
https://iopscience.iop.org/article/10.1088/1742-6596/1196/1/012051/pdf
Patient Id | Blood group | Year of Birth | Smoking | Drinking | Narcotics | Promiscuity | Age of onset of disease | History of foreign travel | Hereditary
A | A+ | | Y | N | N | N | 72 | Y | Y
B | B- | | Y | N | Y | N | 68 | Y | N
C | O+ | | N | Y | N | Y | 45 | Y | Y
D | O+ | | N | N | N | N | 82 | N | Y
E | B+ | | Y | N | N | Y | 32 | N | N
F | Bombay blood group | | N | Y | Y | N | 65 | N | N
• Imagine a cluster being identified with below characteristics
• Cluster of cases where Blood group = O+, Year of birth is before 1947, Smoking=Y, Drinking=Y, Narcotics=Y, Promiscuity=Y, History of Foreign Travel=N, Hereditary=N and Age of onset > 80
• If this cluster is < 0.05% of cases, then we can statistically ignore it.
• If this cluster is > 80% of cases then, we can perhaps say that the disease is prevalent among a particular section of
population.
25. APRIORI
• Used to find relationship between
items in a database.
• Famous example of beer and diapers in
a supermarket
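The beer and diapers story comes down to two numbers: support (how often items appear together) and confidence (how often the second item appears when the first one does). The sketch below is a small Apriori-style search over hypothetical shopping baskets; the baskets, the 0.4 threshold and the function names are made up for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Grow itemsets one item at a time, keeping only those that are frequent."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    current = [frozenset([i]) for i in items]
    while current:
        survivors = [s for s in current if support(s, transactions) >= min_support]
        if not survivors:
            break
        for s in survivors:
            frequent[s] = support(s, transactions)
        size = len(survivors[0]) + 1
        candidates = {a | b for a, b in combinations(survivors, 2) if len(a | b) == size}
        # Apriori pruning: a candidate survives only if all its subsets are frequent
        current = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
        ]
    return frequent

# Hypothetical supermarket baskets
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers", "bread"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.4)
pair = frozenset({"beer", "diapers"})
# Confidence of the rule beer -> diapers
confidence = frequent[pair] / frequent[frozenset({"beer"})]
print(f"support = {frequent[pair]:.2f}, confidence = {confidence:.2f}")
```

The same counting works when the "items" are molecular properties or patient attributes instead of groceries.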
26. HYPOTHETICAL EXAMPLE FOR A-PRIORI
• Chemical structures of molecule are quantified and tabulated
• Solubility is measured and documented.
Question : How many items with Minimum Degree of 3 and exactly 2 H-Bond donors also have 2 rotatable bonds?
Compound ID | smiles | Minimum Degree | Molecular Weight | Number of H-Bond Donors | Number of Rings | Number of Rotatable Bonds | Polar Surface Area | measured log solubility in mols per litre
Amigdalin | OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O | 1 | 457.432 | 7 | 3 | 7 | 202.32 | -0.77
Fenfuram | Cc1occc1C(=O)Nc2ccccc2 | 1 | 201.225 | 1 | 2 | 2 | 42.24 | -3.3
citral | CC(C)=CCCC(C)=CC(=O) | 1 | 152.237 | 0 | 0 | 4 | 17.07 | -2.06
Picene | c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 | 2 | 278.354 | 0 | 5 | 0 | 0 | -7.87
Thiophene | c1ccsc1 | 2 | 84.143 | 0 | 1 | 0 | 0 | -1.33
benzothiazole | c2ccc1scnc1c2 | 2 | 135.191 | 0 | 2 | 0 | 12.89 | -1.5
2,2,4,6,6'-PCB | Clc1cc(Cl)c(c(Cl)c1)c2c(Cl)cccc2Cl | 1 | 326.437 | 0 | 2 | 1 | 0 | -7.32
Estradiol | CC12CCC3C(CCc4cc(O)ccc34)C2CCC1O | 1 | 272.388 | 2 | 4 | 0 | 40.46 | -5.03
Dieldrin | ClC4=C(Cl)C5(Cl)C3C1CC(C2OC12)C3C4(Cl)C5(Cl)Cl | 1 | 380.913 | 0 | 5 | 0 | 12.53 | -6.29
27. DEEP LEARNING
• Performs the same activities i.e. Regression, Classification, Forecasting etc.
• Relies on linear algebra rather than statistical methods.
• Training data requirements are much higher
• Computational complexity is much higher
• Results are often more accurate, i.e. error is lower, provided enough training data is available.
28. COMPUTER VISION
• Processing images to identify information from them
• Examples
• Automatically analysing x-ray images
• Diabetic Retinopathy
• Identifying compliance to drug intake during clinical
trials
29. NATURAL LANGUAGE PROCESSING
• Reading free-form text to identify information
• Examples:
• Automatically read prescriptions
• Analyse medical reports to extract information in
actionable manner
• Patient A reported reduced pain in 30 minutes after
administering 10 mg of the drug. Patient B reported
reduced pain in 15 minutes after administering 15 mg.
Name | Dosage (mg) | Time (minutes)
A | 10 | 30
B | 15 | 15
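For sentences that follow a fixed wording like the two above, the text-to-table step can be sketched with a plain regular expression. This is only a toy: real prescriptions and medical reports vary far too much for one pattern, which is where trained NLP models come in. The pattern and the extract_records name are assumptions made for this illustration.

```python
import re

# Pattern matched to the exact wording of the example sentences
PATTERN = re.compile(
    r"Patient (?P<name>\w+) reported reduced pain in (?P<time>\d+) minutes "
    r"after administering (?P<dosage>\d+) mg"
)

def extract_records(text):
    """Turn matching sentences into rows with Name, Dosage (mg) and Time (minutes)."""
    return [
        {"Name": m["name"], "Dosage": int(m["dosage"]), "Time": int(m["time"])}
        for m in PATTERN.finditer(text)
    ]

report = (
    "Patient A reported reduced pain in 30 minutes after administering 10 mg "
    "of the drug. Patient B reported reduced pain in 15 minutes after "
    "administering 15 mg."
)

records = extract_records(report)
for row in records:
    print(row["Name"], row["Dosage"], row["Time"])
```

Once the text is in rows and columns like this, the regression and classification techniques from the earlier slides can be applied to it.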
30. PROBLEMS WITH AI
• Heavily dependent on the data that is fed to it. Garbage in, Garbage out.
• In some cases, too much data is also a problem.
• Explainability is still a Work In Progress for AI scientists.
31. CAN AI REPLACE HUMANS?
Yeah, No. AI is not so intelligent.
AI is very very good at searching for answers in data.
Human intellect is in asking the right questions.
AI is neither artificial nor intelligent. It is made from
natural resources and it is people who are performing the
tasks to make the systems appear autonomous.
- Kate Crawford, Senior principal researcher at Microsoft
Research.
32. HOW DOES IT ALL WORK TOGETHER
• Data is increasingly digitized and stored in databases.
• IoT devices are capturing information in clinical studies, manufacturing, biometry data
• Once data is available in a table, ML can start working
33. FURTHER READING
• Should I learn mathematics?
• Should I learn programming?
• Where can I learn more?
• Machine Learning for Engineers https://www.amazon.in/dp/9389024870/ref=cm_sw_em_r_mt_dp_M4SYKWQ234BAKR800NKZ
• https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7577280/
• https://pubmed.ncbi.nlm.nih.gov/30472429/
34. QUESTIONS AND ANSWERS
• Thank You!
• For further discussions
• Ngopikrishna.public@gmail.com
• +91-9036005121