PHARMACOGENOMIC DATA MINING
with Hierarchical Clustering Algorithms
Ohene Z. Frank
CSC 576 Data Warehousing and Mining
Final Report
2. Frank | PGX Data Mining 1
PHARMACOGENOMIC DATA MINING WITH HIERARCHICAL CLUSTERING ALGORITHMS
Designer drugs, individualized drugs and personalized medicine are a few of the
buzzwords proliferating on the biotech information superhighway, widely used by
pharmaceutical scientists, clinical scientists, researchers and medical
humanitarians when referring to pharmacogenomics. Malorye Branca of Bio-IT
World stated, “One of the most seductive lures of the genomic revolution is the
promise of personalized medicine” [4]. Pharmacogenomics is the study of how one’s
genetic makeup affects the body’s response to drugs, and hence sits at the
intersection of genetics, pharmacodynamics and pharmacokinetics. Pharmacogenetics
is widely used synonymously with pharmacogenomics. Conceptually, the two terms
are interchangeable, but from a purist view, pharmacogenomics is the technology
whereas pharmacogenetics is the science. Genaissance Pharmaceuticals defined
pharmacogenomics as the application of genome science (genomics) to the study
of human variability in drug response.
So, what’s the real tumult? In the United States, there are at least 100,000 deaths
annually due to adverse reactions (side effects) to prescription drugs. Moreover,
millions of people are being treated with drugs that are ineffective or have very little
pharmacological effect: beta-blockers given to reduce blood pressure are
ineffective in one-third of patients, and many antidepressants are ineffective in
half of the people who take them [1].
The culpability for the lack of efficacy and intolerance of many drugs lies mainly with
our genes, which help to determine the way in which our body reacts to, absorbs,
distributes, metabolizes and excretes drugs. Small genetic variations between
people (known as polymorphisms) can alter the behavior of proteins that carry a
drug to its target cells or tissues, neutralize the enzymes that activate a drug or aid in
the excretion process, or alter the structure of the receptor to which a drug is
supposed to bind [1]. Variation in immune-system genes can also influence how
particular drugs are tolerated. These slight genetic variations mean that the dose at
which a drug will work may vary hugely from person to person; hence, one-size-fits-all
drug development and prescribing can lead to life-threatening adverse reactions to
a drug or, in some cases, death.
On the right path forward, the genomics revolution has given us the tools to identify
people who don't fit the standard prescribing mold. Genomics is the use of
high-throughput molecular biology technologies to study large numbers of genes and
gene products simultaneously in whole cells, whole tissues, or whole organisms [2].
The genome is all of the genetic material in a cell or an organism. According to the
U.S. Department of Energy, the genome is an organism’s complete set of DNA. In
the human genome, DNA is arranged into 24 distinct chromosomes, which are
physically separate molecules that range in length from about 50 million to 250
million base pairs [3]. Each chromosome is a single, very long molecule of
double-helical DNA (as illustrated in Figure 1).
Figure 1¹: Illustration of a chromosome replicating its DNA before a cell divides.
Single nucleotide polymorphisms (SNPs) are single-letter variations in the genetic
code that are scattered throughout the genome. Most SNPs are benign, with
absolutely no effect on gene structure or expression; however, a subset of these
variations provides crucial links to disease-causing genes, either because they
directly alter a gene's activity or aid in pinpointing the location of a disease-related
gene [1].
¹ Figure courtesy of Genaissance Pharmaceuticals, Inc.
The abundance of SNPs and the ease with which they can be identified make them
ideal biomarkers for clinical studies. SNPs are also found in genes for drug-metabolizing
enzymes, influencing individuals' ability to process a drug properly.
Many companies have compiled large collections of SNPs with the intention of
developing diagnostic and prognostic tests, as well as to guide the development of
a new generation of drugs that would target genetically determined subsets of
patients [1]. All in all, this type of genomic technology, which aims to identify the best
possible medications for individuals while maximizing efficacy and minimizing toxicity,
is known as pharmacogenomics.
Due to the gravity and promise of pharmacogenomics, several genomics companies
are manufacturing DNA microarrays to identify common SNPs that influence the
activity of various enzymes. Ultimately, these chips could help to
prevent life-threatening reactions to drugs, identify appropriate drug doses, and
select the right drug combination (or concomitant medications) for
patients with complex conditions.
To bring this to fulfillment at a faster pace, one can apply data mining
techniques, in particular hierarchical clustering algorithms, to a clinical data
warehouse that contains both clinical trials data and genomic data (anonymized
genotyping and microarray data).
The data mining technique most widely utilized for the analysis of gene expression
data is hierarchical clustering. This type of clustering algorithm has the advantage
of being relatively simple, and its results can be easily visualized. Hierarchical
clustering is an agglomerative approach in which single expression profiles are
joined to form groups, which are further joined until the process has been carried to
completion, forming a single hierarchical tree [5].
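The joining procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation taken from [5]: the function names, the choice of average linkage as the default, and the four example profiles are all assumptions made for the demonstration.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def average_linkage(a, b):
    """Mean pairwise distance between members of clusters a and b (assumed default)."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerate(profiles, linkage=average_linkage):
    """Join the two closest clusters repeatedly until a single tree remains.

    Returns the merge history as (left_members, right_members, distance) tuples.
    """
    # Start with every expression profile in its own cluster.
    clusters = [[tuple(p)] for p in profiles]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance (i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        d = linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two joined clusters with their union (pop j first, since j > i).
        clusters.pop(j)
        clusters.pop(i)
        clusters.append(merged)
    return merges


# Hypothetical log2-ratio profiles for four genes across three samples.
profiles = [(0.0, 0.1, 0.0), (0.1, 0.0, 0.1), (2.0, 2.1, 1.9), (2.1, 2.0, 2.0)]
history = agglomerate(profiles)
for left, right, d in history:
    print(len(left), "+", len(right), "joined at distance", round(d, 3))
```

With n profiles the loop always performs n - 1 joins, and the recorded history is what a dendrogram would draw. The exhaustive pair search is deliberately naive and would be too slow for a real microarray data set.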
There are six main hierarchical clustering algorithms (single-linkage, complete-
linkage, average-linkage, weighted pair-group average, within-groups and Ward’s
method) that can be applied to gene expression profiling (microarray) data analysis.
These clustering algorithms differ in how distances are
calculated between the growing clusters and the remaining members (including
other clusters) in the data set [5].
Single-linkage Clustering: This method is also referred to as the minimum, or
nearest-neighbor method. The distance between two clusters, x and y, is
calculated as the minimum distance between a member of cluster x and a
member of cluster y. This method tends to produce “loose” clusters that can
be joined, if any two members are close together. This method often results in
sequential addition of single samples to an existing cluster, which in turn,
produces trees with many long, single-addition branches representing clusters
that have grown by accumulation.
Complete-linkage Clustering: This method is also referred to as the maximum
or furthest-neighbor method. The distance between two clusters is calculated
as the greatest distance between members of the relevant clusters. This
method tends to produce very compact clusters of elements and the clusters
are often very similar in size.
Average-linkage Clustering: This method is also referred to as the unweighted
pair-group method with arithmetic mean (UPGMA). The average distance is
calculated from the distances between each point in one cluster and all points
in another cluster. The two clusters with the lowest average distance are joined
together to form a new cluster.
Weighted Pair-group Average: This method is identical to average-linkage
clustering (as described above), except that the size of the respective clusters
is used as a weight in the computations. This method should be used when
the cluster sizes are suspected to be greatly uneven.
Within-groups Clustering: This method is also similar to average-linkage
clustering, except that the clusters are merged and a cluster average is used for
further calculations instead of the individual cluster elements. This method
tends to produce tighter clusters than average-linkage clustering.
Ward's Method: In this method, clusters are determined by calculating the
total sum of squared deviations from the mean of each cluster and joining the
pair of clusters whose merger produces the smallest possible increase in the
sum of squared errors.
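To make the differences between these distance calculations concrete, the following sketch computes the single-linkage, complete-linkage and average-linkage distances, along with the increase in the sum of squared errors used by Ward's method, for one pair of clusters. The function names and the example points are hypothetical, and Euclidean distance is assumed throughout. The weighted pair-group and within-groups variants are left out because the descriptions above do not pin down their formulas.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_linkage(x, y):
    """Minimum distance between a member of x and a member of y."""
    return min(dist(p, q) for p in x for q in y)


def complete_linkage(x, y):
    """Greatest distance between a member of x and a member of y."""
    return max(dist(p, q) for p in x for q in y)


def average_linkage(x, y):
    """Mean of all pairwise distances between members of x and members of y."""
    return sum(dist(p, q) for p in x for q in y) / (len(x) * len(y))


def sum_squared_error(cluster):
    """Total squared deviation of the members from the cluster mean."""
    n = len(cluster)
    mean = tuple(sum(coord) / n for coord in zip(*cluster))
    return sum(dist(p, mean) ** 2 for p in cluster)


def ward_increase(x, y):
    """Increase in the sum of squared errors if x and y were joined."""
    return sum_squared_error(x + y) - sum_squared_error(x) - sum_squared_error(y)


# Two hypothetical clusters of 2-D points, chosen only for illustration.
x = [(0.0, 0.0), (0.0, 2.0)]
y = [(3.0, 0.0), (3.0, 4.0)]

print("single  :", single_linkage(x, y))
print("complete:", complete_linkage(x, y))
print("average :", round(average_linkage(x, y), 3))
print("ward    :", round(ward_increase(x, y), 3))
```

For any pair of clusters the single-linkage distance is never larger than the average-linkage distance, which in turn is never larger than the complete-linkage distance. This ordering is one way to see why single linkage joins clusters more readily and complete linkage yields more compact ones.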
Figure 3²: Hierarchical Clustering Demonstration
Figure 3 shows gene expression data subjected to average-linkage,
complete-linkage and single-linkage hierarchical clustering using a
Euclidean distance metric, with gene-expression families (A–J) color coded
for comparison. Genes that are up-regulated appear in red, and those that are
down-regulated appear in green, with the relative log2(ratio) reflected by the
intensity of the color [5].
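As a small worked example of the two quantities mentioned in the caption discussion, the sketch below converts raw intensities into log2(ratio) values and then measures the Euclidean distance between two expression profiles. The intensity values, the reference channel and the function names are invented for illustration and are not taken from [5].

```python
from math import log2, sqrt


def log2_ratios(sample, reference):
    """Convert paired intensities to log2(sample / reference) values."""
    return [log2(s / r) for s, r in zip(sample, reference)]


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# Hypothetical intensities for two genes measured under three conditions.
reference = [100.0, 100.0, 100.0]
gene_a = log2_ratios([400.0, 200.0, 100.0], reference)  # up-regulated, then unchanged
gene_b = log2_ratios([25.0, 50.0, 100.0], reference)    # down-regulated, then unchanged

print(gene_a)  # positive values correspond to the red end of the color scale
print(gene_b)  # negative values correspond to the green end of the color scale
print(euclidean(gene_a, gene_b))
```

Using the log2 scale makes a fourfold increase (+2) and a fourfold decrease (-2) symmetric around zero, so up-regulation and down-regulation contribute equally to the distance between profiles.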
² Courtesy of Nature Reviews, Nature Publishing Group
The aim and allure of pharmacogenomic data mining is to discover knowledge
from a clinical genomic data warehouse (comprising both genomic and clinical
data), in order to identify and prescribe the most effective and least toxic drug for
an individual based on the person’s genetic makeup and the targeted disease.
References
[1] Abbott, A., Nature 425, 760-762 (23 October 2003).
[2] Genaissance Pharmaceuticals, Inc., Online Glossary (2004).
[3] US Department of Energy, Human Genome Information Project,
Pharmacogenomics (2004).
[4] Branca, M., The New, New Pharmacogenomics, Bio-IT World (Sept. 9, 2002).
[5] Quackenbush, J., Nature Reviews Genetics 2, 418-427 (2001).
[6] Brown, M., Essentials of Medical Genomics, 163-198 (2003).
[7] Hollinger, M.A., Introduction to Pharmacology 2, 288-290 (2003).