This presentation begins with preliminary information about big data, mainly metagenomic data, and discusses the hurdles in analyzing it using conventional approaches. The later part gives a brief introduction to machine learning approaches, with a biological example for each. The last part presents the work done so far, with special focus on the implementation of a machine learning approach, Random Forest, for the functional annotation and taxonomic classification of metagenomic data.
1. Computational Analysis of High-throughput
Biological Data Using
Machine Learning Approaches
Ashok K Sharma
1220104
MetaInformatics Laboratory
IISER Bhopal
2. Topics Covered in the Talk
Introduction
Beginning of Genomics
Current Sequencing Scenario
Metagenomic Approaches
The Conventional Approach for Data Analysis- Limitations
Machine Learning Approaches and their Implementations
SVM
HMM
Naive Bayes
Random Forest
Discuss the Work Done So Far
Future Directions
3. Beginning of Genomics
DNA was first isolated by the Swiss physician Friedrich Miescher in 1869
The term "genome" was coined by the German botanist Hans Winkler in 1920
The history of modern genomics began in the 1970s
Nucleotide sequence of bacteriophage lambda DNA (~48 kb)
F. Sanger et al., J Mol Biol, 1982
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd (~1,800 kb)
Fleischmann RD et al., Science, 1995
Sequencing and analysis of the human genome (3 billion bp, 3 billion USD, 10 years)
ES Lander et al., Nature, 2001
4. Next Generation Sequencing Technologies Leading to the Sequencing Era
Sequencer     Read length   Cost/Mb   Data/run
Roche 454     400-800 bp    $20       450 Mb
Ion Torrent   200 bp        $2        10 Mb-1 Gb
Illumina      150 bp        $0.50     600 Gb
PacBio SMRT   ~20 kb        $1.40     350 Mb
(Images: Roche 454, Ion Torrent and Illumina/Solexa sequencers)
5. Metagenomics: A New Approach to Sequence the Unknown
• The idea of cloning DNA directly from environmental samples was first proposed by Pace in 1985
• The term "metagenome" was coined by Handelsman in 1998
The first large-scale metagenomics project: environmental genome shotgun sequencing of the Sargasso Sea (1.6 Gb of sequence, 1.2 million genes)
C. Venter et al., Science, 2004
The first large-scale organismal study: a model study comparing the gut flora of 124 European individuals (576.7 Gb of sequence, 3.3 million genes)
Qin et al., Nature, 2010
98% of bacteria cannot be cultured and hence cannot be sequenced
6. Genomics and Metagenomics Have Exponentially Increased the Sequence Databases
(Figures: growth of GenBank, 1984-2013, rising to ~180 million sequences; cost per human genome falling from $100M towards $1,000; number of papers on "Metagenomics" published in PubMed rising)
Running projects (https://gold.jgi-psf.org/):
• Metagenomic: 538
• Non-Metagenomic: 18,787
10. Genomics vs Metagenomics
GENOMICS: culture a single microbe → DNA isolation → fragmentation of DNA → sequencing → assembly → analysis
METAGENOMICS: community of microbial species, mainly unculturable → DNA isolation → fragmentation of DNA → sequencing → assembly → analysis
(Figure: example read fragments, e.g. ...GGATCCATCGTACCGATTC..., for each workflow)
The Metagenomic Challenges
• Assembly
• Taxonomic Assignment
• Gene Prediction
• Functional Annotation
• Metabolic Pathway Construction
• Comparative Analysis
11. Flow of Presentation
Introduction
Beginning of Genomics
Sequencing era
Metagenomics
The Conventional Approach for Data Analysis- Limitations
Machine learning approaches and their implementations
SVM
HMM
Naive Bayes
Random Forest
Work done
Future directions
12. Conventional Methods Cannot be Used for Metagenomic Data Analysis
• Homology-based approach: BLAST, the most widely used by researchers
• Dynamic programming is used: each query sequence is fragmented into seeds and searched against all sequences of the database (NCBI NR: 4.7 million sequences)
• BLAST of 1,000 genes against the NR database takes ~1 day (25.5 hrs) to ~2 days (47.1 hrs)
• It would take about 17 years on a Xeon 2.6 GHz PC to carry out the BLAST of the >3 million metagenomic genes from one project
(Figure: growth of the NCBI NR database, from <1 GB through ~4, ~10 and ~13 GB across 2012-2014, towards ~17 GB and beyond in the future)
13. Flow of Presentation
Introduction
Beginning of Genomics
Sequencing era
Metagenomics
The Conventional Approach
Machine learning approaches and their implementations
SVM
HMM
Naive Bayes
Random Forest
Work done
Future directions
14. Machine Learning: A Valuable Alternative
Key idea: learn from known data and predict on unknown data
Instead of searching one query against all 4.7 million database sequences and processing them all at once, a machine learning model memorizes (learns) the information in the database
• Learning: from known examples or data
• Hypothesis: derives a hypothesis based on the training examples
• Prediction: based on the hypothesis, predictions are made on unknown queries
16. Properties of Training Examples
• Training dataset: well curated and free from noise
• Features: fixed-length patterns, e.g. the overlapping dipeptides of a protein sequence
MKWMPFVGTMPLVQTKSITDLCAPLC → MK, KW, WM, MP, ...
(Figure: matrix of per-sequence feature values, e.g. M 0.12, I 0.28, W 0.24, ...)
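As a sketch of how such fixed-length features could be derived, the snippet below computes the overlapping-dipeptide frequencies of the example sequence in pure Python (the function name and layout are my own; the slide does not prescribe an implementation):

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dipeptide_frequency(seq):
    """Frequency of each of the 400 possible dipeptides
    (overlapping 2-mers) in a protein sequence."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    return {a + b: counts[a + b] / total
            for a, b in product(AMINO_ACIDS, repeat=2)}

freq = dipeptide_frequency("MKWMPFVGTMPLVQTKSITDLCAPLC")
# "MP" occurs twice among the 25 overlapping pairs
print(freq["MP"])  # → 0.08
```

Every sequence, whatever its length, is thus mapped to a fixed-length vector of 400 values.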
17. Support Vector Machines (SVM)
An SVM finds the maximal margin that separates the two classes (Class 1 and Class 2) in the feature space (X1, X2)
18. Support Vector Machines (SVM)
When the data are not linearly separable in (X1, X2), a kernel function maps the points into a higher-dimensional feature space (X1, X2, X3) where separability improves
Kernels: linear, polynomial (d = 2), Gaussian (sigma = 1)
(Ben-Hur et al., 2008)
19.
20. Classification of a query sequence into Carbohydrate Metabolism, Amino acid metabolism or Nucleotide metabolism
Query sequence → feature extraction:
• Dipeptide frequency (AD: 0.08, RH: 0.02)
• Amino acid composition (A: 0.14, F: 0.05)
The features are mapped into a high-dimensional feature space and scored by Model 1, Model 2 and Model 3 (one per class), giving prediction values 1, 2 and 3
Classification is based on the maximum prediction value
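A minimal sketch of the "maximum prediction value" decision, with made-up linear scorers standing in for the trained per-class SVM models (all weights and class assignments below are illustrative, not from the slides):

```python
# Hypothetical per-class models: each maps a feature vector to a
# prediction value (a real pipeline would use trained SVMs here).
def make_model(weights):
    return lambda features: sum(w * f for w, f in zip(weights, features))

models = {
    "Carbohydrate Metabolism": make_model([0.9, 0.1]),
    "Amino acid metabolism":   make_model([0.2, 0.7]),
    "Nucleotide metabolism":   make_model([0.4, 0.4]),
}

def classify(features):
    """Score the query with every model and pick the class whose
    model gives the maximum prediction value."""
    scores = {name: model(features) for name, model in models.items()}
    return max(scores, key=scores.get)

print(classify([0.1, 0.8]))  # → Amino acid metabolism
```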
21. GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors
Bhasin M and Raghava GPS, Nucl. Acids Res. 2004;32:W383-W389
G-protein coupled receptors (GPCRs) are important targets for drug design
Dipeptide frequency is used as the feature
Protein sequence → SVM ("Is it a GPCR?") → GPCR recognition → SVMs for family prediction → SVMs for sub-family prediction
99.5% accuracy
22. Hidden Markov Models (HMM)
• A powerful statistical tool widely used in modeling sequences
• Markov chains answer questions such as "What is the probability of rain today?" given the previous state
• Profiles are built from a multiple sequence alignment:
AYTGGGTACC
AYT-GGTMCC
AYCGGG-MC-
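The "probability of rain today" example can be made concrete as a first-order Markov chain whose transition probabilities are estimated from an observed sequence of states (the weather sequence below is made up purely for illustration):

```python
from collections import Counter, defaultdict

def transition_probs(states):
    """Estimate first-order Markov transition probabilities
    from an observed state sequence."""
    counts = defaultdict(Counter)
    for prev, curr in zip(states, states[1:]):
        counts[prev][curr] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in counts.items()}

weather = ["sun", "sun", "rain", "rain", "sun", "rain", "sun", "sun"]
p = transition_probs(weather)
# probability of rain today given sun yesterday
print(p["sun"]["rain"])  # → 0.5
```

An HMM adds hidden states behind such a chain; profile HMMs, as used by Pfam below, estimate analogous per-column probabilities from an alignment.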
23. HMM-based classification: a query sequence is searched against an HMM profile database (Carbohydrate Metabolism, Amino acid metabolism, Nucleotide metabolism) and the prediction is based on the best profile match
24. The Pfam Protein Families Database
• A large collection of protein families, each represented by multiple
sequence alignments and hidden Markov models (HMMs)
• Identification of domains that occur within proteins can provide insights
into their function
Steps used for building of Pfam:
Manually curated collection of protein families (3,071 families)
Each curated family is represented by seed and full alignment
Building HMM profiles using HMMER3.0
Widely used for identification of protein structure and function
Marco Punta et al., Nucleic Acids Res, 2011
25. Naive Bayes Classifier
1. A simple probabilistic classifier based on Bayes' theorem, with strong independence assumptions between the features
2. The goal is to determine the most probable hypothesis, combining the prior probability of the class with the likelihood of X given that class
(Figure: two classes, Class 1 and Class 2, in feature space X1, X2)
Kohenen J. et al., In Silico Biol 2009;9(1-2):23-34
26. Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy
Qiong Wang et al., Appl Environ Microbiol, 2007
Algorithm (word sizes between 6 and 9 bases):
• Word-specific priors: Pi = [n(wi) + 0.5]/(N + 1)
• Genus-specific conditional probabilities: P(wi|G) = [m(wi) + Pi]/(M + 1)
• Naive Bayesian assignment: P(G|S) = P(S|G) * P(G)/P(S)
• Bootstrap confidence estimation for each query sequence
Overlapping 8-base words are extracted from the query with a sliding window:
AUGCGUCAGCUCGAUCGAUCUA
AUGCGUCA
UGCGUCAG
GCGUCAGC
CGUCAGCU
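The word extraction and the two probability formulas above can be sketched directly; the counts in the usage lines are hypothetical, chosen only to exercise the formulas:

```python
def sliding_words(seq, k=8):
    """Overlapping k-mer words of a sequence (k = 8 as on the slide)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def word_prior(n_wi, N):
    """Word-specific prior Pi = [n(wi) + 0.5]/(N + 1), where n(wi) is
    the number of training sequences containing word wi and N is the
    total number of training sequences."""
    return (n_wi + 0.5) / (N + 1)

def genus_conditional(m_wi, M, prior):
    """P(wi|G) = [m(wi) + Pi]/(M + 1), where m(wi) of the M training
    sequences of genus G contain word wi."""
    return (m_wi + prior) / (M + 1)

words = sliding_words("AUGCGUCAGCUCGAUCGAUCUA")
print(words[:4])  # → ['AUGCGUCA', 'UGCGUCAG', 'GCGUCAGC', 'CGUCAGCU']
pi = word_prior(n_wi=40, N=99)            # hypothetical counts
print(genus_conditional(m_wi=3, M=9, prior=pi))  # → 0.3405
```

Multiplying P(wi|G) over the query's words (in log space, in practice) gives P(S|G) for the naive Bayesian assignment.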
27. Classification and Regression Trees (CART)
(Figure: a binary decision tree asking threshold questions X1 > C1, X2 > C2, X1 > C3 and X2 > C4 at successive nodes, each with Yes/No branches ending in leaves labeled class 1 or class 2; the cut-offs C1-C4 carve the (X1, X2) plane into rectangular regions, one per class)
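A toy version of such a tree, hard-coded rather than fitted; the thresholds and leaf labels here are my own illustration, since the slide's tree is generic:

```python
def cart_predict(x1, x2, c1=0.5, c2=0.5, c3=0.8, c4=0.7):
    """Nested binary threshold questions on (X1, X2), mimicking the
    structure of a CART. Cut-offs c1-c4 and leaf classes are
    illustrative only."""
    if x1 <= c1:
        return 1                      # leaf: class 1
    if x2 <= c2:
        return 2 if x1 <= c3 else 1   # inner node asks X1 > C3
    return 1 if x2 <= c4 else 2       # inner node asks X2 > C4

print(cart_predict(0.3, 0.9))  # → 1 (X1 <= C1, first leaf)
```

Fitting a real CART means choosing each variable and cut-off to make the two child regions as homogeneous as possible.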
28. Random Forest
• A collection of unpruned CARTs
• Bagging avoids overfitting
• Improved prediction accuracy
• Encourages diversity among the trees
(Figure: input X classified by Tree 1, Tree 2 and Tree 3)
Svetnik V. et al., J Chem Inf Comput Sci 2003 Nov-Dec;43(6):1947-58
29. Features of Random Forest
o A cross-validation procedure is built into random forest, as each tree in the forest has its own training (bootstrap) data and test (OOB) data
o The OOB error rate estimates the overall percentage of misclassification
o The forest also ranks the features by their importance for the classification
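The bootstrap/OOB split behind this built-in cross-validation can be sketched in a few lines (function name and seed are my own; real implementations draw one such split per tree):

```python
import random

def bootstrap_split(n_samples, seed=0):
    """Draw a bootstrap sample of size n (with replacement); the
    indices never drawn form the out-of-bag (OOB) set, used to test
    the tree grown on this bootstrap."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_split(10)
print(len(in_bag))  # always n
print(oob)          # indices never sampled, the OOB test set
```

On average about one third of the samples end up out-of-bag for any given tree, which is where the OOB error rate comes from.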
30. Random Forest-based classification into Carbohydrate Metabolism, Amino acid metabolism or Nucleotide metabolism
Query sequence → feature extraction:
• Dipeptide frequency (AD: 0.08, RH: 0.02)
• Amino acid composition (A: 0.14, F: 0.05)
MODEL: trees t1, t2 and t3 each split the classes A, B and C differently
Classification is based on the majority of votes
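The majority-of-votes step itself is simple; a minimal sketch, assuming each tree has already returned a class label for the query:

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Each tree casts one unweighted vote; the majority determines
    the class of the query (ties broken by first-seen order)."""
    return Counter(tree_predictions).most_common(1)[0][0]

votes = ["Carbohydrate Metabolism",
         "Amino acid metabolism",
         "Carbohydrate Metabolism"]
print(forest_vote(votes))  # → Carbohydrate Metabolism (wins 2-1)
```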
31. Prediction of protein-RNA binding sites using Random Forest
Zhi-Ping Liu et al., Bioinformatics, 2010
• Protein-RNA interaction plays a key role in a number of biological processes
Dataset: 339 protein-RNA complexes from RsiteDB; ENTANGLE was used to define the interaction sites between the protein chains and RNA
Features: interaction propensity, hydrophobicity, relative excessive surface area, secondary structure, conservation score and side-chain environment
32. Machine Learning Methods are Becoming Popular for Biological Data Analysis
(Figure: number of publications per year in PubMed for SVM and HMM (1976-2013) and for Random Forest (2003-2013), each rising to several hundred per year)
http://www.ncbi.nlm.nih.gov/pubmed
33. Flow of Presentation
Introduction
Beginning of Genomics
Sequencing era
Metagenomics
The Conventional Approach
Machine learning approaches and their implementations
SVM
HMM
Naive Bayes
Random Forest
Work done
Future directions
34. Implementation of Machine Learning for the Analysis of Metagenomic Data in my Recent Projects
: A fast and accurate functional classifier of genomic and metagenomic sequences
35. METHODOLOGY: the eggNOG database was used
Metagenome → sequencing, assembly and ORF prediction (ORF1, ORF2), the routine first steps of metagenomic analysis
From functional class to functional annotation:
Class  Group                               Annotation
O      Cellular Processes and Signaling    Serine-type endopeptidase
J      Information Storage and Processing  tRNA synthetase
2.3 million sequences were divided into 22 functional classes
Dipeptides were used as input features for optimization and training of the Random Forest
The final Random Forest model was integrated with RAPSearch2
Manuscript submitted, 2014
36. Standalone server
Query sequence (genomic or metagenomic) → Random Forest (functional class prediction) → RAPSearch (functional annotation)
37. : A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets
38. METHODOLOGY: the Greengenes database was used
Metagenome → 16S rRNA, the marker gene used to identify microbial species → sequencing of either a hypervariable region (HVR) or the complete 16S → taxonomic classification, a routine task of metagenomic analysis
Sequences for the hypervariable regions were extracted and grouped according to their taxonomic information
4-mer nucleotide composition was used as the input feature for training and optimization of the RF
Sequences discarded during clustering, together with real metagenomic 16S sequences, were used for testing
Manuscript submitted, 2014
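The 4-mer nucleotide composition feature can be sketched just like the dipeptide feature, giving a fixed-length vector of 256 values per sequence (function name and example sequence are my own):

```python
from collections import Counter
from itertools import product

def tetranucleotide_composition(seq):
    """Frequency vector over all 256 possible 4-mers of a DNA
    sequence, as used to train the RF on 16S hypervariable regions."""
    kmers = [seq[i:i + 4] for i in range(len(seq) - 3)]
    counts = Counter(kmers)
    total = len(kmers)
    return {"".join(k): counts["".join(k)] / total
            for k in product("ACGT", repeat=4)}

comp = tetranucleotide_composition("ACGTACGTAAGG")
print(len(comp))       # → 256 features
print(comp["ACGT"])    # 2 of the 9 overlapping 4-mers
```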
39.
40. Flow of Presentation
Introduction
Beginning of Genomics
Sequencing era
Metagenomics
The Conventional Approach
Machine learning approaches and their implementations
SVM
HMM
Naive Bayes
Random Forest
Work done
Future directions
41. Future Directions
• Analysis of metagenomic data generated from
the laboratory projects
• Implementation of machine learning in the
analyses of metagenomic data
• Metabolic pathway analysis and reconstruction
42. Acknowledgement
•Thesis Supervisor : Dr. Vineet Sharma
•Lab Members:
•Dr. Sanjiv Kumar
•Darshan Dhakan
•Ankit Gupta
•Rituja Saxena
•Parul Mittal
•Vishnu Prasoodanan
•Harish K
•Nikhil Chaudhary
•IISER Bhopal for providing the fellowship for doctoral
research
Speaker notes
The history of genomics started when DNA was first isolated by Friedrich Miescher; a few years later the term "genome" was coined by Hans Winkler. As you all know, a genome is an organism's complete set of DNA, which contains its hereditary information.
The history of modern genomics began in the 1970s, when Sanger first reported his method for determining the order of nucleotides in DNA using chain-terminating nucleotide analogues.
In 1982 the first bacteriophage genome, around 48 kb in size, was determined using the shotgun digest method, before the first automated DNA sequencer appeared. From the sequence, reading frames for 46 genes were clearly identified.
There was a long wait after the bacteriophage genome, and then the first free-living bacterium was sequenced in 1995; its complete genome was around 1,800 kb. This came more than 100 years after DNA was first isolated. The sequencing of H. influenzae gave new directions to genome sequencing, and at the same time several large-scale genome sequencing projects for higher eukaryotes were started and completed.
Only six years later, the sequencing of the human genome was completed in 2001: the largest scientific effort in the history of mankind. The human genome contains 2.91 billion base pairs, with an estimated 30,000 to 40,000 genes. It was 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. The project cost around 3 billion US dollars and took 10 years to complete.
Sanger sequencing was used for all genomes, from lower organisms to higher eukaryotes, and was the major bottleneck in genome sequencing analysis because its time and cost are very high. After the successful completion of the Human Genome Project, several new approaches reached the market and started the new era of sequencing.
* Lambda is a bacterial virus, or bacteriophage, that infects the bacterial species Escherichia coli (E. coli).
NGS brought a spike in genomic analysis by speeding up the analysis and reducing the cost.
The majority of the microbes on Earth are still unknown because most microbes cannot be cultured.
Only a very small fraction of the microbes found in nature have been grown in pure culture, so we lack a comprehensive view of the genetic diversity present on the Earth's surface.
An approach to this problem has emerged, called metagenomics or environmental genomics.
This study shed new light on the diversity of life on Earth. More than 1.6 Gb of sequence from Sargasso Sea samples yielded 1.2 million previously unknown gene sequences. Before the analysis of the Sargasso Sea data, the NCBI non-redundant amino acid (nr) dataset contained some 1.5 million peptides, about 630,000 of which were classified as bacterial.
The size of the genomic and metagenomic databases increases rapidly, and this large amount of data requires efficient tools for fast and accurate analysis.
In both genomic and metagenomic analysis, the common steps after sequencing are:
Assembly
Annotation
And comparison
3. I will not talk much about genomic analysis here; I will mainly talk about the metagenomic analysis of high-throughput data.
Any metagenomic analysis mainly focuses on two questions. The first is "who is out there?": the prime focus is to find the organisms present in the specific environment, the proportion of each individual species and which organisms dominate, and then to ask whether the dominance of a particular type of organism makes this environment unique and differentiates it from others.
The second question focuses on the functional part: which genes are present in that environment, what functions they perform and in which pathways they occur. Using the abundances of the functions and pathways of the particular environment, we try to find the unique functions and pathways present in the environment of our choice.
To answer these two questions, sequencing is followed by sequential analysis steps, which I will discuss in my next slide.
Together these give a new understanding of the numbers and abundance of the microbial community and of how these parameters change in response to external stimuli.
In the study of complex communities, it is often necessary to address the question of how much sequence is enough to understand a community and to carry out comparative analyses of related communities. In many cases this information can be obtained by applying various methods based on the 16S rRNA sequence, which can reveal a tremendous amount of information about microbial diversity and abundance.
Metagenomics projects differ from traditional microbial sequencing projects in many respects.
An SVM finds the maximal margin that separates the two classes and then outputs the hyperplane separator at the center of the margin.
When the data are not linearly separable, all the points are projected into higher dimensions using a mapping, hoping that separability will improve; this mapping is called a kernel function.
Accurately identifying genes from metagenomic fragments is one of the most fundamental issues. I am discussing here a novel gene prediction method, MetaGUN, for metagenomic fragments, based on the SVM machine learning approach.
π(F) represents the vector of initial probabilities.
The quality of the seed alignment is the crucial factor determining the quality of the Pfam resource, influencing not only all data generated within the database but also the outcome of external searches that use its profile HMMs.
With strong independence assumptions between the predictors.
Bayesian decision theory came long before; it was studied in the field of statistical theory, and more specifically in the field of pattern recognition.
The most probable hypothesis is determined for the given data d plus initial knowledge about the prior probabilities; the prior probabilities reflect background knowledge.
P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
P(c) is the prior probability of the class.
P(x|c) is the likelihood: the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
The Ribosomal Database Project (RDP) Classifier, a naive Bayesian classifier, can rapidly and accurately classify bacterial 16S rRNA sequences into the higher taxonomy.
2. It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment.
3. Type sequences follow Bergey's taxonomy (average sequence length 1,460 bases, with a range of 1,200 to 1,833 bases).
4. For the complete rRNA database, near-full-length (>1,200 bases) 16S rRNA sequences with NCBI's taxonomy were obtained, and the taxonomic information for these databases was obtained from GenBank.
5. Let n(wi) be the number of sequences containing the subsequence wi.
6. The conditional probability that a member of G contains wi was estimated with the equation P(wi|G) = [m(wi) + Pi]/(M + 1).
7. The probability that an unknown query sequence S is a member of genus G is P(G|S) = P(S|G) * P(G)/P(S), where P(G) is the prior probability of a sequence being a member of G and P(S) is the overall probability of observing sequence S from any genus.
8. Overall classification accuracy is reported by query size.
In the standard classification situation we have observations from two or more known classes and want to develop a rule for assigning current and new observations to classes using numerical and/or categorical predictor variables.
Classification trees build these rules by recursive binary partitioning into regions that are increasingly homogeneous with respect to the class variable.
These homogeneous regions are called nodes. At each step in the fitting of the classification tree, an optimization is carried out to select a node and a particular variable cut-off (or group of codes) that results in a homogeneous subgroup of the data.
The root node is the entry point for the collection of data; at an inner node a question is asked about the data, with one child node per possible answer; a leaf node corresponds to the decision to take if it is reached.
RF uses an ensemble of decision trees built from the samples, their class designations and the variables, since results from ensemble models are much more satisfactory than those from a single model. Basically, each tree is created according to the implemented algorithm, and if pruning is enabled an additional step looks at which nodes/branches can be removed without affecting the performance too much.
2. There are two kinds of ensembles: bagging and boosting. In bagging we do not look back to the earlier trees, while boosting considers the earlier trees and strives to compensate for their weaknesses (which can lead to overfitting of the data). RF is an example of a bagging method.
3. RF is among the most popular machine learning methods nowadays because (1) its versatile classification algorithm makes it suitable for the analysis of large data sets, (2) it gives higher prediction accuracy and provides information on variable importance, and (3) it is very effective, fast and easy to use.
Algorithm
1. Bootstrapping is used to grow the classification trees in the forest. If there are n samples, n cases are drawn at random with replacement; about 2/3 of the original data are used for training and the remaining 1/3, called the out-of-bag (OOB) data, are used for prediction. The resulting error rate is called the out-of-bag error rate.
2. If there are M input variables, a number m << M is specified such that at each node m variables are selected at random and evaluated for their ability to split the data. The variable producing the largest decrease in impurity is chosen to separate the samples at each parent node. The impurity measure here is the Gini impurity; a decrease in Gini impurity corresponds to an increase in the amount of order in the sample classes introduced by a split.
3. Random selection of the variables for splitting ensures low correlation between the trees and prevents overfitting of the RF model.
Last point: every classification tree in the forest casts an unweighted vote for the sample, and the majority of the votes determines the class of the sample. A single tree in the forest is a weak classifier because it trains on a subset of the data; that is why the contribution of all the trees together makes the forest a strong classifier.
4. The training process is completed when the forest is fully grown; the trained model can then be used to predict the classes of unknown samples.
The expected error rate of the classification of a new sample by a classifier is usually estimated by a cross-validation procedure such as leave-one-out or k-fold cross-validation; RF instead aggregates the OOB error rate from all trees.
In addition to this internal cross-validation, RF also calculates variable importance.
In the forest model, the values of a predictor variable are randomly shuffled to break the association between the response and the predictor values; the importance of the variable is related to the sum of the decreases in Gini impurity at every node in the forest where that variable is used for splitting.
To calculate the permutation variable importance, the prediction accuracy after permutation is subtracted from the accuracy before permutation and averaged over all trees in the forest to give the permutation importance value.
If the predictor never had any meaningful association with the response, shuffling its values will produce very little or no change in model accuracy; on the other hand, if the predictor was strongly correlated with the response, permutation should create a large change or drop in accuracy.
These two measures help to find the variables that are highly related to the response and to select a small number of variables that give good prediction.
An example of predicting RNA binding sites. (a) Actual interface residues with RNA in protein 1R3E:A. (b) Predictions are mapped onto the original structure where different prediction catalogs are represented by different colors. (c) Structure of the protein–RNA complex with an example of prediction in the zoomed part. (d) Mutual interaction propensity between the triplets and nucleotides in the protein. Triplets are listed by sliding residues through the protein sequence. The box part corresponds to the values of residues in the zoomed part of (c). (e) Upper panel shows the interface propensity of each amino acid type in the dataset. It is defined as the proportion of an amino acid in interaction sites divided by the proportion of the residue in the dataset (see more in Supplementary Materials). Lower panel shows the interface propensity of binding with RNA for the residues in the protein. The box part corresponds to the values of the zoomed sites.
PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics.
590 publications for "naive bayes".