Computational Analysis of High-throughput
Biological Data Using
Machine Learning Approaches
Ashok K Sharma
1220104
MetaInformatics Laboratory
IISER Bhopal
Topics Covered in the Talk
• Introduction
• Beginning of Genomics
• Current Sequencing Scenario
• Metagenomic Approaches
• The Conventional Approach for Data Analysis- Limitations
• Machine Learning Approaches and their Implementations
  o SVM
  o HMM
  o Naive Bayes
  o Random Forest
• Discuss the Work Done So Far
• Future Directions
Beginning of Genomics
First DNA isolated by Swiss
physician Friedrich
Miescher in 1869
The term genome was used
by German botanist Hans
Winkler in 1920
The history of modern genomics began in the 1970s
Nucleotide sequence of bacteriophage lambda DNA (~48 kb)
F. Sanger et al., J Mol Biol, 1982
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd (~1,800 kb)
Fleischmann RD et al., Science, 1995
Sequencing and analysis of the human genome (3 billion bp, 3 billion USD, 10 Years)
ES Lander et al., Nature, 2001
Sequencer      Read length   ~Cost/Mb   ~Data/run
Roche 454      400-800 bp    $20        450 Mb
Ion Torrent    200 bp        $2         10 Mb-1 Gb
Illumina       150 bp        $0.50      600 Gb
PacBio SMRT    ~20 kb        $1.4       350 Mb
(Images: Roche 454 sequencer, Ion Torrent, Illumina/Solexa sequencer)
Next Generation Sequencing Technologies
Leading to The Sequencing Era
Metagenomics: New Approach to Sequence the Unknown
•The first idea of cloning DNA directly from environmental samples was proposed by Pace in 1985
•The term “metagenome” was coined by Handelsman in 1998
The First Large Scale Metagenomics Project:
Environmental Genome Shotgun Sequencing
of the Sargasso Sea
J. C. Venter et al., Science, 2004
The First Large Scale Organismal Study:
Model Study Comparing the Gut Flora of 124
European individuals
Qin et al., Nature, 2010
Sargasso Sea: 1.6 GB and 1.2 million genes; gut flora: 576.7 GB and 3.3 million genes
98% of bacteria cannot be cultured and hence cannot be sequenced by conventional genomics
Genomics and Metagenomics Have Exponentially
Increased the Sequence databases
(Figure: Published papers on “Metagenomics” in PubMed)
(Figure: Cost per Human Genome, falling from $100M to $1000)
(Figure: Growth of GenBank (1984-2013), Sequences (in millions))
Running projects (https://gold.jgi-psf.org/):
• Metagenomic: 538
• Non-Metagenomic: 18787
What Metagenomics can Answer ?
What, How, Who
• Species Diversity and Species Richness: Arcobacter, Paludibacter, Shewanella, Pseudomonas, Unknown
• Metabolic Capabilities and Functional Potential
Genomics vs Metagenomics
GENOMICS: Culture a Single Microbe → Fragmentation of DNA → Sequencing
…AGCTAAAATTACCGGG…
METAGENOMICS: Community of Microbial Species, Mainly Unculturable → Fragmentation of DNA → Sequencing → Analysis
…GGATCCATCGTACCGATTC..
…TTACAATTTA…
…CCATGGCCGAAATTTCGTA…
…AGCTAAAATTACCGGGGAT…
The Metagenomic Challenges
• Assembly
•Taxonomic Assignment
• Metabolic Pathway Construction
• Gene Prediction
• Functional Annotation
• Comparative Analysis
Assembly
DNA Isolation
Conventional Methods Cannot be Used for
Metagenomic Data Analysis
Database : 4.7 million Sequences
Query
Seeds
•Homology Based Approach- BLAST
•Most widely used by researchers
•A heuristic seed-and-extend search; dynamic programming is used only to extend and score the seed hits
Each query sequence is fragmented into seeds (short words) and
searched against all sequences of the database
It would take about 17 years on a 2.6 GHz Xeon PC to carry out the BLAST of
>3 million metagenomic genes from one project
BLAST of 1000 genes against the NR database: ~1 Day (25.5 Hrs.), ~2 Days (47.1 Hrs.)
(Figure: NCBI NR database size: < 1 GB, ~4 GB, ~10 GB, ~13 GB, ~17 GB; years shown: 2012, 2013, 2014; Future: ????)
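The 17-year figure can be roughly reproduced from the timings on this slide. The sketch below is a back-of-the-envelope check, not the author's calculation: it assumes the slower timing (47.1 hours per 1000 genes), assumes that run time scales linearly with the number of queries, and takes the 3.3 million gene count from the gut catalogue quoted earlier in the talk as the project size.

```python
HOURS_PER_1000_GENES = 47.1   # slower of the two timings quoted on the slide
N_GENES = 3_300_000           # assumption: the 3.3 million gene catalogue cited earlier
HOURS_PER_YEAR = 24 * 365


def blast_years(n_genes, hours_per_1000=HOURS_PER_1000_GENES):
    """Extrapolate single-machine BLAST time linearly in the number of query genes."""
    return n_genes / 1000 * hours_per_1000 / HOURS_PER_YEAR


print(round(blast_years(N_GENES), 1))  # 17.7
```

With the faster timing (25.5 hours) the same extrapolation gives a little under 10 years, so the estimate is sensitive to the database size at the time of the run.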
Machine Learning- Valuable Alternatives
Key idea: Learn from known data and Predict on unknown data
Database : 4.7 million Sequences
Query
Searching One against All
Memorize the information
Processing all at Once
• Learning: From known examples or data
• Hypothesis: Derives a hypothesis based on training examples
• Prediction: Based on the hypothesis, makes predictions on the unknown query
Machine Learning Techniques
• Support Vector Machines
• Random Forest
• Artificial Neural Networks
• Clustering
• Naïve Bayesian
• Hidden Markov Models
• Reliability / Accuracy
• Adaptation to new Environment
• Sensitivity to Noise
• Ability to handle diverse data
• Speed
• Limitations
Machine Learning
•Supervised
•Unsupervised
Properties of Training Examples
• Training Dataset : Well curated and free from noise.
• Features : Fixed length patterns
Example: the protein sequence MKWMPFVGTMPLVQTKSITDLCAPLC is read as overlapping dipeptides
MK, KW, WM, MP, …
Example feature matrix (fragment of the slide figure):
     M      I      W      . . .
M    0.12   0.34   0.09   . . .
I    0.28   0.19   0.41   . . .
W    .      .      0.24   . . .
P    -      0.17   -      - - -
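A minimal sketch of how a protein sequence could be turned into a fixed-length dipeptide feature vector, as the slide describes. The function name, the 20-letter alphabet ordering and the choice to normalise by the number of overlapping pairs are assumptions made for illustration; the talk does not specify them.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 possible dipeptides, in a fixed order so every sequence maps to the same layout.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(seq):
    """Return a 400-long vector of overlapping dipeptide frequencies."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


features = dipeptide_frequencies("MKWMPFVGTMPLVQTKSITDLCAPLC")
print(len(features))                      # 400
print(features[DIPEPTIDES.index("MP")])   # 0.08 (2 of 25 overlapping pairs)
```

Because the vector length is always 400, sequences of any length can be fed to the same classifier.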
Support Vector Machines (SVM)
SVMs find the maximal margin which separates two classes
(Figure: Class 1 and Class 2 separated in the X1-X2 plane)
Support Vector Machines (SVM): kernels (Ben-Hur et al., 2008)
• Linear Kernel
• Polynomial Kernel, d = 2
• Gaussian Kernel, sigma = 1
(Figure: boundaries between Class 1 and Class 2 for each kernel; axes X1, X2, X3)
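The three kernels named on the slide can be written down in a few lines. This is a sketch using common textbook parameterisations, which are assumptions here: the polynomial kernel is taken as (x·z + c)^d with c = 1, and the Gaussian kernel as exp(-||x - z||^2 / (2 sigma^2)). The cited review may scale the Gaussian width differently.

```python
import math


def linear_kernel(x, z):
    """Plain dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, d=2, c=1.0):
    """Polynomial kernel of degree d (d=2 as on the slide); c is an assumed offset."""
    return (linear_kernel(x, z) + c) ** d


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel; sigma=1 as on the slide, scaling convention assumed."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * sigma ** 2))


x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))       # 11.0
print(polynomial_kernel(x, z))   # 144.0
print(gaussian_kernel(x, x))     # 1.0
```

A kernel value plays the role of a similarity score, so the SVM can separate classes in the implicit higher-dimensional space without computing the mapping explicitly.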
One SVM model per class:
Model 1: Carbohydrate Metabolism
Model 2: Amino acid metabolism
Model 3: Nucleotide metabolism
Prediction for a Query Sequence:
1. Feature Extraction
   •Dipeptide Frequency, e.g. AD : 0.08, RH : 0.02
   •Amino Acid Composition, e.g. A : 0.14, F : 0.05
2. Mapped into high-dimension feature space
3. Each model returns a prediction value (Prediction Value 1, 2, 3)
4. Classification Based on Maximum Prediction Value
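The last step of this scheme, choosing the class whose model gives the maximum prediction value, can be sketched as follows. The weights and biases below are hypothetical numbers chosen only for illustration, and each model is reduced to a linear decision function w·x + b; a trained SVM with a non-linear kernel would compute its prediction value differently.

```python
def decision_value(weights, bias, features):
    """Linear decision function w.x + b for one per-class model."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(models, features):
    """Score the query with every model and return the best class and all scores."""
    scores = {name: decision_value(w, b, features) for name, (w, b) in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical pre-trained models over two features (fractions of A and F).
MODELS = {
    "Carbohydrate metabolism": ([1.0, -0.5], 0.0),
    "Amino acid metabolism": ([-0.5, 1.0], 0.0),
    "Nucleotide metabolism": ([-0.5, -0.5], 0.1),
}

label, scores = classify(MODELS, [0.14, 0.05])  # A : 0.14, F : 0.05 from the slide
print(label)
```

With one model per class, the query is assigned to exactly one class even when several models give positive values.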
GPCRpred: an SVM-based method for prediction of families and
subfamilies of G-protein coupled receptors
Bhasin M, and Raghava G P S, Nucl. Acids Res. 2004;32:W383-W389
• G-protein coupled receptors (GPCRs) are important targets for drug design.
• Dipeptide frequency used as a feature
• Cascade of SVMs on a Protein Sequence: GPCR Recognition (Is GPCR ?), then Family Prediction, then Sub-Family Prediction
• 99.5% Accuracy
Hidden Markov Models (HMM)
• A powerful statistical tool widely used in modeling sequences
• Markov Chains: e.g. What is the probability of Rain today?
• Making Profiles from aligned sequences:
AYTGGGTACC
AYT-GGTMCC
AYCGGG-MC-
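The idea of making a profile from the three aligned sequences on this slide can be illustrated with per-column residue frequencies. This is a simplified sketch and not a full profile HMM: it has no insert or delete states and no transition probabilities, gaps are simply ignored, and the pseudo-probability used for unseen residues is an assumed value.

```python
import math
from collections import Counter

ALIGNMENT = ["AYTGGGTACC", "AYT-GGTMCC", "AYCGGG-MC-"]


def column_profile(alignment, gap="-"):
    """Per-column residue frequencies, computed over the non-gap characters."""
    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        counts = Counter(residues)
        profile.append({r: n / len(residues) for r, n in counts.items()})
    return profile


def log_score(profile, seq, pseudo=0.01):
    """Sum of log frequencies of an ungapped query of the same length as the profile."""
    return sum(math.log(col.get(c, pseudo)) for col, c in zip(profile, seq))


profile = column_profile(ALIGNMENT)
print(profile[0])                      # {'A': 1.0}
print(log_score(profile, "AYTGGGTACC") > log_score(profile, "CCCCCCCCCC"))  # True
```

A query that resembles the family scores higher than an unrelated one, which is the intuition behind matching a sequence against a database of profiles.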
HMM Profile Database: Carbohydrate Metabolism, Amino acid metabolism, Nucleotide metabolism
QUERY SEQUENCE
Prediction Based on Best Profile Match
The Pfam Protein Families Database
• A large collection of protein families, each represented by multiple
sequence alignments and hidden Markov models (HMMs)
• Identification of domains that occur within proteins can provide insights
into their function
Steps used for building Pfam:
• Manually curated collection of protein families (3,071 families)
• Each curated family is represented by a seed alignment and a full alignment
• HMM profiles are built using HMMER3.0
• Widely used for identification of protein structure and function
Marco Punta et al., Nucleic Acids Res, 2011
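Naive Bayes Classifier
1. Simple probabilistic classifier based on Bayes’ theorem
2. Goal is to determine the most probable hypothesis
P(class | X) is proportional to P(class) * P(X | class), where P(class) is the prior probability of the class and P(X | class) is the likelihood of X given that class
(Figure: Class 1 and Class 2 on axes X1 and X2)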
Kohenen J. et al. In Silico Biol 2009;9(1-2):23-34.
Naive Bayesian Classifier for Rapid Assignment of
rRNA Sequences into New Bacterial Taxonomy
Qiong Wang et al., Appl Environ Microbiol, 2007
Algorithm: Word sizes between 6 and 9 bases
Word-specific priors: Pi = [n(wi) + 0.5]/(N + 1)
Genus-specific conditional probabilities: P(wi|G) = [m(wi) + Pi]/(M + 1)
Naive Bayesian assignment: P(G|S) = P(S|G) * P(G)/P(S)
Bootstrap confidence estimation: For each query sequence
Example of overlapping 8-base words from AUGCGUCAGCUCGAUCGAUCUA:
AUGCGUCA
UGCGUCAG
GCGUCAGC
CGUCAGCU
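The word-based formulas above can be sketched in a few functions. This is an illustration under stated assumptions, not the published implementation: all genera are given equal priors so that P(G) and P(S) drop out of the comparison, log-probabilities are summed instead of multiplying, the bootstrap confidence step is omitted, and the toy training sequences and genus names are invented.

```python
import math


def words(seq, k=8):
    """Set of overlapping k-base words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training, k=8):
    """training maps genus -> list of sequences; returns word counts n, N and per-genus m, M."""
    all_seqs = [s for seqs in training.values() for s in seqs]
    n = {}
    for s in all_seqs:
        for w in words(s, k):
            n[w] = n.get(w, 0) + 1
    model = {}
    for genus, seqs in training.items():
        m = {}
        for s in seqs:
            for w in words(s, k):
                m[w] = m.get(w, 0) + 1
        model[genus] = (m, len(seqs))
    return n, len(all_seqs), model


def log_score(query, genus, n, N, model, k=8):
    """log P(S|G) = sum over query words of log([m(wi) + Pi] / (M + 1))."""
    m, M = model[genus]
    total = 0.0
    for w in words(query, k):
        prior = (n.get(w, 0) + 0.5) / (N + 1)        # Pi
        total += math.log((m.get(w, 0) + prior) / (M + 1))
    return total


def classify(query, n, N, model, k=8):
    """Genus with the highest log P(S|G), assuming equal genus priors."""
    return max(model, key=lambda g: log_score(query, g, n, N, model, k))


# Toy data (invented) with a short word size so the example stays small.
TRAINING = {
    "GenusA": ["AUGCGUCAGCUC", "AUGCGUCAGGUC"],
    "GenusB": ["CCCCGGGGAAAA", "CCCCGGGGUUUU"],
}
n, N, model = train(TRAINING, k=4)
print(classify("AUGCGUCA", n, N, model, k=4))  # GenusA
```

The 0.5 in the word prior keeps unseen words from producing a zero probability, so a single novel word does not rule out a genus.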
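Classification and Regression Trees (CART)
(Figure: a binary decision tree with Yes/No splits on X1 > C1, X2 > C2, X1 > C3 and X2 > C4 and leaves labelled class 1 or 2, shown next to the corresponding partition of the X1-X2 plane at the thresholds C1, C2, C3 and C4)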
Random Forest
• Collection of unpruned CARTs
• Bagging reduces overfitting
• Improves prediction accuracy
• Encourages diversity among the trees
(Figure: input X passed to Tree 1, Tree 2 and Tree 3)
Svetnik V. et al., J Chem Inf Comput Sci 2003 Nov-Dec;43(6):1947-58
Features of Random Forest
o Cross validation is inbuilt in random forest, as each
tree in the forest has its own training (bootstrap) and test data
(OOB, out-of-bag, data)
o The OOB error rate estimates the overall percentage of
misclassification
o Calculates the importance of each feature for the classification
MODEL: trees t1, t2 and t3, each splitting the training classes A, B and C in a different order
Query Sequence
Feature Extraction
•Dipeptide Frequency, e.g. AD : 0.08, RH : 0.02
•Amino Acid Composition, e.g. A : 0.14, F : 0.05
Classification Based on the Majority of votes
Classes: Carbohydrate Metabolism, Amino acid metabolism, Nucleotide metabolism
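Bagging, out-of-bag (OOB) testing and majority voting can be sketched with a deliberately small forest. To keep the example short, each tree here is a one-split decision stump rather than an unpruned CART, which is a simplification and not what a real random forest uses; the toy data, the function names and the square-root feature subsampling are likewise assumptions for illustration.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature_ids):
    """Return (feature, threshold, left_label, right_label) with the fewest errors."""
    best = None
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive observed values.
        cuts = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
        for t in cuts:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_lab = majority(left)
            r_lab = majority(right) if right else l_lab
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and remember its out-of-bag rows."""
    rng = random.Random(seed)
    n, n_feat = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap sample
        feats = rng.sample(range(n_feat), max(1, int(n_feat ** 0.5)))
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        forest.append((stump, set(range(n)) - set(idx)))      # OOB row indices
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    return majority([predict_stump(s, row) for s, _ in forest])


def oob_error(forest, rows, labels):
    """Fraction of rows misclassified by the trees that did not train on them."""
    wrong = counted = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = [predict_stump(s, row) for s, oob in forest if i in oob]
        if votes:
            counted += 1
            wrong += majority(votes) != y
    return wrong / counted if counted else float("nan")


ROWS = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.9, 0.8], [0.8, 0.9], [0.85, 0.7]]
LABELS = ["A", "A", "A", "B", "B", "B"]
forest = fit_forest(ROWS, LABELS)
print(predict(forest, [0.05, 0.1]), predict(forest, [0.95, 0.95]))
print(oob_error(forest, ROWS, LABELS))
```

Because every tree leaves out roughly a third of the rows from its bootstrap sample, the OOB error gives a test estimate without a separate hold-out set.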
Prediction of protein-RNA binding site using
Random Forest
Zhi-Ping Liu et al., Bioinformatics, 2010
• Protein-RNA interaction plays a key role in a number of
biological processes
Dataset:
339 Protein-RNA complexes from RsiteDB
Entangle was used to define the interaction site between
protein chain and RNA
Features:
Interaction propensity, Hydrophobicity, Relative accessible
surface area, Secondary structure, Conservation score and
Side chain environment
Machine Learning methods are becoming
popular for Biological Data Analysis
(Figure: Number of publications per Year for SVM (1976-2013), HMM (1976-2013) and Random Forest (2003-2013); y-axis from 0 to 700)
http://www.ncbi.nlm.nih.gov/pubmed
Implementation of Machine Learning
for the Analysis of Metagenomic Data
in my Recent Projects
: A fast and accurate functional classifier
of genomic and metagenomic sequences
Routine task for metagenomic analysis:
Metagenome, then Sequencing, assembly and ORF prediction (ORF1, ORF2), then Functional Class and Functional Annotation

Class   Group                                Annotation
O       Cellular Processes and Signaling     Serine-Type endopeptidase
J       Information Storage and Processing   tRNA synthetase

METHODOLOGY: eggNOG database was used
• 2.3 million sequences were divided into 22 functional classes
• Dipeptide frequencies were used as input features for optimization and training of Random Forest
• Final Random Forest model was integrated with RAPsearch2
Stand alone server
Query Sequence (Genomic or Metagenomic): Random Forest gives the Functional Class Prediction, then RAPsearch gives the Functional Annotation
Manuscript Submitted, 2014
: A Tool for Fast and Accurate
Taxonomic Classification of 16S rRNA
Hypervariable Regions in Metagenomic
Datasets
Routine task for metagenomic analysis:
Metagenome, then Sequencing of either HVR or Complete 16S, then Taxonomic Classification
16S rRNA: Marker gene to identify microbial species
METHODOLOGY: Greengenes database was used
• Sequences for hypervariable regions (HVR) were extracted and grouped according to taxonomic information
• 4-mer nucleotide composition was used as the input feature for training and optimization of RF
• Sequences discarded during clustering and real metagenomic 16S sequences were used for testing
Manuscript Submitted, 2014
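A 4-mer nucleotide composition vector of the kind described above could be computed as follows. This is a sketch and not the tool's actual code: the function name, the fixed ACGT ordering, the conversion of U to T and the decision to skip words containing ambiguous bases are all assumptions.

```python
from itertools import product


def kmer_composition(seq, k=4, alphabet="ACGT"):
    """Return the frequencies of all len(alphabet)**k k-mers in a fixed order."""
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:          # skip words with ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


vector = kmer_composition("ACGTACGT")
print(len(vector))   # 256
print(vector[27])    # 0.4, the frequency of ACGT (2 of 5 words)
```

With k = 4 every sequence, whatever its length, becomes a 256-dimensional vector that a Random Forest can take as input.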
Future Directions
• Analysis of metagenomic data generated from
the laboratory projects
• Implementation of machine learning in the
analyses of metagenomic data
• Metabolic pathway analysis and reconstruction
Acknowledgement
•Thesis Supervisor : Dr. Vineet Sharma
•Lab Members:
•Dr. Sanjiv Kumar
•Darshan Dhakan
•Ankit Gupta
•Rituja Saxena
•Parul Mittal
•Vishnu Prasoodanan
•Harish K
•Nikhil Chaudhary
•IISER Bhopal for providing the fellowship for doctoral
research
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 

Machine Learning

  • 1. Computational Analysis of High-throughput Biological Data Using Machine Learning Approaches Ashok K Sharma 1220104 MetaInformatics Laboratory IISER Bhopal
  • 2. Topics Covered in the Talk: Introduction; Beginning of Genomics; Current Sequencing Scenario; Metagenomic Approaches; The Conventional Approach for Data Analysis: Limitations; Machine Learning Approaches and their Implementations; SVM; HMM; Naive Bayes; Random Forest; Discuss the Work Done So Far; Future Directions
  • 3. Beginning of Genomics. First DNA isolated by Swiss physician Friedrich Miescher in 1869. The term genome was used by German botanist Hans Winkler in 1920. The history of modern genomics began in the 1970s. Nucleotide sequence of bacteriophage lambda DNA (~48 kb), F. Sanger et al., J Mol Biol, 1982. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd (~1,800 kb), Fleischmann RD et al., Science, 1995. Sequencing and analysis of the human genome (3 billion bp, 3 billion USD, 10 years), ES Lander et al., Nature, 2001.
  • 4. Next Generation Sequencing Technologies Leading to The Sequencing Era
      Sequencer      Read length   ~Cost/Mb   ~Data/run
      Roche 454      400-800 bp    $20        450 Mb
      Ion Torrent    200 bp        $2         10 Mb-1 Gb
      Illumina       150 bp        $0.50      600 Gb
      PacBio SMRT    ~20 kb        $1.4       350 Mb
      [Images: Ion Torrent, Roche 454 sequencer, Illumina/Solexa sequencer]
  • 5. Metagenomics: New Approach to Sequence the Unknown • The first idea of cloning DNA directly from environmental samples was proposed by Pace in 1985 • The term “metagenome” was coined by Handelsman in 1998. The First Large Scale Metagenomics Project: Environmental Genome Shotgun Sequencing of the Sargasso Sea, C. Venter et al., Science, 2004 (1.6 GB and 1.2 million genes). The First Large Scale Organismal Study: Model Study Comparing the Gut Flora of 124 European individuals, Qin et al., Nature, 2010 (576.7 GB and 3.3 million genes). 98% of bacteria cannot be cultured and hence cannot be sequenced
  • 6. Genomics and Metagenomics Have Exponentially Increased the Sequence databases [Figures: published papers on “Metagenomics” in PubMed; cost per human genome, $100M down to $1000; growth of GenBank (1984-2013), sequences in millions] • Running projects (https://gold.jgi-psf.org/): • Metagenomic: 538 • Non-Metagenomic: 18787
  • 7. What Metagenomics can Answer? Who, What, How. Species Diversity: Arcobacter, Paludibacter, Shewanella, Pseudomonas, Unknown. Species Richness.
  • 10. Genomics vs Metagenomics …GGATCCATCGTACCGATTC.. …TTACAATTTA… …CCATGGCCGAAATTTCGTA… …AGCTAAAATTACCGGGGAT… Community of Microbial Species- Mainly Unculturable Fragmentation of DNA Sequencing Analysis Culture a Single Microbe Fragmentation of DNA Sequencing …AGCTAAAATTACCGGG… GENOMICS METAGENOMICS The Metagenomic Challenges • Assembly •Taxonomic Assignment • Metabolic Pathway Construction • Gene Prediction • Functional Annotation • Comparative Analysis Assembly DNA Isolation
  • 11. Flow of Presentation: Introduction; Beginning of Genomics; Sequencing era; Metagenomics; The Conventional Approach for Data Analysis: Limitations; Machine learning approaches and their implementations; SVM; HMM; Naive Bayes; Random Forest; Work done; Future directions
  • 12. Conventional Methods Cannot be Used for Metagenomic Data Analysis • Homology Based Approach: BLAST • Most widely used by researchers • Dynamic Programming is used. Each query sequence is fragmented into seeds and searched against all sequences of the database (database: 4.7 million sequences). It will take about 17 years on a Xeon 2.6 GHz PC to carry out the BLAST of >3 million metagenomic genes from one project. [Figure: BLAST of 1000 genes against the NR database, ~1 day (25.5 hrs) and ~2 days (47.1 hrs); growth of NCBI NR: < 1 GB, ~4 GB, ~10 GB, ~13 GB, ~17 GB; years 2012, 2013, 2014, future?]
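The "about 17 years" figure on this slide can be roughly reproduced with a linear extrapolation. The sketch below is only a back-of-the-envelope check: it assumes the slower of the two timings shown on the slide (47.1 hours per 1000 genes) and takes the 3.3 million genes of the gut study as the ">3 million genes"; both pairings are assumptions, not stated on the slide.

```python
# Back-of-the-envelope BLAST runtime estimate (illustrative assumptions only).
HOURS_PER_1000_GENES = 47.1   # assumed: the "~2 days (47.1 hrs)" timing from the slide
N_GENES = 3_300_000           # assumed: the 3.3 million genes reported for the gut study


def blast_years(n_genes, hours_per_1000):
    """Extrapolate single-machine runtime linearly in the number of query genes."""
    hours = n_genes / 1000 * hours_per_1000
    return hours / (24 * 365)


years = blast_years(N_GENES, HOURS_PER_1000_GENES)
print(f"{years:.1f} years")
```

With these inputs the estimate lands between 17 and 18 years; with the faster 25.5-hour timing it would be a little under 10 years.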
  • 13. Flow of Presentation: Introduction; Beginning of Genomics; Sequencing era; Metagenomics; The Conventional Approach; Machine learning approaches and their implementations; SVM; HMM; Naive Bayes; Random Forest; Work done; Future directions
  • 14. Machine Learning: Valuable Alternatives. Key idea: Learn from known data and Predict on unknown data. Conventional search: Query against Database (4.7 million sequences), searching one against all. Machine learning: memorize the information, processing all at once. • Learning: from known examples or data • Hypothesis: derives a hypothesis based on training examples • Prediction: based on the hypothesis, predictions on unknown query
  • 15. Machine Learning Techniques: Support Vector Machines, Random Forest, Artificial Neural Networks, Clustering, Naïve Bayesian, Hidden Markov Models • Reliability / Accuracy • Adaptation to new Environment • Sensitivity to Noise • Ability to handle diverse data • Speed • Limitations. Machine Learning • Supervised • Unsupervised
  • 16. Properties of Training Examples • Training Dataset: well curated and free from noise. • Features: fixed length patterns. Example sequence MKWMPFVGTMPLVQTKSITDLCAPLC, read as overlapping patterns (MK, KW, WM, MP, ...) and summarized as a frequency matrix:
            M      I      W     ...
      M    0.12   0.34   0.09   ...
      I    0.28   0.19   0.41   ...
      W     .      .     0.24   ...
      P     -     0.17    -     ...
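As a rough illustration of "fixed length patterns", the sketch below turns a protein sequence of any length into two fixed-length vectors: amino acid composition (20 values) and dipeptide frequency (400 values). The function names are made up for this example, and normalizing by the number of overlapping pairs is an assumption; the slide does not say how the frequencies were computed.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each of the 20 amino acids in seq (20 features)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dipeptide_frequency(seq):
    """Fraction of each of the 400 ordered residue pairs (400 features)."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1  # number of overlapping dipeptides
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


seq = "MKWMPFVGTMPLVQTKSITDLCAPLC"  # example sequence from the slide
aac = amino_acid_composition(seq)
dpf = dipeptide_frequency(seq)
print(len(aac), len(dpf))
```

Whatever the input length, every sequence maps to vectors of the same size, which is what classifiers such as SVM or Random Forest need.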
  • 17. Support Vector Machines (SVM). SVMs find the maximal margin which separates two classes. [Figure: Class 1 and Class 2 in the X1-X2 plane]
  • 18. Support Vector Machines (SVM). [Figure: Class 1 and Class 2 in the X1-X2 plane and mapped to X1, X2, X3; decision boundaries for a linear kernel, a polynomial kernel (d = 2) and a Gaussian kernel (sigma = 1) (Ben-Hur et al., 2008)]
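The three kernels named on this slide can be written down in a few lines. This is a minimal sketch of the kernel functions only, not of SVM training; the offset `c = 1.0` in the polynomial kernel is an assumption, since the slide gives only the degree d = 2.

```python
import math


def linear_kernel(x, y):
    """Plain dot product: similarity in the original feature space."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, d=2, c=1.0):
    """(x.y + c)^d; c is an assumed offset, d = 2 as on the slide."""
    return (linear_kernel(x, y) + c) ** d


def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)); sigma = 1 as on the slide."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


x, y = (1.0, 2.0), (3.0, 0.5)
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))
```

Each kernel returns the inner product of the two points in some (possibly much higher-dimensional) feature space without ever building that space explicitly.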
  • 19.
  • 20. Query Sequence → Feature Extraction (• Dipeptide Frequency, e.g. AD: 0.08, RH: 0.02 • Amino Acid Composition, e.g. A: 0.14, F: 0.05) → Mapped into high-dimension feature space → Model 1, Model 2, Model 3 (Carbohydrate Metabolism, Amino acid metabolism, Nucleotide metabolism) → Prediction Value 1, Prediction Value 2, Prediction Value 3 → Classification Based on Maximum Prediction Value
  • 21. GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Bhasin M. and Raghava G. P. S., Nucl. Acids Res. 2004;32:W383-W389. G-protein coupled receptors (GPCRs) are important targets for drug design. Dipeptide frequency is used as a feature. [Figure: Protein Sequence; SVM: Is GPCR?; stages: GPCR Recognition, Family Prediction, Sub-Family Prediction; 99.5% Accuracy]
  • 22. Hidden Markov Models (HMM) • A powerful statistical tool widely used in modeling sequences • Markov Chains: What is the probability of Rain today? • Making Profiles from aligned sequences:
      AYTGGGTACC
      AYT-GGTMCC
      AYCGGG-MC-
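One way to picture "making profiles" is to count, column by column, how often each symbol occurs in the alignment. The sketch below does only that: it computes per-column symbol frequencies for the three aligned sequences on the slide, ignoring gaps. A real profile HMM (as built by HMMER, for instance) also models insert and delete states, transition probabilities and pseudocounts, none of which are shown here.

```python
from collections import Counter

alignment = ["AYTGGGTACC", "AYT-GGTMCC", "AYCGGG-MC-"]  # from the slide


def column_profile(aligned, gap="-"):
    """Per-column symbol frequencies of an alignment, gaps excluded."""
    profile = []
    for column in zip(*aligned):
        residues = [c for c in column if c != gap]
        counts = Counter(residues)
        profile.append({sym: n / len(residues) for sym, n in counts.items()})
    return profile


profile = column_profile(alignment)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

A query can then be scored position by position against these frequencies, which is the intuition behind matching a sequence to its best profile.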
  • 23. Carbohydrate Metabolism Amino acid metabolism Nucleotide metabolism HMM Profile Database QUERY SEQUENCE Prediction Based on Best Profile Match
  • 24. The Pfam Protein Families Database • A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs) • Identification of domains that occur within proteins can provide insights into their function. Steps used for building of Pfam: • Manually curated collection of protein families (3,071 families) • Each curated family is represented by seed and full alignment • Building HMM profiles using HMMER 3.0 • Widely used for identification of protein structure and function. Marco Punta et al., Nucleic Acids Res, 2011
  • 25. Naive Bayes Classifier 1. Simple probabilistic classifier based on Bayes’ theorem 2. Goal is to determine the most probable hypothesis, using the prior probability of the class and the likelihood of X given that class. [Figure: Class 1 and Class 2 in the X1-X2 plane] Kohenen J. et al. In Silico Biol 2009;9(1-2):23-34.
  • 26. Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into New Bacterial Taxonomy, Qiong Wang et al., Appl Environ Microbiol, 2007. Algorithm: Word sizes between 6 and 9 bases, e.g. AUGCGUCAGCUCGAUCGAUCUA gives AUGCGUCA, UGCGUCAG, GCGUCAGC, CGUCAGCU, ... Word-specific priors: Pi = [n(wi) + 0.5]/(N + 1). Genus-specific conditional probabilities: P(wi|G) = [m(wi) + Pi]/(M + 1). Naive Bayesian assignment: P(G|S) = P(S|G) * P(G)/P(S). Bootstrap confidence estimation: For each query sequence
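The formulas on this slide can be sketched in code as follows. This is a toy illustration, not the RDP Classifier itself: the two "genera" and their sequences are hypothetical, the word size is fixed at 8, P(G) is assumed equal for all genera and P(S) is treated as a constant (so both drop out of the comparison), and the bootstrap confidence step is left out.

```python
import math

WORD_SIZE = 8  # one value from the 6-9 range given on the slide


def words(seq, k=WORD_SIZE):
    """Set of all overlapping k-base words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training):
    """training maps genus name -> list of sequences."""
    all_seqs = [s for seqs in training.values() for s in seqs]
    n_total = len(all_seqs)                      # N on the slide
    n = {}                                       # n(wi): sequences containing wi
    for s in all_seqs:
        for w in words(s):
            n[w] = n.get(w, 0) + 1
    prior = {w: (c + 0.5) / (n_total + 1) for w, c in n.items()}   # Pi
    genus_counts = {}
    for genus, seqs in training.items():
        m = {}                                   # m(wi): genus sequences containing wi
        for s in seqs:
            for w in words(s):
                m[w] = m.get(w, 0) + 1
        genus_counts[genus] = (m, len(seqs))     # len(seqs) is M on the slide
    return prior, genus_counts, n_total


def classify(query, prior, genus_counts, n_total):
    """Return the genus with the highest log P(S|G) (equal P(G) assumed)."""
    best, best_score = None, -math.inf
    for genus, (m, m_total) in genus_counts.items():
        score = 0.0
        for w in words(query):
            p_i = prior.get(w, 0.5 / (n_total + 1))   # unseen word: n(wi) = 0
            score += math.log((m.get(w, 0) + p_i) / (m_total + 1))
        if score > best_score:
            best, best_score = genus, score
    return best


# Hypothetical toy training set; real training uses thousands of 16S sequences.
toy = {"GenusA": ["AUGCGUCAGCUCGAUCGAUCUA"],
       "GenusB": ["GGGCCCUUUAAAGGGCCCUUUAAA"]}
model = train(toy)
print(classify("GCGUCAGCUCGAUC", *model))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow when a query contains many words.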
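  • 27. Classification and Regression Trees (CART) [Figure: a decision tree with yes/no splits X1 > C1, X2 > C2, X1 > C3 and X2 > C4 leading to classes 1 and 2, next to the corresponding partition of the X1-X2 plane at C1, C2, C3, C4]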
  • 28. Random Forest • Collection of unpruned CARTs • Bagging: avoids overfitting • Improves prediction accuracy • Encourages diversity among the trees. [Figure: input X passed to Tree 1, Tree 2, Tree 3] Svetnik V. et al., J Chem Inf Comput Sci 2003 Nov-Dec;43(6):1947-58
  • 29. Features of Random Forest o A cross validation procedure is built into random forest, as each tree in the forest has its own training (bootstrap) and test data (OOB data) o The OOB error rate gives the overall percentage of misclassification o Calculates the important features for the classification
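The split into bootstrap (training) and out-of-bag (test) data can be simulated directly. The sketch below only draws the sample indices; it does not grow any trees. The sample size, number of repetitions and seed are arbitrary choices for the illustration.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are out-of-bag (OOB)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 1000
fractions = []
for _ in range(200):     # one bootstrap sample per hypothetical tree
    in_bag, oob = bootstrap_split(n, rng)
    fractions.append(len(oob) / n)

mean_oob = sum(fractions) / len(fractions)
print(round(mean_oob, 3))
```

On average a little over one third of the samples (about 1/e, or 37%) are left out of each bootstrap sample, so every tree has its own held-out data for estimating the error rate.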
  • 30. Query Sequence → Feature Extraction (• Dipeptide Frequency, e.g. AD: 0.08, RH: 0.02 • Amino Acid Composition, e.g. A: 0.14, F: 0.05) → MODEL [Figure: three trees t1, t2, t3, each splitting the classes A, B, C in a different order] → Classification Based on the Majority of votes: Carbohydrate Metabolism, Amino acid metabolism, Nucleotide metabolism
  • 31. Prediction of protein-RNA binding site using Random Forest, Zhi Ping Liu et al., Bioinformatics, 2010 • Protein-RNA interaction plays a key role in a number of biological processes. Dataset: 339 protein-RNA complexes from RsiteDB. Entangle was used to define the interaction site between protein chain and RNA. Features: interaction propensity, hydrophobicity, relative accessible surface area, secondary structure, conservation score and side chain environment
  • 32. Machine Learning methods are becoming popular for Biological Data Analysis [Figures: number of publications per year for SVM (1976-2013), HMM (1976-2013) and Random Forest (2003-2013), each on a scale of 0 to 700; source: http://www.ncbi.nlm.nih.gov/pubmed]
  • 33. Flow of Presentation: Introduction; Beginning of Genomics; Sequencing era; Metagenomics; The Conventional Approach; Machine learning approaches and their implementations; SVM; HMM; Naive Bayes; Random Forest; Work done; Future directions
  • 34. Implementation of Machine Learning for the Analysis of Metagenomic Data in my Recent Projects : A fast and accurate functional classifier of genomic and metagenomic sequences
  • 35. Routine task for metagenomic analysis: Metagenome → Sequencing, assembly and ORF prediction (ORF1, ORF2) → Functional Class and Functional Annotation. METHODOLOGY: eggNOG database was used. Examples (Class / Group / Annotation): O / Cellular Processes and Signaling / Serine-type endopeptidase; J / Information Storage and Processing / tRNA synthetase. 2.3 million sequences were divided into 22 functional classes. Dipeptides were used as input features for optimization and training of the Random Forest. The final Random Forest model was integrated with RAPsearch2. Manuscript Submitted, 2014
  • 36. Stand alone server Query Sequence Genomic Metagenomic Random Forest RAPsearch Functional Class Prediction Functional Annotation
  • 37. : A Tool for Fast and Accurate Taxonomic Classification of 16S rRNA Hypervariable Regions in Metagenomic Datasets
  • 38. Routine task for metagenomic analysis: Metagenome → Sequencing of either HVR or complete 16S → Taxonomic Classification. 16S rRNA: marker gene to identify microbial species. METHODOLOGY: Greengenes database was used. Sequences for hypervariable regions were extracted and grouped according to taxonomic information. 4-mer nucleotide composition was used as the input feature for training and optimization of the RF. Sequences discarded during clustering and real metagenomic 16S sequences were used for testing. Manuscript Submitted, 2014
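A 4-mer nucleotide composition vector can be computed as sketched below. The function name, the handling of ambiguous bases and the example sequence are assumptions made for this illustration; the slide does not describe how the features were normalized.

```python
from itertools import product


def kmer_composition(seq, k=4, alphabet="ACGT"):
    """Relative frequency of every k-mer over the alphabet (4**k features)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:        # skip words with ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


features = kmer_composition("ACGTACGTAC")  # hypothetical short read
print(len(features))
```

For k = 4 this gives 256 features per sequence regardless of its length, so reads covering different hypervariable regions can be fed to the same classifier.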
  • 39.
  • 40. Flow of Presentation: Introduction; Beginning of Genomics; Sequencing era; Metagenomics; The Conventional Approach; Machine learning approaches and their implementations; SVM; HMM; Naive Bayes; Random Forest; Work done; Future directions
  • 41. Future Directions • Analysis of metagenomic data generated from the laboratory projects • Implementation of machine learning in the analyses of metagenomic data • Metabolic pathway analysis and reconstruction
  • 42. Acknowledgement •Thesis Supervisor : Dr. Vineet Sharma •Lab Members: •Dr. Sanjiv Kumar •Darshan Dhakan •Ankit Gupta •Rituja Saxena •Parul Milttal •Vishnu Prasoodanan •Harish K •Nikhil Chuadhary •IISER Bhopal for providing the fellowship for doctoral research

Editor's notes

  1. The history of genomics started when DNA was first isolated by …; later the term genome was given by …. As you all know, the genome is an organism's complete set of DNA, which contains its hereditary information. The history of modern genomics began in the 1970s, when Sanger first reported his method for determining the order of nucleotides in DNA using chain-terminating nucleotide analogues. In 1982 the first bacteriophage genome, around 48 kb in size, was determined using the shotgun digest method, before the first automated DNA sequencer arrived. From the sequence, reading frames for 46 genes were clearly identified. There was a long wait after the bacteriophage genome sequence, and then the first free-living bacterium was sequenced in 1995; the complete genome was around 1,800 kb. It took more than 100 years from the time DNA was first isolated. The sequencing of H. influenzae gave new direction to genome sequencing, and at the same time several large-scale genome sequencing projects for higher eukaryotes were started and completed. Only 6 years later, in 2001, the sequencing of the human genome was completed; it was the largest scientific effort in the history of mankind. The human genome sequence consists of 2.91 billion base pairs, and the number of genes in it was estimated at around 30,000 to 40,000. The human genome was 25 times as large as any previously sequenced genome and eight times as large as the sum of all such genomes. The project cost around 3 billion USD and took 10 years to complete. Sanger sequencing was used for all genomes, from lower to higher eukaryotes, and that was the major bottleneck in genome sequencing analysis because its time and cost are very high. After the successful completion of the Human Genome Project, several new approaches reached the market, which started a new era of sequencing.
* Lambda is a bacterial virus, or bacteriophage, that infects the bacterial species Escherichia coli (E. coli).
  2. NGS brought a spike in genomic analysis by speeding up the analysis and reducing the cost. Still, the majority of the microbes on earth remain unknown because most of them cannot be cultured.
  3. Only a very small fraction of the microbes found in nature have been grown in pure culture, so we lack a comprehensive view of the genetic diversity present on Earth. An approach to this problem has emerged, called metagenomics or environmental genomics. The Sargasso Sea study shed new light on the diversity of life on Earth. A total of more than 1.6 Gb of sequence from Sargasso Sea samples yielded 1.2 million previously unknown gene sequences. Before the analysis of the Sargasso Sea data, the NCBI non-redundant amino acid (nr) dataset contained some 1.5 million peptides, about 630,000 of which were classified as bacterial.
  4. The size of the genomic and metagenomic databases is increasing rapidly. This large amount of data requires efficient tools for fast and accurate analysis. In both genomic and metagenomic analysis, the common steps after sequencing are: 1. assembly, 2. annotation and 3. comparison. I will not talk much about genomic analysis here; I will mainly talk about metagenomic analysis of high-throughput data.
  5. Any metagenomic analysis mainly focuses on two points. The first is who is out there: in this type of analysis the prime focus is to find the organisms present in the specific environment, the proportion of the individual species present, and who is dominating in that environment, and then to ask whether the dominance of a particular type of organism is what makes this environment unique and different from others. The second point focuses on the functional part: which genes are present in that environment, what kind of function they perform and in which pathways they are present. Using the abundance of functions and pathways in the particular environment, we try to find the number of unique functions and unique pathways present in the environment of our choice. To answer these two questions, the analysis after sequencing follows sequential steps, which I will discuss in my next slide. Together, these give a new understanding of the numbers and abundance of the microbial community and of how these parameters change in response to external stimuli.
  8.  In the study of complex communities, it is often necessary to address the question of how much sequence is enough to understand a community and to carry out comparative analyses of related communities. In many cases, this information can be obtained by applying various methods based on 16S rRNA sequence that can reveal a tremendous amount of information about microbial diversity and abundance. Metagenomics projects differ from traditional microbial-sequencing projects in many respects.
  9. SVMs find the maximal margin which separates two classes and then output the hyperplane separator at the center of the margin. When the data is not linearly separable, all the points are projected into a higher dimension using a mapping, in the hope that separability will improve. This mapping is provided by the kernel function.
  11. Accurately identifying genes from metagenomic fragments is one of the most fundamental issues. Here I am discussing MetaGUN, a novel gene prediction method for metagenomic fragments based on the SVM machine learning approach.
  12. π(F) represents the vector of initial probabilities.
  13. The quality of the seed alignment is the crucial factor in determining the quality of the Pfam resource, influencing not only all data generated within the database but also the outcome of external searches that use our profile HMMs.
  14. Naive Bayes makes strong independence assumptions between predictors. Bayesian decision theory came long before; it was studied in the field of statistical theory and more specifically in the field of pattern recognition. Most probable hypothesis: for given data d plus initial knowledge about the prior probabilities. Prior probabilities reflect background knowledge. P(c|x) is the posterior probability of the class (target) given the predictor (attribute). P(c) is the prior probability of the class. P(x|c) is the likelihood, which is the probability of the predictor given the class. P(x) is the prior probability of the predictor.
  15. The Ribosomal Database Project (RDP) Classifier, a naive Bayesian classifier, can rapidly and accurately classify bacterial 16S rRNA sequences into the higher-order taxonomy. 2. It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. 3. Type sequences with Bergey’s taxonomy (average sequence length 1,460 bases, with a range of 1,200 to 1,833 bases). 4. Complete rRNA database sequences with NCBI’s taxonomy: near-full-length (1,200 bases) 16S rRNA sequences were obtained, and the taxonomic information for these databases was obtained from GenBank. 5. Let n(wi) be the number of sequences containing subsequence wi. 6. The conditional probability that a member of G contains wi was estimated with the equation.. 7. The probability that an unknown query sequence, S, is a member of genus G is....... where P(G) is the prior probability of a sequence being a member of G and P(S) is the overall probability of observing sequence S from any genus. 8. Overall classification accuracy by query size.
  16. In the standard classification situation we have observations from two or more known classes and want to develop a rule for assigning current and new observations into classes using numerical and/or categorical predictor variables. Classification trees build these rules by recursive binary partitioning into regions that are increasingly homogeneous with respect to the class variable. These homogeneous regions are called nodes. At each step in the fitting of a classification tree, an optimization is carried out to select a node and a particular variable cutoff or group of codes that results in the most homogeneous subgroups of the data. Root node: the entry point for the collection of data. Inner node: a question is asked about the data, with one child node per possible answer. Leaf node: corresponds to the decision to take if reached.
  17. RF uses an ensemble of decision trees based on the samples, their class designations and the variables, since results from ensemble models are much more satisfactory than those of a single model. Each tree is created according to the implemented algorithm and, if pruning is enabled, an additional step looks at which nodes or branches can be removed without affecting the performance too much. 2. There are two kinds of ensembles: bagging and boosting. In bagging we do not look back at the earlier trees, while boosting considers the earlier trees and strives to compensate for their weaknesses (which can lead to overfitting of the data). RF is an example of a bagging method. 3. It is one of the most popular machine learning methods nowadays because: 1. it is a versatile classification algorithm, which makes it suitable for the analysis of large data sets; 2. it has higher prediction accuracy and provides information on variable importance; 3. it is very effective, fast and easy to use. Algorithm: 1. Bootstrapping is used to grow the classification trees in the forest. If there are N samples, N cases are drawn at random, with replacement, from the original data; about 2/3 of the samples end up in the training set of a tree and the remaining 1/3, called out-of-bag (OOB), are used for prediction. The resulting error rate is called the out-of-bag error rate. 2. If there are M input variables, a number m << M is specified such that at each node m variables are selected randomly and evaluated for their ability to split the data. The variable resulting in the largest decrease in impurity is chosen to separate the sample data at each parent node. Here the impurity measure is the Gini impurity; a decrease in Gini impurity corresponds to an increase in the amount of order in the sample classes introduced by a split. 3. Random selection of the variables for splitting ensures low correlation between the trees and prevents overfitting of the RF model. Last point: every classification tree in the forest casts an unweighted vote for the sample, and finally the majority of the votes determines the class of the sample.
A single tree in the forest is a weak classifier because it trains on a subset of the data; that is why the combination of all the trees in the forest is a strong classifier. 4. The training process is complete when the forest is fully grown; the trained model can then be used to predict the classes of unknown samples.
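The split criterion described in this note can be made concrete with a small sketch: compute the Gini impurity of a node and pick the threshold on one variable that gives the largest impurity decrease. The toy values and labels are hypothetical, and the random selection of m candidate variables at each node is not shown.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, impurity decrease) for the best split values <= t vs > t."""
    parent = gini(labels)
    best_t, best_gain = None, 0.0
    for t in sorted(set(values))[:-1]:          # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]   # hypothetical values of one variable
y = [1, 1, 1, 2, 2, 2]               # class labels
threshold, gain = best_threshold(x, y)
print(threshold, gain)
```

Here the parent node has impurity 0.5 and the split at 0.3 produces two pure children, so the whole 0.5 is removed by a single split.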
  18. The expected error rate of the classification of a new sample by a classifier is usually estimated by a cross validation procedure, such as leave-one-out or k-fold cross validation. In RF, the aggregate OOB error rate from all trees serves this purpose. In addition to the internal cross validation, RF also calculates variable importance. The Gini importance of a variable is the sum of the Gini impurity decreases at every node in the forest where that variable is used for splitting. To calculate the permutation variable importance, the values of the predictor variable are randomly shuffled to break the association between the response and the predictor values. The prediction accuracy after permutation is subtracted from the prediction accuracy before permutation and averaged over all trees in the forest to give the permutation importance value. If the predictor never had any meaningful association with the response, shuffling its values will produce very little or no change in model accuracy. On the other hand, if the predictor was strongly correlated with the response, permutation should create a large drop in accuracy. These two measures help to find the variables most strongly related to the response and to find a small number of variables that give good prediction.
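The permutation importance described here can be sketched as below. This is a simplified illustration: a hand-written rule stands in for a trained forest, the importance is computed on the whole toy data set rather than per tree on OOB samples, and the data, seed and number of repeats are all hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)                      # break the link to the response
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in for a trained forest: it only looks at feature 0."""
    return int(row[0] > 0.5)


data_rng = random.Random(1)
X = [[i / 20, data_rng.random()] for i in range(20)]  # feature 1 is pure noise
y = [toy_model(row) for row in X]
imp = permutation_importance(toy_model, X, y)
print(imp)
```

Shuffling the feature the model relies on lowers its accuracy, while shuffling the noise feature changes nothing, which is the contrast the note describes.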
  19. An example of predicting RNA binding sites. (a) Actual interface residues with RNA in protein 1R3E:A. (b) Predictions are mapped onto the original structure where different prediction catalogs are represented by different colors. (c) Structure of the protein–RNA complex with an example of prediction in the zoomed part. (d) Mutual interaction propensity between the triplets and nucleotides in the protein. Triplets are listed by sliding residues through the protein sequence. The box part corresponds to the values of residues in the zoomed part of (c). (e) Upper panel shows the interface propensity of each amino acid type in the dataset. It is defined as the proportion of an amino acid in interaction sites divided by the proportion of the residue in the dataset (see more in Supplementary Materials). Lower panel shows the interface propensity of binding with RNA for the residues in the protein. The box part corresponds to the values of the zoomed sites.
  20. PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. 590 results for “naive bayes”.