SlideShare ist ein Scribd-Unternehmen logo
1 von 61
Downloaden Sie, um offline zu lesen
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

Pattern Recognition In Clinical Data
Saket Choudhary
Dual Degree Project
Guide: Prof. Santosh Noronha

C G C A T C G A G C T
C G C G T C G A G C T

October 30, 2013

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

I NTRODUCTION
I NTRODUCTION
Objective
S IGNIFICANT M UTATIONS
Next Generation Sequencing
Motivation
Computational Methods for Driver Detection
V IRAL G ENOME D ETECION
Workflow
R EPRODUCIBILITY
Reproducibility
C ONCLUSIONS
Wrapping up

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

O BJECTIVE
BWA
v/s
BWAPSSM

Galaxy
Workflow

Viral
Genome
Integration

Benchmarking
Alignment
tools

Tools
for
Biologists
Reproducible
analysis
Driver
& Passenger
Mutation
Detection

Galaxy
Tools

Better/New
Algorithms

Next Generation
Sequencing
&
Cancer Research

Driver
& Passenger
Mutation
Detection

Lit.
Survey

Ensembl
Method
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

N EXT G ENERATION S EQUENCING

C G C G T C G A G C T A G C A

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

N EXT G ENERATION S EQUENCING

C G C G T C G A G C T A G C A

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

N EXT G ENERATION S EQUENCING

C G C G T C G A G C T A G C A

G C G T∗

C G A G∗

C T A G∗

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

N EXT G ENERATION S EQUENCING
C G C G T C G A G C T A G C A

G C G T∗

C G A G∗

C T A G∗

A G C G C G T C G A G C T A G C A C A

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

NGS: W HY ?

Molecular Approach: Study of variations at the ’base’
level
Low Cost: 1000$ genome
Faster: Quicker than traditional sequencing techniques

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

NGS: W HY ?

Molecular Approach: Study of variations at the ’base’
level
Low Cost: 1000$ genome
Faster: Quicker than traditional sequencing techniques

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

NGS: W HY ?

Molecular Approach: Study of variations at the ’base’
level
Low Cost: 1000$ genome
Faster: Quicker than traditional sequencing techniques

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

NGS: W HERE ?

Study variations, genotype-phenotype association
Look for ’markers of diseases’
Prognosis

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

NGS: W HERE ?

Study variations, genotype-phenotype association
Look for ’markers of diseases’
Prognosis

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

NGS: W HERE ?

Study variations, genotype-phenotype association
Look for ’markers of diseases’
Prognosis

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

NGS: M UTATIONS

3x109 base pairs
We are all 99.9% similar, at DNA level
More than 2 million SNPs
No particular pattern of SNPs
If a certain mutation causes a change in an amino acid, it is
referred to as non synonymous(nsSNV)
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

D RIVERS AND PASSENGERS I
Cancer is known to arise due to mutations
Not all mutations are equally important!

Somatic Mutations
Set of mutations acquired after zygote formation, over and
above the germline mutations

Driver Mutations
Mutations that confer growth advantages to the cell, being
selected positively in the tumor tissue

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

D RIVERS AND PASSENGERS

Drivers are NOT simply loss of function mutations, but more
than that:
Loss of function: Inactivate tumor suppressor proteins
Gain of function: Activates normal genes transforming
them to oncogenes
Drug Resistance Mutations: Mutations that have evolved
to overcome the inhibitory effect of drugs
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

D RIVERS AND PASSENGERS

Drivers are NOT simply loss of function mutations, but more
than that:
Loss of function: Inactivate tumor suppressor proteins
Gain of function: Activates normal genes transforming
them to oncogenes
Drug Resistance Mutations: Mutations that have evolved
to overcome the inhibitory effect of drugs
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

D RIVERS AND PASSENGERS

Drivers are NOT simply loss of function mutations, but more
than that:
Loss of function: Inactivate tumor suppressor proteins
Gain of function: Activates normal genes transforming
them to oncogenes
Drug Resistance Mutations: Mutations that have evolved
to overcome the inhibitory effect of drugs
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

D RIVER M UTATIONS : W HY ?
Identify driver mutations −→ better therapeutic targets
But how does one zero down upon the exact set? −→
experiments are too costly, probably infeasible for 2 million+
SNPs −→ Leverage computational analysis
Low cost of NGS comes with a heavier roadblock of data
analysis
Searching among 2 million+ SNPs is a non-trivial, and a
computationally intensive problem
Softwares have a low consensus ratio amongst them selves
←→ Defining a driver, computationally is non-trivial
However there is no tool that allows one to visualise the
results on an input across the cohort of tools
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

D RIVER M UTATIONS : W HY ?
Identify driver mutations −→ better therapeutic targets
But how does one zero down upon the exact set? −→
experiments are too costly, probably infeasible for 2 million+
SNPs −→ Leverage computational analysis
Low cost of NGS comes with a heavier roadblock of data
analysis
Searching among 2 million+ SNPs is a non-trivial, and a
computationally intensive problem
Softwares have a low consensus ratio amongst them selves
←→ Defining a driver, computationally is non-trivial
However there is no tool that allows one to visualise the
results on an input across the cohort of tools
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

D RIVER M UTATIONS : W HY ?
Identify driver mutations −→ better therapeutic targets
But how does one zero down upon the exact set? −→
experiments are too costly, probably infeasible for 2 million+
SNPs −→ Leverage computational analysis
Low cost of NGS comes with a heavier roadblock of data
analysis
Searching among 2 million+ SNPs is a non-trivial, and a
computationally intensive problem
Softwares have a low consensus ratio amongst them selves
←→ Defining a driver, computationally is non-trivial
However there is no tool that allows one to visualise the
results on an input across the cohort of tools
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

D RIVER M UTATIONS : W HY ?
Identify driver mutations −→ better therapeutic targets
But how does one zero down upon the exact set? −→
experiments are too costly, probably infeasible for 2 million+
SNPs −→ Leverage computational analysis
Low cost of NGS comes with a heavier roadblock of data
analysis
Searching among 2 million+ SNPs is a non-trivial, and a
computationally intensive problem
Softwares have a low consensus ratio amongst them selves
←→ Defining a driver, computationally is non-trivial
However there is no tool that allows one to visualise the
results on an input across the cohort of tools
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

M ACHINE L EARNING I
Two datasets:
Training: Labeled dataset, containing a table of features
with mutations labelled as ”drivers/passengers”
Test: ’Learning’ from training dataset, test the prediction
model
Table: Training Dataset

Chromosome
1
1
2
.
.
.

Position
27822
27832
47842
.
.
.

Ref
A
T
G
.
.
.

Alt
G
G
C
.
.
.

Type
Driver
Driver
Passenger
.
.
.
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

M ACHINE L EARNING II

Table: Test Dataset

Chromosome
1
1

Position
27824
47832

Ref
A
T

Alt
G
G

Type
?
?

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

M ACHINE L EARNING : F EATURE S ELECTION I
Machine Learning relies on a set of features for training
Redundant features should be avoided
CHASM [1] makes use of
p(Xi ) represents the probability of occurrence of an event Xi
Considering a series of events X1 , X2 , X3 ...,Xn analogous ’series
of packets’ in communication theory , the information received
at each step can be quantified on a log scale by:
1
= −log2 (p(Xi ))
log2 (Xi )

(1)

The expected value of information from a series of events is
called shannon entropy: H(X):
H(X) = −

p(Xi ) log2 p(Xi )
i

(2)
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

M ACHINE L EARNING : F EATURE S ELECTION II
Mutual Information between two random variables X, Y is
defined as the amount of information gained about random
variable X due to additional information gained from the
second, Y:
I(X, Y) = H(X) − H(X|Y)

(3)

Here:
X: Class Label[Driver/Passenger]
Y: Predictive Feature
and hence I(X, Y) represents how much information was
gained about the class label Y from knowledge of a feature X
Simplifying :
I(X, Y) =

p(x, y)log2

p(x, y)
p(x)p(y)

(4)
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

F UNCTIONAL I MPACT I

If a certain mutation confers an advantage to the cell in
terms of replication rate, it is probably going to be selected
while all those mutations that reduce its fitness have a
higher chance of being eliminated from the population.
Certain residues in a MSA of homologous sequences are
more conserved than others. A highly conserved if
mutated is possibly going to cost a lot since what had
’evolved’ is disturbed!
Scores can be assigned based on this ”conservation”
parameter.
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

F UNCTIONAL I MPACT II
Figure: SIFT [?] algorithm

R EPRODUCIBILITY

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

Some of the common tools/algorithms used for driver
mutation prediction:
SIFT
Polyphen
Mutation Assesor
TransFIC
Condel

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

F RAMEWORK FOR COMPARING VARIOUS TOOLS I
Different tools use different formats, give different outputs
for similar input
Running analysis on multiple tools −→ keep shifting data
formats
Concordance?

Polyphen2 Input
chr1:888659 T/C
chr1:1120431 G/A
chr1:1387764 G/A
chr1:1421991 G/A
chr1:1599812 C/T
chr1:1888193 C/A
chr1:1900186 T/C
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

F RAMEWORK FOR COMPARING VARIOUS TOOLS II

SIFT Input
1,888659,T,C
1,1120431,G,A
1,1387764,G,A
1,1421991,G,A
1,1599812,C,T
1,1888193,C,A
1,1900186,T,C
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

D RIVER M UTATIONS : T OOLS DON ’ T AGREE

X Axis: Condel Score Y Axis: MA Score

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

Solution?:
Galaxy[?], an open source web-based platform for
bioinformatics, makes it possible to represent the entire data
analysis pipeline in an intuitive graphical interface
Figure: Galaxy Workflow polyphen2 algorithm
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

Run all tools in one go:
Figure: Run all tools

R EPRODUCIBILITY

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

Compare all tools:
Figure: Compare all tools

R EPRODUCIBILITY

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

V IRAL G ENOME D ETECION
Cervical cancers have been proven to be associated with
Human Papillomavirus(HPV)
Cervical cancer datasets from Indian women was put through
an analysis to detect :
1. Any possible HPV integration
2. Sites of HPV integration
Who Cares?
Replacing whole genome sequencing, by targeted
sequencing at the sites where these virus have been
detected in a cohort of samples, thus speeding up the
whole process.
I NTRODUCTION

S IGNIFICANT M UTATIONS

RawData

V IRAL G ENOME D ETECION

Align data
to human
genome
reference

R EPRODUCIBILITY

Extract
unmapped
regions

Align
unmapped
regions
to Virus
genome

Extract
mapped
BLAST
regions

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

RawData

V IRAL G ENOME D ETECION

Align data
to human
genome
reference

R EPRODUCIBILITY

Extract
unmapped
regions

Align
unmapped
regions
to Virus
genome

Extract
mapped
BLAST
regions

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

RawData

V IRAL G ENOME D ETECION

Align data
to human
genome
reference

R EPRODUCIBILITY

Extract
unmapped
regions

Align
unmapped
regions
to Virus
genome

Extract
mapped
BLAST
regions

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

RawData

V IRAL G ENOME D ETECION

Align data
to human
genome
reference

R EPRODUCIBILITY

Extract
unmapped
regions

Align
unmapped
regions
to Virus
genome

Extract
mapped
BLAST
regions

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

RawData

V IRAL G ENOME D ETECION

Align data
to human
genome
reference

R EPRODUCIBILITY

Extract
unmapped
regions

Align
unmapped
regions
to Virus
genome

Extract
mapped
BLAST
regions

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

RawData

V IRAL G ENOME D ETECION

Align data
to human
genome
reference

R EPRODUCIBILITY

Extract
unmapped
regions

Align
unmapped
regions
to Virus
genome

Extract
mapped
BLAST
regions

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

RawData

V IRAL G ENOME D ETECION

Align data
to human
genome
reference

R EPRODUCIBILITY

Extract
unmapped
regions

Align
unmapped
regions
to Virus
genome

Extract
mapped
BLAST
regions

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

G ALAXY W ORKFLOW

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

Figure: Aligned HPV genomes

C ONCLUSIONS
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

R EPRODUCIBILITY

In pursuit of novel ’discovery’, standardizing the data
analysis pipeline is often ignored, leading to dubious
conclusions
Analysis should be reproducible and above all, correct
Parameter’s values can change the results by a big factor,
they need to be documented/logged
Garbage in, Garbage out
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

C ONCLUSIONS
With the Galaxy tool box for identification of significant
mutations and the study of the science behind the methods, the
next steps would be to:
Open source the toolbox to the community: A tool makes
little sense if it is not in a usable form, community
feedback will be used to add more tools and improve the
existing ones
A new method for driver mutation prediction: all the
methods have low level of concordance. A new method
that takes into account the available data at all levels :
mutations, transcriptome and micro array data is possible.
With the Galaxy toolbox in place, it would be possible to
integrate information at various levels
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

F UTURE W ORK

Develop an algorithm that integrates machine learning
approach with functional approach by zeroing down upon
only those attributes that are known to have an impact
The algorithm would also account for information at other
levels: RNA expressions, Clinical data.
Integrating information at all levels would provide a
deeper insight
The developed Galaxy toolbox will be used a the basic
framework for integrating information
I NTRODUCTION

S IGNIFICANT M UTATIONS

V IRAL G ENOME D ETECION

R EPRODUCIBILITY

C ONCLUSIONS

R EFERENCES I

Hannah Carter, Sining Chen, Leyla Isik, Svitlana
Tyekucheva, Victor E Velculescu, Kenneth W Kinzler, Bert
Vogelstein, and Rachel Karchin.
Cancer-specific high-throughput annotation of somatic
mutations: computational prediction of driver missense
mutations.
Cancer research, 69(16):6660–6667, 2009.

Weitere ähnliche Inhalte

Andere mochten auch (7)

gsoc2012demo
gsoc2012demogsoc2012demo
gsoc2012demo
 
testppt
testppttestppt
testppt
 
testppt
testppttestppt
testppt
 
testppt
testppttestppt
testppt
 
ISG-Presentacion
ISG-PresentacionISG-Presentacion
ISG-Presentacion
 
ISG-Presentacion
ISG-PresentacionISG-Presentacion
ISG-Presentacion
 
Global_Health_and_Intersectoral_Collaboration
Global_Health_and_Intersectoral_CollaborationGlobal_Health_and_Intersectoral_Collaboration
Global_Health_and_Intersectoral_Collaboration
 

Ähnlich wie Pattern Recognition in Clinical Data

Genomics, Bioinformatics, and Pathology
Genomics, Bioinformatics, and PathologyGenomics, Bioinformatics, and Pathology
Genomics, Bioinformatics, and PathologyDan Gaston
 
Cancer Genomics Visualization across Scales: Nucleotides to Cohorts
Cancer Genomics Visualization across Scales: Nucleotides to CohortsCancer Genomics Visualization across Scales: Nucleotides to Cohorts
Cancer Genomics Visualization across Scales: Nucleotides to CohortsNils Gehlenborg
 
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...Candy Smellie
 
Identification of cancer drivers across tumor types
Identification of cancer drivers across tumor typesIdentification of cancer drivers across tumor types
Identification of cancer drivers across tumor typesNuria Lopez-Bigas
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Elia Brodsky
 
CDAC 2018 Cantor liquid biopsies
CDAC 2018 Cantor liquid biopsiesCDAC 2018 Cantor liquid biopsies
CDAC 2018 Cantor liquid biopsiesMarco Antoniotti
 
Genomics 2015 Keynote - Utilizing cancer sequencing in the clinic
Genomics 2015 Keynote - Utilizing cancer sequencing in the clinicGenomics 2015 Keynote - Utilizing cancer sequencing in the clinic
Genomics 2015 Keynote - Utilizing cancer sequencing in the clinicAndreas Scherer
 
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICS
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICSPROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICS
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICSLubna MRL
 
Genome editing comes of age
Genome editing comes of ageGenome editing comes of age
Genome editing comes of ageJan Hryca
 
Web cast cancer gene panels march 11 2015
Web cast cancer gene panels march 11 2015Web cast cancer gene panels march 11 2015
Web cast cancer gene panels march 11 2015Andreas Scherer
 
Cancer Gene Panels
Cancer Gene PanelsCancer Gene Panels
Cancer Gene PanelsGolden Helix
 
CSH SC 2015 - PAGODA talk
CSH SC 2015 - PAGODA talkCSH SC 2015 - PAGODA talk
CSH SC 2015 - PAGODA talkJean Fan
 
Visual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationVisual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationNils Gehlenborg
 
Genome Editing Comes of Age
Genome Editing Comes of AgeGenome Editing Comes of Age
Genome Editing Comes of AgeCandy Smellie
 
DDP Stage 2 Presentation
DDP Stage 2 PresentationDDP Stage 2 Presentation
DDP Stage 2 PresentationSaket Choudhary
 
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopLopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopNuria Lopez-Bigas
 
Bioinformatics
BioinformaticsBioinformatics
BioinformaticsIncedo
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingStephen Turner
 

Ähnlich wie Pattern Recognition in Clinical Data (20)

Genomics, Bioinformatics, and Pathology
Genomics, Bioinformatics, and PathologyGenomics, Bioinformatics, and Pathology
Genomics, Bioinformatics, and Pathology
 
Cancer Genomics Visualization across Scales: Nucleotides to Cohorts
Cancer Genomics Visualization across Scales: Nucleotides to CohortsCancer Genomics Visualization across Scales: Nucleotides to Cohorts
Cancer Genomics Visualization across Scales: Nucleotides to Cohorts
 
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ...
 
Identification of cancer drivers across tumor types
Identification of cancer drivers across tumor typesIdentification of cancer drivers across tumor types
Identification of cancer drivers across tumor types
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
 
CDAC 2018 Cantor liquid biopsies
CDAC 2018 Cantor liquid biopsiesCDAC 2018 Cantor liquid biopsies
CDAC 2018 Cantor liquid biopsies
 
Genomics 2015 Keynote - Utilizing cancer sequencing in the clinic
Genomics 2015 Keynote - Utilizing cancer sequencing in the clinicGenomics 2015 Keynote - Utilizing cancer sequencing in the clinic
Genomics 2015 Keynote - Utilizing cancer sequencing in the clinic
 
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICS
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICSPROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICS
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICS
 
Genome editing comes of age
Genome editing comes of ageGenome editing comes of age
Genome editing comes of age
 
Web cast cancer gene panels march 11 2015
Web cast cancer gene panels march 11 2015Web cast cancer gene panels march 11 2015
Web cast cancer gene panels march 11 2015
 
Cancer Gene Panels
Cancer Gene PanelsCancer Gene Panels
Cancer Gene Panels
 
CSH SC 2015 - PAGODA talk
CSH SC 2015 - PAGODA talkCSH SC 2015 - PAGODA talk
CSH SC 2015 - PAGODA talk
 
Visual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient StratificationVisual Exploration of Clinical and Genomic Data for Patient Stratification
Visual Exploration of Clinical and Genomic Data for Patient Stratification
 
Genome Editing Comes of Age
Genome Editing Comes of AgeGenome Editing Comes of Age
Genome Editing Comes of Age
 
DDP Stage 2 Presentation
DDP Stage 2 PresentationDDP Stage 2 Presentation
DDP Stage 2 Presentation
 
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopLopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencing
 
Oncogenomics 2013
Oncogenomics 2013Oncogenomics 2013
Oncogenomics 2013
 
Bio conference live 2013
Bio conference live 2013Bio conference live 2013
Bio conference live 2013
 

Mehr von Saket Choudhary (20)

testppt
testppttestppt
testppt
 
testppt
testppttestppt
testppt
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Testslideshare1
Testslideshare1Testslideshare1
Testslideshare1
 
Testslideshare1
Testslideshare1Testslideshare1
Testslideshare1
 
Testslideshare1
Testslideshare1Testslideshare1
Testslideshare1
 
Testslideshare1
Testslideshare1Testslideshare1
Testslideshare1
 
Testslideshare1
Testslideshare1Testslideshare1
Testslideshare1
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 
Training_Authoring
Training_AuthoringTraining_Authoring
Training_Authoring
 

Kürzlich hochgeladen

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 

Kürzlich hochgeladen (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 

Pattern Recognition in Clinical Data

  • 1. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY Pattern Recognition In Clinical Data Saket Choudhary Dual Degree Project Guide: Prof. Santosh Noronha C G C A T C G A G C T C G C G T C G A G C T October 30, 2013 C ONCLUSIONS
  • 2. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY I NTRODUCTION I NTRODUCTION Objective S IGNIFICANT M UTATIONS Next Generation Sequencing Motivation Computational Methods for Driver Detection V IRAL G ENOME D ETECION Workflow R EPRODUCIBILITY Reproducibility C ONCLUSIONS Wrapping up C ONCLUSIONS
  • 3. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 4. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 5. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 6. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 7. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 8. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 9. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 10. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 11. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 12. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 13. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 14. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 15. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS O BJECTIVE BWA v/s BWAPSSM Galaxy Workflow Viral Genome Integration Benchmarking Alignment tools Tools for Biologists Reproducible analysis Driver & Passenger Mutation Detection Galaxy Tools Better/New Algorithms Next Generation Sequencing & Cancer Research Driver & Passenger Mutation Detection Lit. Survey Ensembl Method
  • 16. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY N EXT G ENERATION S EQUENCING C G C G T C G A G C T A G C A C ONCLUSIONS
  • 17. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY N EXT G ENERATION S EQUENCING C G C G T C G A G C T A G C A C ONCLUSIONS
  • 18. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY N EXT G ENERATION S EQUENCING C G C G T C G A G C T A G C A G C G T∗ C G A G∗ C T A G∗ C ONCLUSIONS
  • 19. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY N EXT G ENERATION S EQUENCING C G C G T C G A G C T A G C A G C G T∗ C G A G∗ C T A G∗ A G C G C G T C G A G C T A G C A C A C ONCLUSIONS
  • 20. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY NGS: W HY ? Molecular Approach: Study of variations at the ’base’ level Low Cost: 1000$ genome Faster: Quicker than traditional sequencing techniques C ONCLUSIONS
  • 21. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY NGS: W HY ? Molecular Approach: Study of variations at the ’base’ level Low Cost: 1000$ genome Faster: Quicker than traditional sequencing techniques C ONCLUSIONS
  • 22. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY NGS: W HY ? Molecular Approach: Study of variations at the ’base’ level Low Cost: 1000$ genome Faster: Quicker than traditional sequencing techniques C ONCLUSIONS
  • 23. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY NGS: W HERE ? Study variations, genotype-phenotype association Look for ’markers of diseases’ Prognosis C ONCLUSIONS
  • 24. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY NGS: W HERE ? Study variations, genotype-phenotype association Look for ’markers of diseases’ Prognosis C ONCLUSIONS
  • 25. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY NGS: W HERE ? Study variations, genotype-phenotype association Look for ’markers of diseases’ Prognosis C ONCLUSIONS
  • 26. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS NGS: M UTATIONS 3x109 base pairs We are all 99.9% similar, at DNA level More than 2 million SNPs No particular pattern of SNPs If a certain mutation causes a change in an amino acid, it is referred to as non synonymous(nsSNV)
  • 27. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY D RIVERS AND PASSENGERS I Cancer is known to arise due to mutations Not all mutations are equally important! Somatic Mutations Set of mutations acquired after zygote formation, over and above the germline mutations Driver Mutations Mutations that confer growth advantages to the cell, being selected positively in the tumor tissue C ONCLUSIONS
  • 28. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS D RIVERS AND PASSENGERS Drivers are NOT simply loss of function mutations, but more than that: Loss of function: Inactivate tumor suppressor proteins Gain of function: Activates normal genes transforming them to oncogenes Drug Resistance Mutations: Mutations that have evolved to overcome the inhibitory effect of drugs
  • 29. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS D RIVERS AND PASSENGERS Drivers are NOT simply loss of function mutations, but more than that: Loss of function: Inactivate tumor suppressor proteins Gain of function: Activates normal genes transforming them to oncogenes Drug Resistance Mutations: Mutations that have evolved to overcome the inhibitory effect of drugs
  • 30. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS D RIVERS AND PASSENGERS Drivers are NOT simply loss of function mutations, but more than that: Loss of function: Inactivate tumor suppressor proteins Gain of function: Activates normal genes transforming them to oncogenes Drug Resistance Mutations: Mutations that have evolved to overcome the inhibitory effect of drugs
  • 31. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS D RIVER M UTATIONS : W HY ? Identify driver mutations −→ better therapeutic targets But how does one zero down upon the exact set? −→ experiments are too costly, probably infeasible for 2 million+ SNPs −→ Leverage computational analysis Low cost of NGS comes with a heavier roadblock of data analysis Searching among 2 million+ SNPs is a non-trivial, and a computationally intensive problem Softwares have a low consensus ratio amongst them selves ←→ Defining a driver, computationally is non-trivial However there is no tool that allows one to visualise the results on an input across the cohort of tools
  • 32. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS D RIVER M UTATIONS : W HY ? Identify driver mutations −→ better therapeutic targets But how does one zero down upon the exact set? −→ experiments are too costly, probably infeasible for 2 million+ SNPs −→ Leverage computational analysis Low cost of NGS comes with a heavier roadblock of data analysis Searching among 2 million+ SNPs is a non-trivial, and a computationally intensive problem Softwares have a low consensus ratio amongst them selves ←→ Defining a driver, computationally is non-trivial However there is no tool that allows one to visualise the results on an input across the cohort of tools
  • 33. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS D RIVER M UTATIONS : W HY ? Identify driver mutations −→ better therapeutic targets But how does one zero down upon the exact set? −→ experiments are too costly, probably infeasible for 2 million+ SNPs −→ Leverage computational analysis Low cost of NGS comes with a heavier roadblock of data analysis Searching among 2 million+ SNPs is a non-trivial, and a computationally intensive problem Softwares have a low consensus ratio amongst them selves ←→ Defining a driver, computationally is non-trivial However there is no tool that allows one to visualise the results on an input across the cohort of tools
  • 34. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS D RIVER M UTATIONS : W HY ? Identify driver mutations −→ better therapeutic targets But how does one zero down upon the exact set? −→ experiments are too costly, probably infeasible for 2 million+ SNPs −→ Leverage computational analysis Low cost of NGS comes with a heavier roadblock of data analysis Searching among 2 million+ SNPs is a non-trivial, and a computationally intensive problem Softwares have a low consensus ratio amongst them selves ←→ Defining a driver, computationally is non-trivial However there is no tool that allows one to visualise the results on an input across the cohort of tools
  • 35. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS M ACHINE L EARNING I Two datasets: Training: Labeled dataset, containing a table of features with mutations labelled as ”drivers/passengers” Test: ’Learning’ from training dataset, test the prediction model Table: Training Dataset Chromosome 1 1 2 . . . Position 27822 27832 47842 . . . Ref A T G . . . Alt G G C . . . Type Driver Driver Passenger . . .
  • 36. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY M ACHINE L EARNING II Table: Test Dataset Chromosome 1 1 Position 27824 47832 Ref A T Alt G G Type ? ? C ONCLUSIONS
  • 37. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS M ACHINE L EARNING : F EATURE S ELECTION I Machine Learning relies on a set of features for training Redundant features should be avoided CHASM [1] makes use of p(Xi ) represents the probability of occurrence of an event Xi Considering a series of events X1 , X2 , X3 ...,Xn analogous ’series of packets’ in communication theory , the information received at each step can be quantified on a log scale by: 1 = −log2 (p(Xi )) log2 (Xi ) (1) The expected value of information from a series of events is called shannon entropy: H(X): H(X) = − p(Xi ) log2 p(Xi ) i (2)
  • 38. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS M ACHINE L EARNING : F EATURE S ELECTION II Mutual Information between two random variables X, Y is defined as the amount of information gained about random variable X due to additional information gained from the second, Y: I(X, Y) = H(X) − H(X|Y) (3) Here: X: Class Label[Driver/Passenger] Y: Predictive Feature and hence I(X, Y) represents how much information was gained about the class label Y from knowledge of a feature X Simplifying : I(X, Y) = p(x, y)log2 p(x, y) p(x)p(y) (4)
  • 39. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS F UNCTIONAL I MPACT I If a certain mutation confers an advantage to the cell in terms of replication rate, it is probably going to be selected while all those mutations that reduce its fitness have a higher chance of being eliminated from the population. Certain residues in a MSA of homologous sequences are more conserved than others. A highly conserved if mutated is possibly going to cost a lot since what had ’evolved’ is disturbed! Scores can be assigned based on this ”conservation” parameter.
  • 40. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION F UNCTIONAL I MPACT II Figure: SIFT [?] algorithm R EPRODUCIBILITY C ONCLUSIONS
  • 41. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY Some of the common tools/algorithms used for driver mutation prediction: SIFT Polyphen Mutation Assesor TransFIC Condel C ONCLUSIONS
  • 42. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS F RAMEWORK FOR COMPARING VARIOUS TOOLS I Different tools use different formats, give different outputs for similar input Running analysis on multiple tools −→ keep shifting data formats Concordance? Polyphen2 Input chr1:888659 T/C chr1:1120431 G/A chr1:1387764 G/A chr1:1421991 G/A chr1:1599812 C/T chr1:1888193 C/A chr1:1900186 T/C
  • 43. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS F RAMEWORK FOR COMPARING VARIOUS TOOLS II SIFT Input 1,888659,T,C 1,1120431,G,A 1,1387764,G,A 1,1421991,G,A 1,1599812,C,T 1,1888193,C,A 1,1900186,T,C
  • 44. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY D RIVER M UTATIONS : T OOLS DON ’ T AGREE X Axis: Condel Score Y Axis: MA Score C ONCLUSIONS
  • 45. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS Solution?: Galaxy[?], an open source web-based platform for bioinformatics, makes it possible to represent the entire data analysis pipeline in an intuitive graphical interface Figure: Galaxy Workflow polyphen2 algorithm
  • 46. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION Run all tools in one go: Figure: Run all tools R EPRODUCIBILITY C ONCLUSIONS
  • 47. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION Compare all tools: Figure: Compare all tools R EPRODUCIBILITY C ONCLUSIONS
  • 48. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS V IRAL G ENOME D ETECION Cervical cancers have been proven to be associated with Human Papillomavirus(HPV) Cervical cancer datasets from Indian women was put through an analysis to detect : 1. Any possible HPV integration 2. Sites of HPV integration Who Cares? Replacing whole genome sequencing, by targeted sequencing at the sites where these virus have been detected in a cohort of samples, thus speeding up the whole process.
  • 49. I NTRODUCTION S IGNIFICANT M UTATIONS RawData V IRAL G ENOME D ETECION Align data to human genome reference R EPRODUCIBILITY Extract unmapped regions Align unmapped regions to Virus genome Extract mapped BLAST regions C ONCLUSIONS
  • 50. I NTRODUCTION S IGNIFICANT M UTATIONS RawData V IRAL G ENOME D ETECION Align data to human genome reference R EPRODUCIBILITY Extract unmapped regions Align unmapped regions to Virus genome Extract mapped BLAST regions C ONCLUSIONS
  • 51. I NTRODUCTION S IGNIFICANT M UTATIONS RawData V IRAL G ENOME D ETECION Align data to human genome reference R EPRODUCIBILITY Extract unmapped regions Align unmapped regions to Virus genome Extract mapped BLAST regions C ONCLUSIONS
  • 52. I NTRODUCTION S IGNIFICANT M UTATIONS RawData V IRAL G ENOME D ETECION Align data to human genome reference R EPRODUCIBILITY Extract unmapped regions Align unmapped regions to Virus genome Extract mapped BLAST regions C ONCLUSIONS
  • 53. I NTRODUCTION S IGNIFICANT M UTATIONS RawData V IRAL G ENOME D ETECION Align data to human genome reference R EPRODUCIBILITY Extract unmapped regions Align unmapped regions to Virus genome Extract mapped BLAST regions C ONCLUSIONS
  • 54. I NTRODUCTION S IGNIFICANT M UTATIONS RawData V IRAL G ENOME D ETECION Align data to human genome reference R EPRODUCIBILITY Extract unmapped regions Align unmapped regions to Virus genome Extract mapped BLAST regions C ONCLUSIONS
  • 55. I NTRODUCTION S IGNIFICANT M UTATIONS RawData V IRAL G ENOME D ETECION Align data to human genome reference R EPRODUCIBILITY Extract unmapped regions Align unmapped regions to Virus genome Extract mapped BLAST regions C ONCLUSIONS
  • 56. I NTRODUCTION S IGNIFICANT M UTATIONS G ALAXY W ORKFLOW V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS
  • 57. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY Figure: Aligned HPV genomes C ONCLUSIONS
  • 58. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS R EPRODUCIBILITY In pursuit of novel ’discovery’, standardizing the data analysis pipeline is often ignored, leading to dubious conclusions Analysis should be reproducible and above all, correct Parameter’s values can change the results by a big factor, they need to be documented/logged Garbage in, Garbage out
  • 59. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS C ONCLUSIONS With the Galaxy tool box for identification of significant mutations and the study of the science behind the methods, the next steps would be to: Open source the toolbox to the community: A tool makes little sense if it is not in a usable form, community feedback will be used to add more tools and improve the existing ones A new method for driver mutation prediction: all the methods have low level of concordance. A new method that takes into account the available data at all levels : mutations, transcriptome and micro array data is possible. With the Galaxy toolbox in place, it would be possible to integrate information at various levels
  • 60. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS F UTURE W ORK Develop an algorithm that integrates machine learning approach with functional approach by zeroing down upon only those attributes that are known to have an impact The algorithm would also account for information at other levels: RNA expressions, Clinical data. Integrating information at all levels would provide a deeper insight The developed Galaxy toolbox will be used a the basic framework for integrating information
  • 61. I NTRODUCTION S IGNIFICANT M UTATIONS V IRAL G ENOME D ETECION R EPRODUCIBILITY C ONCLUSIONS R EFERENCES I Hannah Carter, Sining Chen, Leyla Isik, Svitlana Tyekucheva, Victor E Velculescu, Kenneth W Kinzler, Bert Vogelstein, and Rachel Karchin. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer research, 69(16):6660–6667, 2009.