Machine Learning Applications in
Computational Genomics
— Some new algorithms for understanding cancer genomes
Jian Ma
Computational Biology Department
School of Computer Science
2
TCTCTCAGAGGGCCCTGATGGAAGAATCCCCCTACCACCCTTCCAGGCTGACTTCTGTCTATTTCTCCTGCAGAGTGAGCTGGACTTGGAAAAGGGCTTG
GAGATGAGAAAATGGGTCCTGTCGGGAATCCTGGCTAGCGAGGAGACTTACCTGAGCCACCTGGAGGCACTGCTGCTGGTGAGGAGGATTTAGGGAGCTG
AGCAGGGCGGGATGGGGCAGGGTGACAGGGTTGGGGAGCCTCTTTGCCCTTAAGTCCCAGGTCAGCTGTCAGAGCCTGGGTGCAGCTCGCCATCCCTGGA
GTGGATACCAGTGGAAGACTGAGTTGCCAAACCAAGCTGGTTTTAAAATTGTATTTGTTATGTGATTTAAAAATAAAAGTGCATATGTCAGGTAACCATG
ACTGTCTACTGCCATACAATGCACCTGACGGATGGCAGCCCCTCTCACCTGTGCTACCTCACTTGTGCCCTCTTCCAGCCCATGAAGCCTTTGAAAGCCG
CTGCCACCACCTCTCAGCCGGTGCTGACGAGTCAGCAGATCGAGACCATCTTCTTCAAAGTGCCTGAGCTCTACGAGATCCACAAGGAGTTCTATGATGG
GCTCTTCCCCCGCGTGCAGCAGTGGAGCCACCAGCAGCGGGTGGGCGACCTCTTCCAGAAGCTGGTGAGTAACCCAGGGCCGGTGCTGGGACTACAGGCG
TGTACCACCACGTCCAGCTAATTTTTTGCATTTTTAGTAGAGACAGGGTTTTGCTATGTTGGCCAGGCTGGTCTCAAACTCCTAACCTCAAGTGATCCAC
CTGCCTCAGCCTCCCAAAGTACTGAGATTACAGGCGTGAGCCGCCATGCCCAGCCTTTTTTTTTTTTTTTCTAATTTATATTTATTTAGATAGTTATTTT
TAAAAAGAGATGGGGACTTACTACGTTGTCCAGGCTGGAGTGCAGTGGCTATTCACAGGCGCAATTCCACTGCTCATCAGCACGGGAGTTTTGACCTCCT
TCCTTTCCAACCTTGGCTGTTTCACTCCTTCTTAGGCAAACTGATGGTTCCCGACTCCTGGGAGGTCACCATATTGATGCCAAACTTAGTGTGTAGTGCA
CTACAGCCCAGAACTCCTGACTGAAGCCATCCTCCGGCCTCAGCCTTCCGCGTAGCTGGGGCTATAGGTGCACGCCACCACACCCTGTGTGTGGCTGGGA
CTACAGGTGCACGCCATCACACCCTGTGTGCGCCATCACACCCTGTGTGCACCATCACACCCTGTGTGCACACACTTTCCCTAAAGCAGGCTTCCTCCGC
TGGGAAACAAGTCCTCTAGGGGCAGGTGTGGCCAGAGGCCAGGCCCCCCTCTAAGTGTGAAGAGCATGTGATTCCTTAAAAGCCCTTCCCCCAGCACTTC
TGGACTACCGAGACACACAGCTCTGGCCTCGGGCCTCCCCTTGGCTGGTGCTGGGGGCTGAGTTTTCTGCTCTGAGGTGTGGCTTTCCTGTAGGGGGACC
CCTCCCTCTGCCACCCTGTGCTGCAGACCCCCAGACTCCAGGCCAGAGCTAAGGCTTGAGGAACACAGAAGGCACTTAATTTGTTCCAGTTCTTGCTCCC
TGGGGCTCTTTCCCCCATGGCCAGAGAGCAGGAGGCTGTATTTTGATACATGCTGCCCCCTCCATCTTTGAAGCCCCCCCACCCCCGTTTCTCCGTGTGT
GTGTCAGCAGTTTTAAACCTAGTGGAGGGTGGTGGCTCGGGCTGGGCTCCGCGTCGGGCTGCCCCGCAGCTGCTCTTGGGCAGCCAGGGCCGCTGGGTGT
GGGGCCGCCGGGAATGGCGGGCCCGGGTGAGGGCGGGCCCGGGTGAGGGCGGGGGCGGAGAGGCGAAGAAGCTGCAGGAAGGGAGGGTGACGAGGGGGAA
GCGAAGGAAGGGGAAGAGGAAGGGAAAAGCGAGCGAGAGGGGCAAGGCGGAAGAGGAAGCAGGGCGGAAGGGAAGCCCGGGCCGCAGACGGCGAAGGAGG
CAGCGGGCCGGGGGCTGAGGCGGGAGCGAGGACACGCCCAAGAGAGGAAGCAGAGGGAGGCGGAAGCGTGGAGGAAGGGGCGAGAGGCATCATCAAAGGA
GATGAGGGGAGCGTAGGGGCCGGGAAAGAGGCACAAGGAAGAAAGTATGGGAAGGAGGAATGGAGGGTCAGGGCTAGGCGGCGGGAGGGCGCCAGGCCGG
GAAGAGTACAAGGACAAGGAGGTCAGGTTTGGGCCTACATCCCGGGGACAGGGGCGGCCATGGCGGCGGCAGCCAGGGAGGAGGAGGAGGAGGCGGCTCG
GGAGTCAGCCGCCTGCCCGGCTGCGGGGCCAGCGCTCTGGCGCCTGCCGGAAGTGCTGCTGCTGCACATGTGCTCCTACCTCGACATGCGGGCCCTCGGC
CGCCTGGCCCAGGTGTACCGCTGGCTGTGGCACTTCACCAACTGCGACCTGCTCCGGCGCCAGATAGCCTGGGCCTCGCTCAACTCCGGCTTCACGCGGC
TCGGCACCAACCTGATGACCAGTGTCCCAGTGAAGGTGTCTCAGAACTGGATAGTGGGGTGCTGCCGAGAGGGGATTCTGCTGAAGTGGAGATGCAGTCA
GATGCCCTGGATGCAGCTAGAGGATGATGCTTTGTACATATCCCAGGCTAATTTCATCCTGGCCTACCAGTTCCGTCCAGATGGTGCCAGCTTGAACCGT
CAGCCTCTGGGAGTCTGCTGGGCATGATGAGGACGTTTGCCACTTTGTGCTGGCCACCTCGCATATTGTCAGTGCAGGAGGAGATGGGAAGATTGGCCTT
GGTAAGATTCACAGCACCTTCGCTGCCAAGTACTGGGCTCATGAACAGGAGGTGAACTGTGTGGATTGCAAAGGGGGCATCATATCATTGTGAGTGGCTC
CAGGGACAGGACGGCCAAGGTGTGGCCTTTGGCCTCAGGCCAGCTGGGGTAGTGTTTATACACCATCCAGACTGAAGACCAAATCTGGTCTGTTGCTATC
Fundamental question: How do changes in genome sequences
give rise to phenotypic differences (e.g., disease states)?
! When the changes entered the genome and how they have evolved
! Their roles in genome organization and gene regulation for human biology
! Their implications in human diseases such as cancer
Our goal — from base-pairs to bedside
Why Computational Genomics?
3
Why Computational Genomics?
! Key to personalized precision medicine, especially for cancer
4
David Patterson
! Cancer research has become big data science
! How to store and manage data efficiently
! How to analyze data in a distributed environment
! How to enhance data security but reduce barriers for sharing
! How to extract meaningful patterns
! How to identify mechanisms to help treatment
! …
The Human genome:
the “blueprint” of our body
5
GTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGA
TTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCC
CCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATT
AGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCT
ATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAAC
ATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATT
James Watson

Francis Crick
February 15, 2001
March, 2011
DNA, Chromosome, and Genome
6
[Figures: textbook excerpts from "Chapter 4: DNA, Chromosomes, and Genomes"]
Figure 4-72 Chromatin packing. This model shows some of the many levels of chromatin packing postulated to give rise to the highly condensed mitotic chromosome. Levels shown: "beads-on-a-string" form of chromatin (11 nm); 30-nm chromatin fiber of packed nucleosomes (30 nm); section of chromosome in extended form (300 nm); condensed section of chromosome (700 nm); entire mitotic chromosome (1400 nm).
Figure 4-4 (partially visible): the DNA double helix, with sugar-phosphate backbones, hydrogen-bonded base pairs (adenine and thymine shown), and 5' and 3' ends.
Visible text: "This complementary base-pairing enables the base pairs to be packed in the energetically most favorable arrangement in the interior of the double helix. In this arrangement, each base pair is of similar width, thus holding the sugar-phosphate backbones an equal distance apart along the DNA molecule. To maximize the efficiency of base-pair packing, the two sugar-phosphate backbones […]"
DNA, RNA, Protein
! Central Dogma in molecular biology
• DNA
• RNA
• Protein
! In general, proteins do most of the work and are encoded by
subsequences of DNA known as genes.
! However, less than 2% of the human genome codes for
proteins.
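The DNA → RNA → Protein flow can be illustrated with a small sketch. It uses the standard genetic code and a 45-base DLX1 coding fragment that appears elsewhere in this deck together with its amino acids (K P R T I Y S S L Q L Q A L N). The function names `transcribe` and `translate` are illustrative choices, and the sketch assumes the input is the coding strand read in frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna):
    """Return the mRNA that matches the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an in-frame mRNA into one-letter amino acids, stopping at a stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# DLX1 coding fragment (human row of the alignment shown in the conservation slide).
dlx1_fragment = "AAACCCAGGACGATTTATTCCAGTTTGCAGTTGCAGGCTTTGAAC"
peptide = translate(transcribe(dlx1_fragment))
```

Here `peptide` should reproduce the amino-acid line printed above the alignment in the conservation slide.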
7
Most of the genome is non-coding
8
© 2005 Nature Publishing Group
Composition of the human genome (pie chart, share of the genome):
LINEs                           20.4%
SINEs                           13.1%
LTR retrotransposons             8.3%
DNA transposons                  2.9%
Simple sequence repeats          3%
Segmental duplications           5%
Miscellaneous heterochromatin    8%
Miscellaneous unique sequences  11.6%
Introns                         25.9%
Protein-coding genes             1.5%
Clipped text from the source article: "[…] Class II elements transpose directly from DNA to DNA, and include DNA transposons and […] transposable elements (MITEs). […] elements (and especially their extinct remnants) make up a large portion of the human genome, with […] example, the SINE Alu element) present in more than a million copies. Transposable-element […] complex interactions with the host genome and other subgenomic elements, ranging from […]. For a review of transposable-element structure, origins, impacts and evolution see REF. 17."
www.nature.com/reviews/genetics
Nat Rev Genet, 2005
Most functional information is non-coding
! 5% highly conserved, but only 1.5% encodes proteins
9
[Figure: UCSC Genome Browser view of chr2 (q31.1), positions 172,660,000 to 172,675,000, showing the DLX1 and DLX2 genes (UCSC Genes track, based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics) and the Vertebrate Multiz Alignment & PhastCons Conservation track (28 species, from chimp to medaka). A base-level zoom on a DLX1 coding segment (amino acids K P R T I Y S S L Q L Q A L N; human sequence AAACCCAGGACGATTTATTCCAGTTTGCAGTTGCAGGCTTTGAAC) shows nearly identical sequence in the 19 aligned species from human to platypus.]
What do they do?
Annotating the non-coding regions
10
[Figure: UCSC Genome Browser view (hg19, 10 kb scale) of chr2 around 20,090,000 to 20,100,000 near TTC32, with the NKI LADs (Tig3) and LaminB1 (Tig3) tracks and ChIP-seq signal tracks for two cell lines. GM78 tracks: CHD2 IgM, Pol2 IgM, Pol2 Std, Rad2 IgR, TBP IgM, Z274 Std. K562 tracks: CHD2 IgR, Pol2 IgM, IFa3 Pol2 Sd, IFa6 Pol2 Sd, IFg3 Pol2 Sd, IFg6 Pol2 Sd, Pol2 Std, Rad2 Std, TBP IgM, Z274 UCD.]
Excerpt from the ChromHMM paper (text is clipped at the slide edges; gaps are marked […]):
ChromHMM also enables the analysis […] across multiple cell types. When the […] common across the cell types, a common […] a virtual ‘concatenation’ of the […]
[…] the observed combination of chromatin marks using […] independent Bernoulli random variables, which […] learning of complex patterns of many chromatin […]. As input, it receives a list of aligned reads for each […] mark, which are automatically converted into presence […] calls for each mark across the genome, based on […] background distribution. One can use an optional additional […] of aligned reads for a control dataset to either adjust […] for present or absent calls, or as an additional input […]. […] the user can input files that contain calls from […] peak caller. By default, chromatin states are analyzed […] base-pair intervals that roughly approximate nucleosome […] smaller or larger windows […]. We also developed an […] parameter-initialization procedure […] relatively efficient inference […] comparable models across different […] of states (Supplementary […]).
ChromHMM — Ernst and Kellis, Nature Methods 2012
Alternatively a model can be learned by a virtual ‘stacking’ of all marks across cell types, or independent models can be learned in each cell type. Lastly, ChromHMM supports the comparison of models with different number of chromatin states based on correlations in their emission parameters (Supplementary Fig. 4). We wrote the software in Java, which allows it to be run on virtually any computer. ChromHMM and additional documentation is freely available at http://compbio.mit.edu/ChromHMM/.
[Figure, panels a to c: (a) chromatin-state annotation for GM12878 on chr4 (103,650,000 to 103,750,000, 50 kb scale; RefSeq genes NFKB1 and MANBA) with 15 states: 1_Active_Promoter, 2_Weak_Promoter, 3_Poised_Promoter, 4_Strong_Enhancer, 5_Strong_Enhancer, 6_Weak_Enhancer, 7_Weak_Enhancer, 8_Insulator, 9_Txn_Transition, 10_Txn_Elongation, 11_Weak_Txn, 12_Repressed, 13_Heterochrom/lo, 14_Repetitive/CNV, 15_Repetitive/CNV; (b) emission parameters per state (user order) for the marks CTCF, H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, WCE; (c) transition parameters (state from vs. state to, user order). A further panel shows GM12878 fold enrichments per state for the categories Genome (%), RefSeq TSS, CpG island, RefSeq TSS 2kb, RefSeq exon, RefSeq gene, RefSeq TES, Conserved, Lamina.]
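The excerpt above says ChromHMM models the observed combination of chromatin marks with independent Bernoulli random variables. A minimal sketch of what that means for a single genomic bin follows. The state names and probabilities are made-up illustrative values, not learned ChromHMM parameters, and the sketch ignores the transition parameters of the full hidden Markov model.

```python
def emission_prob(observed, mark_probs):
    """Probability of a presence/absence pattern under one state.

    observed:   dict mark -> 0 or 1 (absent or present in this bin)
    mark_probs: dict mark -> probability that the mark is present in this state
    Marks are treated as independent Bernoulli variables, so the joint
    probability is the product of the per-mark probabilities.
    """
    prob = 1.0
    for mark, p_present in mark_probs.items():
        prob *= p_present if observed.get(mark, 0) else (1.0 - p_present)
    return prob


# Hypothetical emission parameters for two states (illustrative values only).
states = {
    "promoter_like": {"H3K4me3": 0.9, "H3K36me3": 0.1, "H3K27me3": 0.05},
    "transcribed_like": {"H3K4me3": 0.05, "H3K36me3": 0.8, "H3K27me3": 0.05},
}

# One bin where only H3K4me3 is called present.
observed_bin = {"H3K4me3": 1, "H3K36me3": 0, "H3K27me3": 0}
likelihoods = {name: emission_prob(observed_bin, probs) for name, probs in states.items()}
most_likely_state = max(likelihoods, key=likelihoods.get)
```

In the full model these per-bin emission probabilities are combined with transition probabilities between neighboring bins, which is why panels b and c of the figure show both parameter sets.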
Cancer genomics
workflow
Each type of cancer is different
12
[Figure (panel B): non-synonymous mutations per tumor (median +/- one quartile) across cancer types, grouped as adult solid tumors, liquid, and pediatric, with a "Mutagens" label; y-axis ticks run from 0 to 250 and from 500 to 1500. Cancer types shown: Colorectal (MSI), Lung (SCLC), Lung (NSCLC), Melanoma, Esophageal (ESCC), Non-Hodgkin lymphoma, Colorectal (MSS), Head and neck, Esophageal (EAC), Gastric, Endometrial (endometrioid), Pancreatic adenocarcinoma, Ovarian (high-grade serous), Prostate, Hepatocellular, Glioblastoma, Breast, Endometrial (serous), Lung (never smoked NSCLC), Chronic lymphocytic leukemia, Acute myeloid leukemia, Glioblastoma, Neuroblastoma, Acute lymphoblastic leukemia, Medulloblastoma, Rhabdoid. The adjacent article text on metastasis is clipped in the slide.]
Number of nonsynonymous mutations in representative human cancers, detected by genome-wide sequencing studies.
Vogelstein et al.
Science 2013
Each individual tumor is different
! Data from TCGA’s analyses show that most cancer types have
a large number of mutations that occur at a low frequency.
! Long-tail distribution
13
doi: 10.1038/nature08645 SUPPLEMENTARY INFORMATION
SI Guide
Supplementary Figure 1 Haploid physical coverage of breast cancer samples. Physical
coverage indicates the number of DNA fragments of which both ends have been sequenced
that on average overlie any position in the genome.
Supplementary Figure 2 Genome wide circos plots of somatic rearrangements in all 24
breast cancers in the study.
Stephens et al. Nature 2009
Supervised learning Un-supervised learning
genes
samples
Analyzing gene expression data
How to deal with high dimensionality?
Identify the most important genes
! d_j is the damping factor, a parameter representing the extent to which the ranking depends on the structure of the graph.
! f_j is the prior probability of gene j, which we set to the absolute differential expression.
! deg_i is the in-degree of i.
15
Gene Network
Gene Expression
Somatic
Alteration Data
(SNP, CNV, etc.)
Ranks of
Genes
▪ A ranking framework based on PageRank that considers the impact of genes in the network
▪ Impact includes connectivity and the extent to which downstream genes are differentially expressed
▪ A dynamic damping factor is used to improve on the original PageRank in ranking genes
DawnRank
Personalized Driver Alterations
$$r_j^{t+1} = (1 - d_j)\, f_j + d_j \sum_{i=1}^{N} \frac{A_{ji}\, r_i^{t}}{\deg_i}, \qquad \deg_i = \sum_{j=1}^{N} A_{ji}$$
Hou and Ma, Genome Med 2014
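The update rule on this slide can be iterated to a fixed point. The sketch below follows the slide's formula literally, with A given as a nested list where `adj[j][i]` stands for A_ji. How DawnRank chooses the per-gene damping factor d_j is not given on the slide, so the damping values are passed in as an input. The three-gene network and all numbers are hypothetical.

```python
def dawnrank_sketch(adj, prior, damping, max_iter=100, tol=1e-9):
    """Iterate r_j <- (1 - d_j) * f_j + d_j * sum_i A_ji * r_i / deg_i.

    adj[j][i] is A_ji, prior[j] is f_j, damping[j] is d_j.
    deg_i is the column sum of A, as defined on the slide.
    """
    n = len(prior)
    deg = [sum(adj[j][i] for j in range(n)) for i in range(n)]
    rank = list(prior)
    for _ in range(max_iter):
        updated = []
        for j in range(n):
            # Genes with deg_i == 0 contribute nothing (avoids division by zero).
            flow = sum(adj[j][i] * rank[i] / deg[i] for i in range(n) if deg[i] > 0)
            updated.append((1.0 - damping[j]) * prior[j] + damping[j] * flow)
        change = sum(abs(a - b) for a, b in zip(updated, rank))
        rank = updated
        if change < tol:
            break
    return rank


# Hypothetical network: gene 0 receives rank from genes 1 and 2.
adjacency = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]
prior = [0.2, 0.5, 0.3]      # f: normalized absolute differential expression (made up)
damping = [0.5, 0.5, 0.5]    # d_j: fixed here; DawnRank derives it per gene
ranks = dawnrank_sketch(adjacency, prior, damping)
```

With these inputs, gene 0 ends up with the highest score even though its own prior is the lowest, because it accumulates rank from the two differentially expressed genes connected to it.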
Tumor heterogeneity vs.
gene networks
16
NCIS - Liu et al. BMC Bioinfo 2014
C3 - Hou et al. Bioinformatics 2016
LDGM - Tian, Gu, and Ma, Nucleic Acids Res 2016
[Figure, panels A to C: (A) degree of ESR1 vs. number of interactions; (B) rank of ESR1 vs. number of interactions; methods compared in (A) and (B): LDGM, Glasso, JGL, CNJGL; (C) differential network over estrogen signaling pathway genes (nodes include ESR1, EGFR, AKT1, PIK3CA, MAPK1, SRC, FOS, and JUN), with a legend for Luminal A, Basal-like, % degree from Luminal A, and % degree from Basal-like.]
Figure 5: Differential networks on estrogen signaling pathway reconstructed based on gene expression data from breast cancer Luminal A and Basal-like subtypes. (A) The degree of ESR1 in estimated differential networks with increased number of interactions. (B) The rank of ESR1 by its degree in differential networks. The number of interactions is up to 1,000 in (A) and (B). (C) A differential network $\hat{\Delta}$ estimated by LDGM with $\lambda = 0.362$. Node size is proportional to the node’s degree. Width of an interaction i–j is proportional to the score $|\hat{\Delta}_{ij}|$. The origin of interactions in the differential network is inferred by a principle of majority approach based on Glasso (see Supplementary Text).
J.P.Hou et al.
Deep learning applications
17
[Figure, panels A to D: (A) workflow: raw data, pre-processing, clean data, feature extraction, features, training, model, evaluation, results; (B) supervised methods relate x to y (linear regression, logistic regression, random forest, SVM, …) and unsupervised methods use x only (PCA, factor analysis, clustering, outlier detection, …); (C) raw data vs. discriminative features after feature extraction, with the labels intron and exon; (D) a deep network that takes raw DNA sequence (A, C, G, T) through Layer 1 and Layer 2 to the outputs intron, exon, and TSS.]
Figure 1. Machine learning and representation learning.
(A) The classical machine learning workflow can be broken down into four steps: data pre-processing, feature extraction, model learning and model evaluation. (B) Supervised
machine learning methods relate input features x to an output label y, whereas unsupervised method learns factors about x without observed labels. (C) Raw input data are
often high-dimensional and related to the corresponding label in a complicated way, which is challenging for many classical machine learning algorithms (left plot).
Alternatively, higher-level features extracted using a deep model may be able to better discriminate between classes (right plot). (D) Deep networks use a hierarchical
structure to learn increasingly abstract feature representations from the raw data.
Molecular Systems Biology Deep learning for computational biology Christof Angermueller et al
Published online: July 29, 2016
From more traditional machine learning applications to deep learning applications
Angermueller et al., Mol Sys Bio 2016
DeepBIND
18
Figure 1 DeepBind’s input data, training procedure and applications. 1. The sequence specificities
of DNA- and RNA-binding proteins can now be measured by several types of high-throughput
assay, including PBM, SELEX, and ChIP- and CLIP-seq techniques. 2. DeepBind captures these
binding specificities from raw sequence data by jointly discovering new sequence motifs along with
rules for combining them into a predictive binding score. Graphics processing units (GPUs) are
used to automatically train high-quality models, with expert tuning allowed but not required. 3. The
resulting DeepBind models can then be used to identify binding sites in test sequences and
Alipanahi et al. Nat Biotech 2015
(Other methods: DeepSEA, Zhou & Troyanskaya, Nat Methods 2015; DanQ, Quang & Xie, Nucleic Acids Res 2016)
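The caption describes learning sequence motifs from raw sequence and combining them into a binding score. The building block behind this kind of model can be sketched as one-hot encoding followed by sliding a motif filter along the sequence, a rectification step, and max pooling. This is a simplified illustration with one hand-set filter, not DeepBind's code or architecture. The helper names and the TGA motif are hypothetical, and a real model learns many filters from data.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of [A, C, G, T] indicators (unknown bases become all zeros)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def motif_filter(motif, match=1.0, mismatch=-1.0):
    """Hand-set filter: rewards the motif base at each position, penalizes the others."""
    return [[match if b == base else mismatch for b in BASES] for base in motif.upper()]


def scan_motif(seq, weights):
    """Slide the filter along the sequence (a 1D convolution) and rectify each score."""
    encoded = one_hot(seq)
    width = len(weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        raw = sum(
            encoded[start + pos][col] * weights[pos][col]
            for pos in range(width)
            for col in range(len(BASES))
        )
        scores.append(max(raw, 0.0))  # rectification: keep only positive evidence
    return scores


def binding_score(seq, weights):
    """Max-pool the rectified scores into a single score for the sequence."""
    scores = scan_motif(seq, weights)
    return max(scores) if scores else 0.0


tga_filter = motif_filter("TGA")
position_scores = scan_motif("ACTGAC", tga_filter)
score = binding_score("ACTGAC", tga_filter)
```

The position of the maximum in `position_scores` marks where the motif matches, which is the sense in which such a model can be used to locate candidate binding sites in test sequences.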
Cancer genome
19
MCF-7 http://www.path.cam.ac.uk/~pawefish/
20
Structural variations (SVs) in cancer genomes
inversion, translocation, gain, loss, duplication
Whole genome sequencing
Methods: DELLY, Meerkat, BreakDancer, CREST, CNVnator, CONSERTING, and many others
Aneuploidy — Common feature of cancer cells
21
MCF-7 http://www.path.cam.ac.uk/~pawefish/
! Allele-specific copy
number (ASCN) tools
• ABSOLUTE, ASCAT,
Patchwork
! SVs can further modify the aneuploid cancer genome into a mixture of genomic segments with an extensive range of CNAs
! We need methods that combine SV and ASCN
! How do SVs interact with ASCNs? How do different SVs interact with each other?
[Figure (panel a): ploidy (1 to 5+) and purity (0 to 1.0) of samples by cancer type (LUAD, LUSC, HNSC, KIRC, BRCA, BLCA, CRC, UCEC, GBM, OV), with the percent of samples with whole-genome doubling (WGD) per type and sample counts across all lineages (0 to 1,000) grouped as near diploid, 1 WGD, or 2+ WGD; panel b compares amplifications and deletions in near-diploid and WGD samples.]
Zack et al. Nature Genetics 2013
Goal — Quantify allele-specific SVs
22
Weaver — algorithm overview
23
Input: SV list, BAM file, SNP list, 1KGP haplotypes
Derived data: Cancer Genome Graph, coverage from read mapping (with mappability and GC content), SNP linkage, SNP LD
Model: Probabilistic Graphical Model (Markov Random Field)
Output: Purity, ASCNG, ASCNS, Timing of SV, Phasing
[Figure, panels (A) to (C): hypothetical chromosomes chrA and chrB divided into regions R1 to R21 and joined by SVs m, n, p, q, s, and t (del, dup, intrachromosomal, interchromosomal), with the corresponding graph representations.]
Li et al. Cell Systems 2016
24
Output: Purity, ASCNG, ASCNS, Timing of SV, Phasing
[Figure (B): coverage track (0 to 142) over chr9 from 21,850,000 to 22,250,000 (100 kb scale bar; genes MTAP, C9orf53, CDKN2A, CDKN2B-AS1, CDKN2B), illustrating ASCNS and Timing of SV: LOH and first amplification, deletions Del1 and Del2, and a second amplification.]
Figure 1: (A) Schema diagram for Weaver. Dark green boxes show the different types of analyses, unique to Weaver that
are not dealt with by other methods, while light green ones show ‘by-products’ of Weaver shown to have an improvement
over existing methods. (B) An example demonstrating a Weaver output focused on ASCNS and Timing of SV. Dark blue
segments (two copies) and light blue segment (one copy) represent a portion of the MCF-7 genome that originated from the
same allele on chr9. The other allele was lost during tumorigenesis, resulting in LOH. The predicted evolution of this region
! MRF:
• genome node, cancer node, genome edge, cancer edge
25
[Figure 5, panels (A) to (E): two hypothetical chromosomes, chrA (regions R1 to R10) and chrB (regions R11 to R21), linked by SVs m, n, p, q, s, and t (dup, intrachromosomal, interchromosomal); the Cancer Genome Graph; the MRF with merged nodes such as R(3,4), R(5,6), R(7,9), R(12,14), and R(19,20) and messages such as m_{q→(12,14)} and m_{(12,14)→s}; example inputs and outputs (D); timing of SVs, pre- vs. post- (E); parameters 𝜇0 = 0; 𝜇1 = 1; b = 10.]

Genomic regions (inputs: cov, allele_freq; outputs: CN_1, CN_2):
label  cov  allele_freq  CN_1  CN_2
R1     30   0.33         2     1
R2     40   0.5          2     2
R5     20   0            2     0
R7     10   0            1     0
R12    20   0.5          1     1
R17    10   0            0     1

SVs (inputs: L_pos, R_pos; outputs: L_allele, R_allele, CN):
label  L_pos  R_pos  L_allele  R_allele  CN
m      +R2    -R2    2         2         1
n      +R6    -R10   1         1         1
p      +R4    -R16   2         2         1
q      -R12   -R21   2         2         1
s      +R14   +R18   2         2         1
t      +R16   -R18   1         1         2

Figure 5: (A) Hypothetical cancer chromosomes with rearrangement structure hidden. Orange and blue segments represent paternal/maternal allele. Red dashed line represent linkages by SVs. (B) The Cancer Genome Graph, constructed from (A), with nodes (boxes) representing genomic regions and edges representing reference (solid lines) or cancer (dashed […]
Cancer Genome Graph
MRF representation
Excerpt from the Weaver paper (the middle of each line is clipped in the slide; gaps are marked […]):
[…] 2 million from Weaver based on various datasets, depending on the size of […] heterozygosity (LOH) regions and the number of SVs. The rationale behind the segmentation step with SV […] is known that most of the time ASCNG boundaries coincide with SV breakpoints [14]. Our segmentation […] using SV boundaries has the advantage to provide base-level ASCNG boundaries as compared to existing segmentation methods in copy number analysis, which typically use fixed segmentation size.
Given the segmentation of the genome and SV set C, we then build Cancer Genome Graph G := {R, E} (Fig. 5(B)), with nodes representing genomic region sets (R) and edges representing reference adjacencies (E_r) (solid lines in the figure) if two nodes are adjacent in the normal genome […] adjacencies (E_c) (dashed lines in the figure) if two nodes are adjacent in the cancer genome by SV c […] configurations E between node R_i and R_j can be represented as (α_i R_i ∼ α_j R_j), α ∈ {+, −}, with + and − representing the tail (right) and head (left) of a given genomic region R, e.g., (+R_i ∼ −R_{i+1}) ∈ E_r, if R_i and R_{i+1} are adjacent regions from the same chromosome in the normal genome.
We then convert the original Cancer Genome Graph G := {R, E} into […] Markov Random Field (MRF, M := {R, R_c, E_r, E_c}), which is a widely used probabilistic graphical model to […] joint probabilities. The MRF can be viewed as undirected graph and the aggregated inference problem in Weaver […] given sequencing data can be viewed as a maximum a posteriori (MAP) problem with hidden states and observations explained in the following sections. Unlike conventional methods for estimating copy number changes […] hidden Markov models (HMMs), which are designed for sequential data and only consider the dependencies between ‘local’ variables, MAP solution of MRF model provides the most probable configuration of aneuploid cancer genomes with complex SVs, involving ‘global’ variable dependencies defined by long-range SVs. […] steps are described in Supplementary Note 6. In the following sections, we describe hidden states, […] and formal function of the MRF MAP problem. Details on potential functions on nodes and edges […] in the Supplementary Note.
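The graph construction described in the excerpt can be sketched in a few lines: reference edges join the tail (+) of each region to the head (−) of its right neighbor on the same chromosome, and cancer edges come from the SV list. The sketch uses the six SVs from the Figure 5 example (m, n, p, q, s, t). Placing R1 to R10 on chrA and R11 to R21 on chrB is an assumption read off the figure layout, and the function name is illustrative rather than Weaver's API.

```python
def build_cancer_genome_graph(chromosomes, svs):
    """Return (reference_edges, cancer_edges) for a segmented genome.

    chromosomes: dict name -> ordered list of region ids
    svs: list of (label, left_sign, left_region, right_sign, right_region)
    An edge end is a (sign, region) pair; '+' is the tail (right end)
    and '-' is the head (left end) of a region.
    """
    reference_edges = []
    for regions in chromosomes.values():
        for left, right in zip(regions, regions[1:]):
            # (+R_i ~ -R_{i+1}): adjacent regions in the normal genome
            reference_edges.append((("+", left), ("-", right)))
    cancer_edges = [
        (label, (sign_l, region_l), (sign_r, region_r))
        for label, sign_l, region_l, sign_r, region_r in svs
    ]
    return reference_edges, cancer_edges


# Assumed layout of the Figure 5 example.
chromosomes = {"chrA": list(range(1, 11)), "chrB": list(range(11, 22))}
svs = [
    ("m", "+", 2, "-", 2),
    ("n", "+", 6, "-", 10),
    ("p", "+", 4, "-", 16),
    ("q", "-", 12, "-", 21),
    ("s", "+", 14, "+", 18),
    ("t", "+", 16, "-", 18),
]
reference_edges, cancer_edges = build_cancer_genome_graph(chromosomes, svs)

# SVs whose two ends lie on different chromosomes (interchromosomal links).
chrom_of = {region: name for name, regions in chromosomes.items() for region in regions}
interchromosomal = [
    label for label, (_, left), (_, right) in cancer_edges
    if chrom_of[left] != chrom_of[right]
]
```

Because the cancer edges can connect distant regions or different chromosomes, the resulting graph is no longer a chain, which is the stated reason for using an MRF rather than an HMM.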
Hidden states H
$$\mathrm{LD}(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \frac{N_{LD}(G_i^a, G_{i+1}^a) \times N_{LD}(G_i^b, G_{i+1}^b)}{N_{LD}(G_i^a, G_{i+1}^a) \times N_{LD}(G_i^b, G_{i+1}^b) + N_{LD}(G_i^a, G_{i+1}^b) \times N_{LD}(G_i^b, G_{i+1}^a)}$$
where $N_{LD}(G_i^a, G_{i+1}^a)$ is the number of phased haplotypes (total number 1092 × 2 in phase 1) in 1KGP with genotype $(G_i^a, G_{i+1}^a)$. Other genotype configurations can be similarly calculated.
(ii) Similarly, we define the read linkage score for the phasing $G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b$ as:
$$\mathrm{RL}(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \frac{N_{RL}(G_i^a, G_{i+1}^a) + N_{RL}(G_i^b, G_{i+1}^b)}{N_{RL}(R_i, R_{i+1})}$$
where $N_{RL}(R_i, R_{i+1})$ is the total number of reads covering genomic regions $(R_i, R_{i+1})$ and $N_{RL}(G_i^a, G_{i+1}^a)$ is total number of reads covering $(G_i^a, G_{i+1}^a)$. If there are no reads covering $(R_i, R_{i+1})$ ($N_{RL}(R_i, R_{i+1}) = 0$), RL = 0.
Therefore, we define genotype linkage as
$$\mathrm{GL}(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \log\big(\mathrm{LD}(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) \times \mathrm{RL}(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b)\big)$$
In real data application, we have found that RL and LD correlate very well. For example, in the MCF-7 analysis when we chose SNP pairs with 100% RL support as gold standard, we found AUC = 0.9964 using LD scores.
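The three scores above reduce to simple arithmetic on counts. The sketch below computes them for one pair of adjacent SNPs. It assumes the LD numerator is the product of the two same-haplotype counts (the superscripts are partly lost in the slide), returns 0 for LD when no haplotypes are observed (an added convention), and returns negative infinity for GL when the product is zero. All counts are made up.

```python
import math


def ld_score(n_aa, n_bb, n_ab, n_ba):
    """LD support for phasing (a,a)/(b,b) against the alternative (a,b)/(b,a).

    Each argument is the number of 1KGP haplotypes carrying that allele pair.
    """
    cis = n_aa * n_bb
    trans = n_ab * n_ba
    total = cis + trans
    return cis / total if total else 0.0  # convention when nothing is observed


def rl_score(reads_aa, reads_bb, reads_total):
    """Fraction of reads spanning both regions that support the phasing."""
    if reads_total == 0:
        return 0.0  # no spanning reads: RL = 0, as stated in the text
    return (reads_aa + reads_bb) / reads_total


def genotype_linkage(ld, rl):
    """GL = log(LD * RL); negative infinity when either score is zero."""
    product = ld * rl
    return math.log(product) if product > 0 else float("-inf")


# Hypothetical counts for one pair of adjacent heterozygous SNPs.
ld = ld_score(n_aa=900, n_bb=800, n_ab=10, n_ba=20)
rl = rl_score(reads_aa=12, reads_bb=8, reads_total=20)
gl = genotype_linkage(ld, rl)
```

A GL close to zero means both the population haplotypes and the reads agree on the phasing, and large negative values flag pairs where either source of evidence is weak.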
Markov random field model M
After we convert G into MRF M using steps in Supplementary Note 6, the MRF MAP problem is given by
$$\hat{H} = \arg\max_{H} \left\{ \sum_{i \in R} \Theta_R(O \mid H_i) + \sum_{c \in C} \Theta_C(O \mid H_c) + \sum_{i \in R} \Phi_R(O \mid H_i, H_{i+1}) + \sum_{c \in C} \sum_{i \in N(c)} \Phi_C(H_i, H_c) \right\}$$
The four terms are, in order: the genome node potential function, the cancer node potential function, the genome edge potential function, and the cancer edge potential function.
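To make the MAP objective concrete, the toy sketch below scores every assignment of copy numbers to four regions as a sum of node and edge potentials and keeps the best one. It is only an illustration of the objective. The potentials are invented for this example, the cancer node terms are omitted, and exhaustive enumeration is used in place of whatever inference procedure Weaver applies to real genomes, where enumeration is infeasible.

```python
from itertools import product


def map_by_enumeration(n_nodes, states, node_potential, edge_potentials):
    """Return the assignment maximizing the sum of node and edge potentials.

    node_potential(i, h) scores state h at node i.
    edge_potentials is a list of (i, j, fn) with fn(h_i, h_j) scoring an edge.
    """
    best_assignment, best_score = None, float("-inf")
    for assignment in product(states, repeat=n_nodes):
        score = sum(node_potential(i, assignment[i]) for i in range(n_nodes))
        score += sum(fn(assignment[i], assignment[j]) for i, j, fn in edge_potentials)
        if score > best_score:
            best_assignment, best_score = assignment, score
    return best_assignment, best_score


# Hypothetical data: read coverage of four adjacent regions, ~10x per DNA copy.
coverage = [20, 40, 38, 20]
PER_COPY = 10.0


def node_potential(i, copy_number):
    """Toy genome node potential: penalize mismatch between coverage and copy number."""
    return -(((coverage[i] - PER_COPY * copy_number) / PER_COPY) ** 2)


def genome_edge(h_i, h_j):
    """Toy genome edge potential: mild preference for equal copy number in neighbors."""
    return -0.1 * abs(h_i - h_j)


def cancer_edge(h_i, h_j):
    """Toy cancer edge potential: regions joined by an SV should agree."""
    return 0.0 if h_i == h_j else -1.0


edge_potentials = [
    (0, 1, genome_edge),
    (1, 2, genome_edge),
    (2, 3, genome_edge),
    (0, 3, cancer_edge),  # long-range link, e.g. a deletion joining regions 0 and 3
]
best, best_score = map_by_enumeration(4, range(5), node_potential, edge_potentials)
```

The long-range edge between regions 0 and 3 is what a chain-structured HMM cannot express, and it is the kind of 'global' dependency the excerpt attributes to SVs.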
ASCN and SVs in MCF-7
- 83% of SVs have copy number > 1
- 68% of the regions have imbalanced copy number
- We found 276 SVs after whole chromosome dup
- We have used physical mapping to validate the results
ASCN and SVs in HeLa
- WGS reads obtained from Adey et al. Nature 2013
- ASCNG are 97% consistent with Adey et al. (Fosmid seq)
Structural variants were identified by clustering discordantly mapped reads from 40-kb and 3-kb mate-pair libraries (Supplementary Fig. 8). Twenty interchromosomal links were identified, including links for marker chromosomes M11 (9q33–11p14) and M14 (13q21–19p13). In addition, 209 HeLa-specific deletions and 8 inversions were found (Supplementary Figs 9 and 11, and Supplementary Table 10). Only two genes that are impacted by HeLa-specific structural rearrangements (Supplementary Table 11) intersected with SCGC (STK11 (ref. 18), FHIT), both of which are recurrently deleted in cervical
18,19
pool. Alleles that were p
given clone were assigned
and the unobserved alle
haplotype. When overlap
this resulted in haplotyp
which 50% of the total len
550 kb containing 90.6%
inherited.
Most of the HeLa gen
[Figure (Adey et al.): (a) HeLa chromosomes 1-22 and X, showing the HPV integration site, linked positions labeled with marker chromosome names (for example M11 at 9q33 and 11p14, M14 at 13q21 and 19p13), links supported by sequence data, tandem duplications, probable contiguity, and colour-coded suspected haplotype (A or B). (b) Chr18: S3 window ratios, CCL-2 window ratios, S3 copy-number calls, and S3-specific differences; y-axis: window ratio / copy number; x-axis: genomic position.]
Adey et al. Nature, 2013
Application to TCGA Data
- Inter-chromosomal chromothripsis
Supplementary Figure 14: (A) Overview of the genomic landscape of a TCGA ovarian cancer sample (TCGA-36-1571).
(B) Most of the SVs linking chr4 and chr22 have copy number one. (C) Three high-coverage fold-back inversions are observed
at the boundaries of a highly amplified region of chr14, indicating that many rounds of BFB cycles occurred. The gene FOXG1 lies on
a fold-back inversion boundary and is highly amplified.
- Breakage-fusion-bridge amplifications
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

  • 1. Machine Learning Applications in Computational Genomics — Some new algorithms for understanding cancer genomes Jian Ma Computational Biology Department School of Computer Science
  • 2. [Slide background: a block of raw genomic DNA sequence] Fundamental question: How do changes in genome sequences give rise to phenotypic differences (e.g., disease states)? • When did they enter the genome, and how have they evolved? • What are their roles in genome organization and gene regulation in human biology? • What are their implications for human diseases such as cancer? Our goal: from base pairs to bedside
  • 4. Why Computational Genomics? • Key to personalized precision medicine, especially for cancer [slide caption: David Patterson] • Cancer research has become big data science • How to store and manage data efficiently • How to analyze data in a distributed environment • How to enhance data security but reduce barriers for sharing • How to extract meaningful patterns • How to identify mechanisms to help treatment • …
  • 5. The human genome: the “blueprint” of our body [Slide background: raw DNA sequence. Slide labels: James Watson, Francis Crick; February 15, 2001; March 2011]
  • 6. DNA, Chromosome, and Genome [Textbook figures from Chapter 4: DNA, Chromosomes, and Genomes. Figure 4-72, chromatin packing: a model showing some of the many levels of chromatin packing postulated to give rise to the highly condensed mitotic chromosome (DNA double helix; "beads-on-a-string" form of chromatin; 30-nm chromatin fiber of packed nucleosomes; section of chromosome in extended form; condensed section of chromosome; entire mitotic chromosome). A second figure shows the double-stranded DNA helix with its sugar-phosphate backbone and hydrogen-bonded base pairs.] Complementary base-pairing enables the base pairs to be packed in the energetically most favorable arrangement in the interior of the double helix. In this arrangement, each base pair is of similar width, thus holding the sugar-phosphate backbones an equal distance apart along the DNA molecule. To maximize the efficiency of base-pair packing, the two sugar-phosphate backbones …
  • 7. DNA, RNA, Protein • Central Dogma in molecular biology: DNA → RNA → Protein • In general, proteins do most of the work and are encoded by subsequences of DNA known as genes. • However, less than 2% of the human genome codes for proteins.
  • 8. Most of the genome is non-coding [Pie chart of human genome composition: LINEs 20.4%; SINEs 13.1%; LTR retrotransposons 8.3%; DNA transposons 2.9%; simple sequence repeats 3%; segmental duplications 5%; miscellaneous heterochromatin 8%; miscellaneous unique sequences 11.6%; introns 25.9%; protein-coding genes 1.5%] Nat Rev Genet, 2005 (© 2005 Nature Publishing Group)
  • 9. Most functional information is non-coding • 5% of the genome is highly conserved, but only 1.5% encodes proteins [UCSC Genome Browser screenshot of chr2 (q31.1) around DLX1 and DLX2, positions 172,660,000 to 172,675,000: UCSC Genes track (based on RefSeq, UniProt, GenBank, CCDS and comparative genomics) and the Vertebrate Multiz Alignment & PhastCons Conservation (28 species) track, with a base-level alignment of a DLX1 coding segment that is nearly identical across the aligned vertebrate species] What do they do?
  • 10. Annotating the non-coding regions [UCSC Genome Browser screenshot (hg19, chr2 around 20,090,000 to 20,100,000, gene TTC32): NKI LADs (Tig3) and LaminB1 (Tig3) tracks, plus ChIP-seq signal tracks for GM78 and K562 cells (CHD2, Pol2, Rad2, TBP, Z274, with several K562 interferon-treated Pol2 tracks)]
 [ChromHMM figure: chromatin-state annotation of GM12878 visualized on chr4 (103,650,000 to 103,750,000; RefSeq genes NFKB1 and MANBA) with 15 states: 1 Active Promoter, 2 Weak Promoter, 3 Poised Promoter, 4 and 5 Strong Enhancer, 6 and 7 Weak Enhancer, 8 Insulator, 9 Txn Transition, 10 Txn Elongation, 11 Weak Txn, 12 Repressed, 13 Heterochrom/lo, 14 and 15 Repetitive/CNV. Panels show emission parameters over the marks CTCF, H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac and WCE; transition parameters; and GM12878 fold enrichments (Genome %, RefSeq TSS, CpG island, RefSeq TSS 2kb, RefSeq exon, RefSeq gene, RefSeq TES, Conserved, Lamina)]
 Excerpts from the paper shown on the slide (partly clipped): ChromHMM models the observed combination of chromatin marks using independent Bernoulli random variables, which supports learning of complex patterns of many chromatin marks. As input, it receives a list of aligned reads for each chromatin mark, which are automatically converted into presence or absence calls for each mark across the genome, based on a background distribution. An optional additional set of aligned reads for a control dataset can be used either to adjust the present or absent calls or as an additional input; alternatively, the user can input files that contain calls from a peak caller. By default, chromatin states are analyzed at base-pair intervals that roughly approximate nucleosome sizes, but smaller or larger windows can be specified. ChromHMM also enables analysis across multiple cell types. When the chromatin marks are common across the cell types, a common model can be learned by a virtual ‘concatenation’ of the chromosomes. Alternatively, a model can be learned by a virtual ‘stacking’ of all marks across cell types, or independent models can be learned in each cell type. Lastly, ChromHMM supports the comparison of models with different numbers of chromatin states based on correlations in their emission parameters (Supplementary Fig. 4). We wrote the software in Java, which allows it to be run on virtually any computer. ChromHMM and additional documentation are freely available at http://compbio.mit.edu/ChromHMM/. ChromHMM: Ernst and Kellis, Nature Methods 2012
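 The emission model described in the ChromHMM excerpt (each chromatin mark is an independent Bernoulli variable given the hidden state) can be sketched in a few lines. This is only an illustration, not ChromHMM's Java code: the function names, the two state names, and the probabilities are hypothetical, and the sketch scores a single bin without the transition parameters that a full hidden Markov model would also use.

```python
import math

EPS = 1e-12  # guard against log(0); an implementation detail of this sketch


def emission_log_prob(observed, mark_probs):
    """Log-probability of one bin's presence/absence calls given one state.

    observed   -- list of 0/1 calls, one per chromatin mark
    mark_probs -- list of P(mark present | state), same order as `observed`
    Marks are treated as independent Bernoulli variables, so the
    log-probabilities simply add up.
    """
    logp = 0.0
    for present, p in zip(observed, mark_probs):
        p = min(max(p, EPS), 1.0 - EPS)
        logp += math.log(p if present else 1.0 - p)
    return logp


def best_state_for_bin(observed, emissions):
    """Pick the state with the highest emission likelihood for one bin.

    Transitions between neighbouring bins are ignored here.
    """
    return max(emissions, key=lambda s: emission_log_prob(observed, emissions[s]))


# Hypothetical two-mark, two-state example (mark order: H3K4me3, H3K27me3).
emissions = {
    "promoter_like": [0.9, 0.1],
    "repressed_like": [0.05, 0.8],
}
print(best_state_for_bin([1, 0], emissions))
print(best_state_for_bin([0, 1], emissions))
```

 In a full model these per-bin emission terms would be combined with state-to-state transition probabilities (for example with the Viterbi or forward-backward algorithm) to annotate the whole genome.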
  • 12. Each type of cancer is different [Figure from Vogelstein et al., shown with clipped fragments of the article text. Y-axis: non-synonymous mutations per tumor (median +/- one quartile). Groups: mutagen-associated cancers, adult solid tumors, liquid tumors, pediatric tumors. Cancer types shown: Colorectal (MSI), Lung (SCLC), Lung (NSCLC), Melanoma, Esophageal (ESCC), Non-Hodgkin lymphoma, Colorectal (MSS), Head and neck, Esophageal (EAC), Gastric, Endometrial (endometrioid), Pancreatic adenocarcinoma, Ovarian (high-grade serous), Prostate, Hepatocellular, Glioblastoma, Breast, Endometrial (serous), Lung (never smoked NSCLC), Chronic lymphocytic leukemia, Acute myeloid leukemia, Glioblastoma, Neuroblastoma, Acute lymphoblastic leukemia, Medulloblastoma, Rhabdoid] Number of nonsynonymous mutations in representative human cancers, detected by genome-wide sequencing studies. Vogelstein et al. Science 2013
  • 13. Each individual tumor is different • Data from TCGA’s analyses show that most cancer types have a large number of mutations that occur at low frequency. • Long-tail distribution [Figure from the Supplementary Information of Stephens et al. Nature 2009, doi: 10.1038/nature08645. Supplementary Figure 1: Haploid physical coverage of breast cancer samples. Physical coverage indicates the number of DNA fragments, of which both ends have been sequenced, that on average overlie any position in the genome. Supplementary Figure 2: Genome-wide Circos plots of somatic rearrangements in all 24 breast cancers in the study.]
  • 14. Analyzing gene expression data [Figure: gene expression matrix (genes × samples), analyzed with supervised learning and unsupervised learning]
  • 15. How to deal with high dimension? Identify the most important genes. DawnRank (Hou and Ma, Genome Med 2014) [Workflow diagram labels: Gene Network, Gene Expression, Somatic Alteration Data (SNP, CNV, etc.), Ranks of Genes, Personalized Driver Alterations] ▪ A ranking framework based on PageRank that considers the impact of genes in the network ▪ Impact includes connectivity and the extent to which downstream genes are differentially expressed ▪ A dynamic damping factor is used to improve on the original PageRank when ranking genes
 Update rule: $r_j^{t+1} = (1 - d_j)\, f_j + d_j \sum_{i=1}^{N} \frac{A_{ji}\, r_i^{t}}{\mathrm{deg}_i}$, with $\mathrm{deg}_i = \sum_{j=1}^{N} A_{ji}$ • $d$ is the damping factor, a parameter representing the extent to which the ranking depends on the structure of the graph. • $f$ is the prior probability of the gene, which we set to the absolute differential expression. • $\mathrm{deg}_i$ is the in-degree of $i$.
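 The DawnRank update, r_j(t+1) = (1 - d_j) f_j + d_j Σ_i A_ji r_i(t) / deg_i with deg_i = Σ_j A_ji, can be iterated to a fixed point. The sketch below is a minimal illustration under stated assumptions, not the published DawnRank implementation: the function name, the convergence test, and the toy network are hypothetical, and the per-gene damping factors `d` are taken as given because the slide does not say how the dynamic damping factor is computed.

```python
def dawnrank_scores(A, f, d, n_iter=200, tol=1e-10):
    """Iterate the slide's update rule until the ranks stop changing.

    A -- N x N adjacency matrix as nested lists, entries A[j][i]
    f -- prior for each gene (e.g. absolute differential expression)
    d -- damping factor for each gene (assumed to be supplied by the caller)
    """
    n = len(f)
    # deg_i = sum_j A[j][i], exactly as defined on the slide
    deg = [sum(A[j][i] for j in range(n)) for i in range(n)]
    r = list(f)  # starting from the prior is an arbitrary choice of this sketch
    for _ in range(n_iter):
        new_r = []
        for j in range(n):
            # Genes with deg_i == 0 contribute nothing (avoids division by zero).
            flow = sum(A[j][i] * r[i] / deg[i] for i in range(n) if deg[i] > 0)
            new_r.append((1 - d[j]) * f[j] + d[j] * flow)
        change = sum(abs(x - y) for x, y in zip(new_r, r))
        r = new_r
        if change < tol:
            break
    return r


# Hypothetical 3-gene network in which genes 1 and 2 both link into gene 0.
A = [[0, 1, 1],
     [0, 0, 0],
     [0, 0, 0]]
f = [0.2, 0.5, 0.3]
d = [0.85, 0.85, 0.85]
print(dawnrank_scores(A, f, d))
```

 With a damping factor of 0 a gene's rank equals its prior; as the damping factor approaches 1 the rank is determined by the network term alone.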
  • 16. Tumor heterogeneity vs. gene networks • NCIS: Liu et al. BMC Bioinfo 2014 • C3: Hou et al. Bioinformatics 2016 • LDGM: Tian, Gu, and Ma, Nucleic Acids Res 2016 [Network figure whose node labels are genes of the estrogen signaling pathway, e.g., ESR1, EGFR, AKT1, PIK3CA, MAPK1; panels compare LDGM, Glasso, JGL and CNJGL by the degree of ESR1 and the rank of ESR1 against the number of interactions, with edges colored by % degree from Luminal A and % degree from Basal-like] Figure 5: Differential networks on the estrogen signaling pathway reconstructed from gene expression data of the breast cancer Luminal A and Basal-like subtypes. (A) The degree of ESR1 in estimated differential networks with increasing number of interactions. (B) The rank of ESR1 by its degree in differential networks. The number of interactions is up to 1,000 in (A) and (B). (C) A differential network estimated by LDGM with [parameter symbol lost in extraction] = 0.362. Node size is proportional to the node’s degree. The width of an interaction between i and j is proportional to the score |b_ij|. The origin of interactions in the differential network is inferred by a principle-of-majority approach based on Glasso (see Supplementary Text).
  • 17. Deep learning applications: from more traditional machine learning applications to deep learning applications [Figure 1 from Angermueller et al.] Figure 1. Machine learning and representation learning. (A) The classical machine learning workflow can be broken down into four steps: data pre-processing, feature extraction, model learning and model evaluation. (B) Supervised machine learning methods relate input features x to an output label y (e.g., linear regression, logistic regression, random forest, SVM, …), whereas unsupervised methods learn factors about x without observed labels (e.g., PCA, factor analysis, clustering, outlier detection, …). (C) Raw input data are often high-dimensional and related to the corresponding label in a complicated way, which is challenging for many classical machine learning algorithms (left plot). Alternatively, higher-level features extracted using a deep model may be able to better discriminate between classes (right plot). (D) Deep networks use a hierarchical structure to learn increasingly abstract feature representations from the raw data. Angermueller et al., Deep learning for computational biology, Molecular Systems Biology (Mol Sys Bio), published online July 29, 2016
  • 18. DeepBind [Figure 1 from Alipanahi et al.] Figure 1: DeepBind’s input data, training procedure and applications. 1. The sequence specificities of DNA- and RNA-binding proteins can now be measured by several types of high-throughput assay, including PBM, SELEX, and ChIP- and CLIP-seq techniques. 2. DeepBind captures these binding specificities from raw sequence data by jointly discovering new sequence motifs along with rules for combining them into a predictive binding score. Graphics processing units (GPUs) are used to automatically train high-quality models, with expert tuning allowed but not required. 3. The resulting DeepBind models can then be used to identify binding sites in test sequences and … Alipanahi et al. Nat Biotech 2015 (Other methods: DeepSEA, Zhou & Troyanskaya, Nat Methods 2015; DanQ, Quang & Xie, Nucleic Acids Res 2016)
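 The idea in the DeepBind caption, scoring raw sequence with learned motif detectors, can be illustrated with a single hand-written filter: one-hot encode the DNA, slide the filter along it (a 1-D convolution), rectify, and take the maximum. This is a simplified sketch and not DeepBind's architecture or code; the filter weights, the bias, and all function names are hypothetical, and a real model would learn many filters and combine them in further layers.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Any other character (e.g. N) becomes an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def motif_scan(seq, motif_weights, bias=0.0):
    """Slide one motif filter along the sequence and apply a ReLU.

    motif_weights -- list of rows, one per motif position, each row holding
                     the weights for A, C, G, T (hypothetical, hand-written)
    """
    x = one_hot(seq)
    width = len(motif_weights)
    scores = []
    for start in range(len(x) - width + 1):
        s = bias
        for k in range(width):
            s += sum(w * v for w, v in zip(motif_weights[k], x[start + k]))
        scores.append(max(0.0, s))  # rectification
    return scores


def binding_score(seq, motif_weights, bias=0.0):
    """Max-pool the rectified scan into a single score for the sequence."""
    scores = motif_scan(seq, motif_weights, bias)
    return max(scores) if scores else 0.0


# Hypothetical filter that responds only to an exact "TAG" match.
tag_filter = [
    [0.0, 0.0, 0.0, 1.0],  # T
    [1.0, 0.0, 0.0, 0.0],  # A
    [0.0, 0.0, 1.0, 0.0],  # G
]
print(motif_scan("CCTAGCC", tag_filter, bias=-2.0))
print(binding_score("CCCCC", tag_filter, bias=-2.0))
```

 The negative bias acts as a threshold: only windows that match the motif at all three positions survive the rectification.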
  • 20. Structural variations (SVs) in cancer genomes • SV types: inversion, translocation, gain, loss, duplication • Whole genome sequencing methods: DELLY, Meerkat, BreakDancer, CREST, CNVnator, CONSERTING, and many others
  • 21. Aneuploidy: a common feature of cancer cells [Image labeled MCF-7, http://www.path.cam.ac.uk/~pawefish/] • Allele-specific copy number (ASCN) tools: ABSOLUTE, ASCAT, Patchwork • SVs can further modify the aneuploid cancer genome into a mixture of genomic segments with an extensive range of CNAs • We need methods that combine SV and ASCN • How do SVs interact with ASCNs? How do different SVs interact with each other? [Figure from Zack et al. Nature Genetics 2013: purity, ploidy, and percent of samples with whole-genome doubling (WGD) for LUAD, LUSC, HNSC, KIRC, BRCA, BLCA, CRC, UCEC, GBM and OV; samples across all lineages are classed as near diploid, 1 WGD, or 2+ WGD, with amplification and deletion profiles for near-diploid and WGD samples]
  • 22. Goal: quantify allele-specific SVs
  • 23. Weaver: algorithm overview [Workflow diagram] • Input: BAM file, SV list, SNP list, 1KGP haplotypes • Intermediate data: coverage from read mapping (with mappability and GC content), SNP linkage, SNP LD, and the Cancer Genome Graph (regions R1 to R21 on chrA and chrB connected by interchromosomal, intrachromosomal, deletion and duplication SVs m, n, p, q, s, t) • Model: probabilistic graphical model (Markov random field) • Output: purity, ASCNG, ASCNS, timing of SV, phasing. Li et al. Cell Systems 2016
  • 24. [Figure 1 from the Weaver paper. Panel (A) lists the analyses: purity, ASCNG, ASCNS, timing of SV, phasing. Panel (B) is a genome browser view of chr9 (21,850,000 to 22,250,000) covering MTAP, C9orf53, CDKN2A, CDKN2B-AS1 and CDKN2B, with a coverage track annotated with LOH & first amplification, deletions Del1 and Del2, and second amplification] Figure 1: (A) Schematic diagram of Weaver. Dark green boxes show the types of analysis unique to Weaver that are not handled by other methods, while light green boxes show ‘by-products’ of Weaver shown to improve over existing methods. (B) An example demonstrating a Weaver output focused on ASCNS and timing of SV. Dark blue segments (two copies) and the light blue segment (one copy) represent a portion of the MCF-7 genome that originated from the same allele on chr9. The other allele was lost during tumorigenesis, resulting in LOH. The predicted evolution of this region …
  • 25. Cancer Genome Graph and MRF representation • MRF: genome node, cancer node, genome edge, cancer edge
 Figure 5: (A) Hypothetical cancer chromosomes with rearrangement structure hidden. Orange and blue segments represent the paternal/maternal alleles. Red dashed lines represent linkages by SVs. (B) The Cancer Genome Graph, constructed from (A), with nodes (boxes) representing genomic regions and edges representing reference (solid lines) or cancer (dashed …
 [Figure panels (D) and (E), example inputs and outputs. Genomic regions (label: cov, allele_freq → CN_1, CN_2): R1: 30, 0.33 → 2, 1; R2: 40, 0.5 → 2, 2; R5: 20, 0 → 2, 0; R7: 10, 0 → 1, 0; R12: 20, 0.5 → 1, 1; R17: 10, 0 → 0, 1. SVs (label: L_pos, R_pos → L_allele, R_allele, CN): m: +R2, -R2 → 2, 2, 1; n: +R6, -R10 → 1, 1, 1; p: +R4, -R16 → 2, 2, 1; q: -R12, -R21 → 2, 2, 1; s: +R14, +R18 → 1, 2, 2; t: +R16, -R18 → 1, 1, 2.]
 Excerpts from the paper shown on the slide (partly clipped): It is known that most of the time ASCNG boundaries coincide with SV breakpoints [14]. Segmentation using SV boundaries has the advantage of providing base-level ASCNG boundaries, compared with existing segmentation methods in copy number analysis, which typically use a fixed segmentation size. Given the segmentation of the genome and the SV set C, we then build the Cancer Genome Graph $G := \{R, E\}$ (Fig. 5(B)), with nodes representing genomic region sets ($R$) and edges representing reference adjacencies ($E_r$; solid lines in the figure) if two nodes are adjacent in the normal genome, or cancer adjacencies ($E_c$; dashed lines in the figure) if two nodes are adjacent in the cancer genome through an SV. The linkage configurations $E$ between nodes $R_i$ and $R_j$ can be represented as $(\alpha_i R_i \sim \alpha_j R_j)$, $\alpha \in \{+, -\}$, with $+$ and $-$ representing the tail (right) and head (left) of a given genomic region $R$; e.g., $(+R_i \sim -R_{i+1}) \in E_r$ if $R_i$ and $R_{i+1}$ are adjacent regions from the same chromosome in the normal genome. We then convert the original Cancer Genome Graph $G := \{R, E\}$ into a Markov Random Field (MRF, $M := \{R, R_c, E_r, E_c\}$), which is a widely used probabilistic graphical model to encode joint probabilities. The MRF can be viewed as an undirected graph, and the aggregated inference problem in Weaver given sequencing data can be viewed as a maximum a posteriori (MAP) problem with hidden states and observations, explained in the following sections. Unlike conventional methods for estimating copy number changes, such as hidden Markov models (HMMs), which are designed for sequential data and only consider the dependencies between ‘local’ variables, the MAP solution of the MRF model provides the most probable configuration of aneuploid cancer genomes with complex SVs, involving ‘global’ variable dependencies defined by long-range SVs. Detailed steps are described in Supplementary Note 6. In the following sections, we describe hidden states, observations, and the formal function of the MRF MAP problem. Details on potential functions on nodes and edges are described in the Supplementary Note.
 Hidden states H. (i) The LD score for the phasing $G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b$ is
 $LD(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \dfrac{N_{LD}(G_i^a, G_{i+1}^a) \times N_{LD}(G_i^b, G_{i+1}^b)}{N_{LD}(G_i^a, G_{i+1}^a) \times N_{LD}(G_i^b, G_{i+1}^b) + N_{LD}(G_i^a, G_{i+1}^b) \times N_{LD}(G_i^b, G_{i+1}^a)}$
 where $N_{LD}(G_i^a, G_{i+1}^a)$ is the number of phased haplotypes (total number $1092 \times 2$ in phase 1) in 1KGP with genotype $(G_i^a, G_{i+1}^a)$. Other genotype configurations can be calculated similarly. (ii) Similarly, we define the read linkage score for the phasing $G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b$ as
 $RL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \dfrac{N_{RL}(G_i^a, G_{i+1}^a) + N_{RL}(G_i^b, G_{i+1}^b)}{N_{RL}(R_i, R_{i+1})}$
 where $N_{RL}(R_i, R_{i+1})$ is the total number of reads covering genomic regions $(R_i, R_{i+1})$ and $N_{RL}(G_i^a, G_{i+1}^a)$ is the total number of reads covering $(G_i^a, G_{i+1}^a)$. If there are no reads covering $(R_i, R_{i+1})$ (i.e., $N_{RL}(R_i, R_{i+1}) = 0$), $RL = 0$. Therefore, we define genotype linkage as
 $GL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \log\big(LD(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) \cdot RL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b)\big)$
 In real data applications, we have found that RL and LD correlate very well. For example, in the MCF-7 analysis, when we chose SNP pairs with 100% RL support as the gold standard, we found AUC = 0.9964 using LD scores.
 Markov random field model M. After we convert G into the MRF M using the steps in Supplementary Note 6, the MRF MAP problem is given by
 $\hat{H} = \arg\max_{H} \Big\{ \sum_{i \in R} \Theta_R(O \mid H_i) + \sum_{c \in C} \Theta_C(O \mid H_c) + \sum_{i \in R} \Psi_R(O \mid H_i, H_{i+1}) + \sum_{c \in C} \sum_{i \in N(c)} \Psi_C(H_i, H_c) \Big\}$
 where the four terms are, in order, the genome node potential function, the cancer node potential function, the genome edge potential function, and the cancer edge potential function.
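 The phasing scores on this slide can be written as three small functions: LD = N_LD(a,a)·N_LD(b,b) / (N_LD(a,a)·N_LD(b,b) + N_LD(a,b)·N_LD(b,a)), RL = (N_RL(a,a) + N_RL(b,b)) / N_RL(R_i, R_i+1), and GL = log(LD · RL). The sketch below is not Weaver's code. The function and argument names are hypothetical, the LD numerator follows a reading of the garbled slide text, and returning 0 for LD when no haplotype supports either phasing is an assumption of this sketch (the slide states the zero-coverage rule only for RL).

```python
import math


def ld_score(n_aa, n_bb, n_ab, n_ba):
    """LD support for phasing (a,a)/(b,b) against the alternative (a,b)/(b,a).

    Each argument is a count of phased reference haplotypes carrying that
    pair of alleles at neighbouring SNPs i and i+1.
    """
    same = n_aa * n_bb
    cross = n_ab * n_ba
    total = same + cross
    return same / total if total > 0 else 0.0  # zero-total rule is assumed


def rl_score(reads_aa, reads_bb, reads_total):
    """Fraction of reads spanning both regions that support the phasing.

    Defined as 0 when no read covers both regions, as stated on the slide.
    """
    if reads_total == 0:
        return 0.0
    return (reads_aa + reads_bb) / reads_total


def genotype_linkage(ld, rl):
    """GL = log(LD * RL); minus infinity when the product is zero."""
    product = ld * rl
    return math.log(product) if product > 0 else float("-inf")


# Hypothetical counts for one pair of neighbouring SNPs.
ld = ld_score(n_aa=900, n_bb=800, n_ab=10, n_ba=20)
rl = rl_score(reads_aa=8, reads_bb=7, reads_total=20)
print(ld, rl, genotype_linkage(ld, rl))
```

 Because GL is a logarithm of a product, a phasing with no read support gets a score of minus infinity in this sketch; a practical implementation might add pseudocounts instead, which is a design choice the slide does not discuss.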
  • 26. ASCN and SVs in MCF-7 • 83% of SVs have copy number > 1 • 68% of the regions have imbalanced copy number • We found 276 SVs that occurred after whole-chromosome duplication • We used physical mapping to validate the results
  • 27. ASCN and SVs in HeLa • WGS reads obtained from Adey et al. Nature 2013 • ASCNG calls are 97% consistent with Adey et al. (fosmid sequencing) [Figure from Adey et al. Nature, 2013: overview of the HeLa genome by chromosome (1 to 22 and X) showing haplotype A and haplotype B, linked positions and marker chromosome names, the HPV integration locus, tandem duplications, probable contiguity, and LOH regions; a second panel shows window ratios and copy-number calls along chr18.] Excerpt shown on the slide (partly clipped): Structural variants were identified by clustering discordantly mapped reads from 40-kb and 3-kb mate-pair libraries (Supplementary Fig. 8). Twenty interchromosomal links were identified, including links for marker chromosomes M11 (9q33–11p14) and M14 (13q21–19p13). In addition, 209 HeLa-specific deletions and 8 inversions were found (Supplementary Figs 9 and 11, and Supplementary Table 10). Only two genes that are impacted by HeLa-specific structural rearrangements (Supplementary Table 11) intersected with SCGC (STK11 (ref. 18), FHIT), both of which are recurrently deleted in cervical …
  • 28. Application to TCGA Data • Inter-chromosomal chromothripsis • Breakage-fusion-bridge amplifications [Figure for sample TCGA-36-1571, coverage scale 1X to 62X, with panels for chr4, chr22, chr14 and the gene FOXG1] Supplementary Figure 14: (A) Overview of the genomic landscape of a TCGA ovarian cancer sample (TCGA-36-1571). (B) Most of the SVs linking chr4 and chr22 have copy number one. (C) Three high-coverage fold-back inversions are observed at the boundaries of a highly amplified region of chr14, indicating that many rounds of BFB cycles occurred. The gene FOXG1 is on the fold-back inversion boundary and is highly amplified.