The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian Ma, CMU

Machine Learning Applications in
Computational Genomics
— Some new algorithms for understanding cancer genomes
Jian Ma
Computational Biology Department
School of Computer Science

TCTCTCAGAGGGCCCTGATGGAAGAATCCCCCTACCACCCTTCCAGGCTGACTTCTGTCTATTTCTCCTGCAGAGTGAGCTGGACTTGGAAAAGGGCTTG
GAGATGAGAAAATGGGTCCTGTCGGGAATCCTGGCTAGCGAGGAGACTTACCTGAGCCACCTGGAGGCACTGCTGCTGGTGAGGAGGATTTAGGGAGCTG
AGCAGGGCGGGATGGGGCAGGGTGACAGGGTTGGGGAGCCTCTTTGCCCTTAAGTCCCAGGTCAGCTGTCAGAGCCTGGGTGCAGCTCGCCATCCCTGGA
GTGGATACCAGTGGAAGACTGAGTTGCCAAACCAAGCTGGTTTTAAAATTGTATTTGTTATGTGATTTAAAAATAAAAGTGCATATGTCAGGTAACCATG
ACTGTCTACTGCCATACAATGCACCTGACGGATGGCAGCCCCTCTCACCTGTGCTACCTCACTTGTGCCCTCTTCCAGCCCATGAAGCCTTTGAAAGCCG
CTGCCACCACCTCTCAGCCGGTGCTGACGAGTCAGCAGATCGAGACCATCTTCTTCAAAGTGCCTGAGCTCTACGAGATCCACAAGGAGTTCTATGATGG
GCTCTTCCCCCGCGTGCAGCAGTGGAGCCACCAGCAGCGGGTGGGCGACCTCTTCCAGAAGCTGGTGAGTAACCCAGGGCCGGTGCTGGGACTACAGGCG
TGTACCACCACGTCCAGCTAATTTTTTGCATTTTTAGTAGAGACAGGGTTTTGCTATGTTGGCCAGGCTGGTCTCAAACTCCTAACCTCAAGTGATCCAC
CTGCCTCAGCCTCCCAAAGTACTGAGATTACAGGCGTGAGCCGCCATGCCCAGCCTTTTTTTTTTTTTTTCTAATTTATATTTATTTAGATAGTTATTTT
TAAAAAGAGATGGGGACTTACTACGTTGTCCAGGCTGGAGTGCAGTGGCTATTCACAGGCGCAATTCCACTGCTCATCAGCACGGGAGTTTTGACCTCCT
TCCTTTCCAACCTTGGCTGTTTCACTCCTTCTTAGGCAAACTGATGGTTCCCGACTCCTGGGAGGTCACCATATTGATGCCAAACTTAGTGTGTAGTGCA
CTACAGCCCAGAACTCCTGACTGAAGCCATCCTCCGGCCTCAGCCTTCCGCGTAGCTGGGGCTATAGGTGCACGCCACCACACCCTGTGTGTGGCTGGGA
CTACAGGTGCACGCCATCACACCCTGTGTGCGCCATCACACCCTGTGTGCACCATCACACCCTGTGTGCACACACTTTCCCTAAAGCAGGCTTCCTCCGC
TGGGAAACAAGTCCTCTAGGGGCAGGTGTGGCCAGAGGCCAGGCCCCCCTCTAAGTGTGAAGAGCATGTGATTCCTTAAAAGCCCTTCCCCCAGCACTTC
TGGACTACCGAGACACACAGCTCTGGCCTCGGGCCTCCCCTTGGCTGGTGCTGGGGGCTGAGTTTTCTGCTCTGAGGTGTGGCTTTCCTGTAGGGGGACC
CCTCCCTCTGCCACCCTGTGCTGCAGACCCCCAGACTCCAGGCCAGAGCTAAGGCTTGAGGAACACAGAAGGCACTTAATTTGTTCCAGTTCTTGCTCCC
TGGGGCTCTTTCCCCCATGGCCAGAGAGCAGGAGGCTGTATTTTGATACATGCTGCCCCCTCCATCTTTGAAGCCCCCCCACCCCCGTTTCTCCGTGTGT
GTGTCAGCAGTTTTAAACCTAGTGGAGGGTGGTGGCTCGGGCTGGGCTCCGCGTCGGGCTGCCCCGCAGCTGCTCTTGGGCAGCCAGGGCCGCTGGGTGT
GGGGCCGCCGGGAATGGCGGGCCCGGGTGAGGGCGGGCCCGGGTGAGGGCGGGGGCGGAGAGGCGAAGAAGCTGCAGGAAGGGAGGGTGACGAGGGGGAA
GCGAAGGAAGGGGAAGAGGAAGGGAAAAGCGAGCGAGAGGGGCAAGGCGGAAGAGGAAGCAGGGCGGAAGGGAAGCCCGGGCCGCAGACGGCGAAGGAGG
CAGCGGGCCGGGGGCTGAGGCGGGAGCGAGGACACGCCCAAGAGAGGAAGCAGAGGGAGGCGGAAGCGTGGAGGAAGGGGCGAGAGGCATCATCAAAGGA
GATGAGGGGAGCGTAGGGGCCGGGAAAGAGGCACAAGGAAGAAAGTATGGGAAGGAGGAATGGAGGGTCAGGGCTAGGCGGCGGGAGGGCGCCAGGCCGG
GAAGAGTACAAGGACAAGGAGGTCAGGTTTGGGCCTACATCCCGGGGACAGGGGCGGCCATGGCGGCGGCAGCCAGGGAGGAGGAGGAGGAGGCGGCTCG
GGAGTCAGCCGCCTGCCCGGCTGCGGGGCCAGCGCTCTGGCGCCTGCCGGAAGTGCTGCTGCTGCACATGTGCTCCTACCTCGACATGCGGGCCCTCGGC
CGCCTGGCCCAGGTGTACCGCTGGCTGTGGCACTTCACCAACTGCGACCTGCTCCGGCGCCAGATAGCCTGGGCCTCGCTCAACTCCGGCTTCACGCGGC
TCGGCACCAACCTGATGACCAGTGTCCCAGTGAAGGTGTCTCAGAACTGGATAGTGGGGTGCTGCCGAGAGGGGATTCTGCTGAAGTGGAGATGCAGTCA
GATGCCCTGGATGCAGCTAGAGGATGATGCTTTGTACATATCCCAGGCTAATTTCATCCTGGCCTACCAGTTCCGTCCAGATGGTGCCAGCTTGAACCGT
CAGCCTCTGGGAGTCTGCTGGGCATGATGAGGACGTTTGCCACTTTGTGCTGGCCACCTCGCATATTGTCAGTGCAGGAGGAGATGGGAAGATTGGCCTT
GGTAAGATTCACAGCACCTTCGCTGCCAAGTACTGGGCTCATGAACAGGAGGTGAACTGTGTGGATTGCAAAGGGGGCATCATATCATTGTGAGTGGCTC
CAGGGACAGGACGGCCAAGGTGTGGCCTTTGGCCTCAGGCCAGCTGGGGTAGTGTTTATACACCATCCAGACTGAAGACCAAATCTGGTCTGTTGCTATC
Fundamental question: How do changes in genome sequences
give rise to phenotypic differences (e.g., disease states)?
! When they entered the genome and how they have evolved
! Their roles in genome organization and gene regulation in
human biology
! Their implications for human diseases such as cancer
Our goal — from base-pairs to bedside

Why Computational Genomics?
! Key to personalized precision medicine, especially for cancer
David Patterson
! Cancer research has become big data science
! How to store and manage data efficiently
! How to analyze data in a distributed environment
! How to enhance data security but reduce barriers for sharing
! How to extract meaningful patterns
! How to identify mechanisms to help treatment
! …

The Human genome:
the “blueprint” of our body
GTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGA
TTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCC
CCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATT
AGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCT
ATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAAC
ATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATT
James Watson 
Francis Crick
February 15, 2001
March, 2011

DNA, Chromosome, and Genome
[Figure 4-72, Chromatin packing: a model showing some of the many levels of chromatin packing postulated to give rise to the highly condensed mitotic chromosome. Levels shown: DNA double helix; "beads-on-a-string" form of chromatin; 30-nm chromatin fiber of packed nucleosomes; section of chromosome in extended form; condensed section of chromosome; entire mitotic chromosome. Source: Chapter 4: DNA, Chromosomes, and Genomes.]
[Figure 4-4, the DNA double helix: double-stranded DNA built from nucleotides (sugar, phosphate, base), with sugar-phosphate backbones running 5' to 3' in antiparallel orientation and hydrogen-bonded base pairs (adenine with thymine shown).]
Complementary base-pairing enables the base pairs to be packed in the energetically most favorable arrangement in the interior of the double helix. In this arrangement, each base pair is of similar width, thus holding the sugar-phosphate backbones an equal distance apart along the DNA molecule.

DNA, RNA, Protein
! Central Dogma in molecular biology
• DNA
• RNA
• Protein
! In general, proteins do most of the work, and are encoded by
subsequences of DNA, known as genes.
! However, less than 2% of the human genome codes for
proteins.

Most of the genome is non-coding
[Figure: composition of the human genome by sequence class. © 2005 Nature Publishing Group]
LINEs: 20.4%
SINEs: 13.1%
LTR retrotransposons: 8.3%
DNA transposons: 2.9%
Simple sequence repeats: 3%
Segmental duplications: 5%
Miscellaneous heterochromatin: 8%
Miscellaneous unique sequences: 11.6%
Introns: 25.9%
Protein-coding genes: 1.5%
Caption excerpt: Class II elements transpose directly from DNA to DNA, and include DNA transposons and [...] repeat transposable elements (MITEs). [...] (and especially their extinct remnants) make up a large portion of the human genome, with [...] (for example, the SINE Alu element) present in more than a million copies. For a review of transposable-element structure, origins, impacts and evolution see REF. 17.
www.nature.com/reviews/genetics
Nat Rev Genet, 2005

Most functional information is non-coding
! About 5% of the genome is highly conserved, but only 1.5% encodes proteins
[Figure: UCSC Genome Browser view of the DLX1/DLX2 locus on chr2 (q31.1), positions 172,660,000 to 172,675,000. Tracks: UCSC Genes (based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics) and Vertebrate Multiz Alignment & PhastCons Conservation (28 species, from chimp and rhesus through platypus, lizard, chicken, and fish). A zoomed panel shows a base-level alignment of a DLX1 coding segment (amino acids K P R T I Y S S L Q L Q A L N) that is nearly identical across species.]
What do they do?

Annotating the non-coding regions
[Figure: UCSC Genome Browser view (hg19, chr2, about 20,090,000 to 20,100,000, 10 kb scale) near TTC32, showing NKI LADs and LaminB1 (Tig3) tracks together with signal tracks for GM12878 and K562 cells labeled CHD2, Pol2, Rad2, TBP, and Z274, including K562 Pol2 tracks under IFa3, IFa6, IFg3, and IFg6 conditions.]

ChromHMM (Ernst and Kellis, Nature Methods 2012)
Excerpts from the paper, as far as they are legible on the slide:
[...] the observed combination of chromatin marks using [...] independent Bernoulli random variables, which [...] learning of complex patterns of many chromatin [...]. As input, it receives a list of aligned reads for each [...] mark, which are automatically converted into presence [...] calls for each mark across the genome, based on [...] background distribution. One can use an optional [...] of aligned reads for a control dataset to either adjust [...] for present or absent calls, or as an additional input [...]. [...] the user can input files that contain calls from [...] peak caller. By default, chromatin states are analyzed [...] base-pair intervals that roughly approximate nucleo[...].
ChromHMM also enables the analy[...] across multiple cell types. When the [...] common across the cell types, a common m[...] a virtual 'concatenation' of the chrom[...]. Alternatively a model can be learned by a virtual 'stacking' of all marks across cell types, or independent models can be learned in each cell type. Lastly, ChromHMM supports the comparison of models with different number of chromatin states based on correlations in their emission parameters (Supplementary Fig. 4). We wrote the software in Java, which allows it to be run on virtually any computer. ChromHMM and additional documentation is freely available at http://compbio.mit.edu/ChromHMM/.

[Figure: (a) chromatin-state annotation from ChromHMM for GM12878 on chr4, about 103,650,000 to 103,750,000 (50 kb scale), with RefSeq genes NFKB1 and MANBA; (b) emission parameters; (c) transition parameters; GM12878 fold enrichments.]
Chromatin states (user order): 1_Active_Promoter, 2_Weak_Promoter, 3_Poised_Promoter, 4_Strong_Enhancer, 5_Strong_Enhancer, 6_Weak_Enhancer, 7_Weak_Enhancer, 8_Insulator, 9_Txn_Transition, 10_Txn_Elongation, 11_Weak_Txn, 12_Repressed, 13_Heterochrom/lo, 14_Repetitive/CNV, 15_Repetitive/CNV
Marks: CTCF, H3K27me3, H3K36me3, H4K20me1, H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, WCE
Enrichment categories: Genome (%), RefSeq TSS, CpG island, RefSeq TSS 2kb, RefSeq exon, RefSeq gene, RefSeq TES, Conserved, Lamina
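
The ChromHMM excerpt on this slide describes modeling the observed combination of chromatin marks with independent Bernoulli random variables. A minimal sketch of that emission model follows, assuming made-up state names and made-up per-mark probabilities; it is not ChromHMM's code and it leaves out the transition part of the hidden Markov model.

```python
import math

def emission_log_prob(observed, p_present):
    """Log-probability of a binary mark vector under one chromatin state.

    observed:  list of 0/1 presence calls, one per chromatin mark
    p_present: list of per-mark presence probabilities for the state
    Marks are treated as independent Bernoulli variables, so the
    log-probabilities simply add up.
    """
    logp = 0.0
    for call, p in zip(observed, p_present):
        logp += math.log(p if call == 1 else 1.0 - p)
    return logp

# Hypothetical emission parameters for three marks,
# in the order H3K4me3, H3K27ac, H3K27me3 (values are illustrative only).
states = {
    "promoter_like": [0.90, 0.80, 0.10],
    "repressed_like": [0.05, 0.10, 0.90],
}

# One genomic interval where H3K4me3 and H3K27ac are called present.
calls = [1, 1, 0]
best_state = max(states, key=lambda s: emission_log_prob(calls, states[s]))
```

In a full model these emission scores would be combined with transition probabilities between neighboring intervals, for example through the forward-backward or Viterbi algorithm.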

Each type of cancer is different
[Figure: number of non-synonymous mutations per tumor (median +/- one quartile) in representative human cancers, detected by genome-wide sequencing studies. Cancers are grouped as mutagen-associated, adult solid tumors, liquid tumors, and pediatric tumors. Types shown include colorectal (MSI and MSS), lung (SCLC, NSCLC, never-smoked NSCLC), melanoma, esophageal (ESCC, EAC), non-Hodgkin lymphoma, head and neck, gastric, endometrial (endometrioid, serous), pancreatic adenocarcinoma, ovarian (high-grade serous), prostate, hepatocellular, glioblastoma, breast, chronic lymphocytic leukemia, acute myeloid leukemia, neuroblastoma, acute lymphoblastic leukemia, medulloblastoma, and rhabdoid.]
Vogelstein et al.
Science 2013

Each individual tumor is different
! Data from TCGA’s analyses show that most cancer types have
a great number of mutations that occur at a low frequency.
! Long-tail distribution
[Figure: genome-wide Circos plots of somatic rearrangements in all 24 breast cancers in the study (Supplementary Figure 2; doi: 10.1038/nature08645). Supplementary Figure 1 of the same study shows haploid physical coverage of the breast cancer samples, where physical coverage is the number of DNA fragments with both ends sequenced that on average overlie any position in the genome.]
Stephens et al. Nature 2009

Analyzing gene expression data
Supervised learning vs. unsupervised learning
[Figure: gene expression matrices, genes by samples]

How to deal with high dimension?
Identify the most important genes
DawnRank: Personalized Driver Alterations
▪ A ranking framework based on PageRank that considers the impact of genes in the network
▪ Impact includes connectivity and the extent to which downstream genes are differentially expressed
▪ A dynamic damping factor is used to improve on the original PageRank in ranking genes
[Figure: pipeline elements shown are Gene Network, Gene Expression, Somatic Alteration Data (SNP, CNV, etc.), Ranks of Genes, and Personalized Driver Alterations]
$$ r_j^{t+1} = (1 - d_j) f_j + d_j \sum_{i=1}^{N} \frac{A_{ji}\, r_i^{t}}{\deg_i}, \qquad \deg_i = \sum_{j=1}^{N} A_{ji} $$
! d is the damping factor, a parameter representing the extent to which the ranking depends on the structure of the graph.
! f is the prior probability of the gene, which we set to the absolute differential expression.
! \deg_i is the in-degree of i
Hou and Ma, Genome Med 2014
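
The update rule on this slide can be sketched as a short fixed-point iteration. This is only an illustration of the formula as written, not the DawnRank implementation: the function name, the toy network, the constant damping values, and the normalization of f to sum to one are all assumptions made for the example.

```python
def rank_genes(A, f, d, n_iter=100, tol=1e-12):
    """Iterate r_j <- (1 - d_j) * f_j + d_j * sum_i A[j][i] * r_i / deg_i.

    A: N x N list of lists; A[j][i] = 1 means gene i passes rank to gene j
    f: prior per gene (e.g., absolute differential expression);
       normalized here to sum to 1 (an assumption of this sketch)
    d: per-gene damping factors (hypothetical constants in the example)
    """
    n = len(f)
    total = sum(f)
    prior = [x / total for x in f]
    # deg_i = sum_j A[j][i], as defined on the slide
    deg = [sum(A[j][i] for j in range(n)) for i in range(n)]
    r = [1.0 / n] * n
    for _ in range(n_iter):
        new_r = []
        for j in range(n):
            flow = sum(A[j][i] * r[i] / deg[i] for i in range(n) if deg[i] > 0)
            new_r.append((1 - d[j]) * prior[j] + d[j] * flow)
        converged = sum(abs(a - b) for a, b in zip(new_r, r)) < tol
        r = new_r
        if converged:
            break
    return r

# Toy 3-gene chain in which rank flows 0 -> 1 -> 2 (made-up data).
A = [[0, 0, 0],
     [1, 0, 0],
     [0, 1, 0]]
ranks = rank_genes(A, f=[1.0, 1.0, 1.0], d=[0.5, 0.5, 0.5])
```

With equal priors, the gene at the receiving end of the chain accumulates the highest score, which shows how the network term shifts the ranking away from the prior alone.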

Tumor heterogeneity vs.
gene networks
NCIS - Liu et al. BMC Bioinfo 2014
C3 - Hou et al. Bioinformatics 2016
LDGM - Tian, Gu, and Ma, Nucleic Acids Res 2016
[Figure 5: Differential networks on the estrogen signaling pathway reconstructed from gene expression data of the breast cancer Luminal A and Basal-like subtypes. (A) The degree of ESR1 in estimated differential networks with an increasing number of interactions. (B) The rank of ESR1 by its degree in differential networks. The number of interactions is up to 1,000 in (A) and (B). (C) A differential network estimated by LDGM with [parameter] = 0.362. Node size is proportional to the node's degree. The width of an interaction between i and j is proportional to its score. The origin of interactions in the differential network (% degree from Luminal A, % degree from Basal-like) is inferred by a principle-of-majority approach based on Glasso (see Supplementary Text). Methods compared: LDGM, Glasso, JGL, CNJGL. Network nodes are estrogen signaling pathway genes such as ESR1, EGFR, AKT1, PIK3CA, MAPK1, MAPK3, SRC, FOS, and JUN.]

Deep learning applications
From more traditional machine learning applications to deep learning applications
[Figure 1. Machine learning and representation learning.
(A) The classical machine learning workflow can be broken down into four steps: data pre-processing, feature extraction, model learning and model evaluation. (B) Supervised machine learning methods relate input features x to an output label y, whereas unsupervised method learns factors about x without observed labels. (C) Raw input data are often high-dimensional and related to the corresponding label in a complicated way, which is challenging for many classical machine learning algorithms (left plot). Alternatively, higher-level features extracted using a deep model may be able to better discriminate between classes (right plot). (D) Deep networks use a hierarchical structure to learn increasingly abstract feature representations from the raw data.
Panel labels: Raw data, Pre-processing, Clean data, Feature extraction, Features, Training, Model, Evaluation, Results; Raw data, Layer 1, Layer 2, with output classes Intron, Exon, TSS.]
Supervised:
• Linear regression
• Logistic regression
• Random Forest
• SVM
• …
Unsupervised:
• PCA
• Factor analysis
• Clustering
• Outlier detection
• …
Angermueller et al., Mol Sys Bio 2016 (Deep learning for computational biology, Molecular Systems Biology, published online July 29, 2016)

DeepBind
[Figure 1: DeepBind’s input data, training procedure and applications. 1. The sequence specificities of DNA- and RNA-binding proteins can now be measured by several types of high-throughput assay, including PBM, SELEX, and ChIP- and CLIP-seq techniques. 2. DeepBind captures these binding specificities from raw sequence data by jointly discovering new sequence motifs along with rules for combining them into a predictive binding score. Graphics processing units (GPUs) are used to automatically train high-quality models, with expert tuning allowed but not required. 3. The resulting DeepBind models can then be used to identify binding sites in test sequences and [...]]
Alipanahi et al. Nat Biotech 2015
(Other methods: DeepSEA (Zhou & Troyanskaya, Nat Methods 2015);
DanQ (Quang & Xie, Nucleic Acids Res 2016))
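
To make the idea of scoring raw sequence with learned motifs concrete, here is a heavily simplified sketch: one-hot encode the DNA, slide a single motif filter along it, rectify, and keep the maximum activation. The filter weights, function names, and sequences are invented for illustration. DeepBind itself learns many filters and feeds the pooled activations into a further network, none of which is reproduced here.

```python
def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table[base] for base in seq]

def motif_scan_score(seq, motif_filter, bias=0.0):
    """Convolve one motif filter over the sequence, rectify, and max-pool.

    motif_filter: list of rows (one per motif position), each holding
    4 weights in A, C, G, T order.
    """
    x = one_hot(seq)
    width = len(motif_filter)
    best = 0.0  # rectification: activations below zero are clipped to zero
    for start in range(len(x) - width + 1):
        activation = bias
        for k in range(width):
            activation += sum(w * v for w, v in zip(motif_filter[k], x[start + k]))
        best = max(best, activation)
    return best

# Hypothetical filter that rewards the 3-mer "TGA" (weights are made up).
tga_filter = [[0, 0, 0, 1],   # position 1 prefers T
              [0, 0, 1, 0],   # position 2 prefers G
              [1, 0, 0, 0]]   # position 3 prefers A

score_with_motif = motif_scan_score("CCTGACC", tga_filter)
score_without_motif = motif_scan_score("CCCCCCC", tga_filter)
```

Because the score is a maximum over positions, it does not depend on where the motif occurs in the sequence, which is one reason this design suits binding-site detection.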

Cancer genome
MCF-7 http://www.path.cam.ac.uk/~pawefish/

Structural variations (SVs) in cancer genomes
! Types shown: inversion, translocation, gain, loss, duplication
! Whole genome sequencing
! Methods: DELLY, Meerkat, BreakDancer, CREST, CNVnator, CONSERTING, and many others

Aneuploidy — Common feature of cancer cells
MCF-7 http://www.path.cam.ac.uk/~pawefish/
! Allele-specific copy number (ASCN) tools
• ABSOLUTE, ASCAT, Patchwork
! SVs can further modify the aneuploid cancer genome into a mixture of genomic segments with an extensive range of CNAs
! We need methods that combine SV and ASCN
! How do SVs interact with ASCNs? How do different SVs interact with each other?
[Figure: (a) purity, ploidy, and whole-genome doubling (WGD) status of samples across cancer lineages (LUAD, LUSC, HNSC, KIRC, BRCA, BLCA, CRC, UCEC, GBM, OV), with the percent of samples with WGD per lineage and samples classed as near diploid, 1 WGD, or 2+ WGD; (b) amplification and deletion profiles for near-diploid and WGD samples.]
Zack et al. Nature Genetics 2013

Goal — Quantify allele-specific SVs

Weaver — algorithm overview
Input: BAM file, SV list, SNP list, 1KGP haplotypes
Intermediate quantities shown: coverage from read mapping, mappability, GC content, Cancer Genome Graph, SNP linkage, SNP LD
Model: Probabilistic Graphical Model (Markov Random Field)
Output: Purity, ASCNG, ASCNS, Timing of SV, Phasing
[Figure: (A) two hypothetical chromosomes, chrA and chrB, divided into regions R1 to R21 and joined by SVs labeled m, n, p, q, s, t (deletion, duplication, intra-chromosomal and inter-chromosomal links); (B) the corresponding Cancer Genome Graph; (C) the MRF representation.]
Li et al. Cell Systems 2016

Output: Purity, ASCNG, ASCNS, Timing of SV, Phasing
[Figure: coverage track (0 to 142) over chr9, about 21,850,000 to 22,250,000 (100 kb scale), spanning MTAP, C9orf53, CDKN2A, CDKN2B-AS1, and CDKN2B, annotated with the inferred order of events: LOH and first amplification, deletion (Del1, Del2), second amplification. Panel title: ASCNS and Timing of SV.]
Figure 1: (A) Schema diagram for Weaver. Dark green boxes show the different types of analyses unique to Weaver that are not dealt with by other methods, while light green ones show 'by-products' of Weaver shown to have an improvement over existing methods. (B) An example demonstrating a Weaver output focused on ASCNS and Timing of SV. Dark blue segments (two copies) and light blue segment (one copy) represent a portion of the MCF-7 genome that originated from the same allele on chr9. The other allele was lost during tumorigenesis, resulting in LOH. The predicted evolution of this region [...]

! MRF:
• genome node, cancer node, genome edge, cancer edge
Cancer Genome Graph and MRF representation
Figure 5: (A) Hypothetical cancer chromosomes with rearrangement structure hidden. Orange and blue segments represent paternal/maternal allele. Red dashed line represent linkages by SVs. (B) The Cancer Genome Graph, constructed from (A), with nodes (boxes) representing genomic regions and edges representing reference (solid lines) or cancer (dashed [...]
[Figure panels: (A) chrA and chrB with pre- and post-SV timing; (B) Cancer Genome Graph over regions R1 to R21 with SV nodes Rm, Rn, Rp, Rq, Rs, Rt; (C) MRF representation with merged regions such as R(3,4), R(5,6), R(7,9), R(12,14), R(19,20) and messages such as m_{q→(12,14)} and m_{(12,14)→s}; (D) inputs; (E) outputs.]
𝜇0 = 0; 𝜇1 = 1; b = 10

SVs, inputs (D) and outputs (E):
label  L_pos  R_pos  |  L_allele  R_allele  CN
m      +R2    -R2    |  2         2         1
n      +R6    -R10   |  1         1         1
p      +R4    -R16   |  2         2         1
q      -R12   -R21   |  2         2         1
s      +R14   +R18   |  2         2         1
t      +R16   -R18   |  1         1         2

Genomic regions, inputs (D) and outputs (E):
label  cov  allele_freq  |  CN_1  CN_2
R1     30   0.33         |  2     1
R2     40   0.5          |  2     2
R5     20   0            |  2     0
R7     10   0            |  1     0
R12    20   0.5          |  1     1
R17    10   0            |  0     1
Excerpt from the paper (two clipped columns rejoined; [...] marks text lost on the slide):
[...] 2 million from Weaver based on various datasets, depending on the size of [...] heterozygosity (LOH) regions and the number of SVs. The rationale behind the segmentation step with SV [...] is known that most of the time ASCNG boundaries coincide with SV breakpoints [14]. Our segmentation approach using SV boundaries has the advantage to provide base-level ASCNG boundaries as compared to existing segmentation methods in copy number analysis, which typically use fixed segmentation size.
Given the segmentation of the genome and SV set C, we then build Cancer Genome Graph G := {R, E} (Fig. 5(B)), with nodes representing genomic region sets (R) and edges representing reference adjacencies (E_r) (solid lines in the figure) if two nodes are adjacent in the normal genome [...] adjacencies (E_c) (dashed lines in the figure) if two nodes are adjacent in the cancer genome by SV c [...] configurations E between node R_i and R_j can be represented as (σ_i R_i ~ σ_j R_j), σ ∈ {+, -}, with + and - representing the tail (right) and head (left) of a given genomic region R, e.g., (+R_i ~ -R_{i+1}) ∈ E_r, if R_i and R_{i+1} are adjacent regions from the same chromosome in the normal genome. (The sign symbol is lost in the extraction; σ is used here.)
We then convert the original Cancer Genome Graph G := {R, E} into a Markov Random Field (MRF, M := {R, R_c, E_r, E_c}), which is a widely used probabilistic graphical model to e[...]nt probabilities. The MRF can be viewed as undirected graph and the aggregated inference problem in Weaver given sequencing data can be viewed as a maximum a posteriori (MAP) problem with hidden states and ob[...] explained in the following sections. Unlike conventional methods for estimating copy number changes [...] hidden Markov models (HMMs), which are designed for sequential data and only consider the dependencies between 'local' variables, [...] MAP solution of MRF model provides the most probable configuration of aneuploid cancer genomes with complex SVs, involving 'global' variable dependencies defined by long-range SVs. Detailed steps are described in Supplementary Note 6. In the following sections, we describe hidden states, [...]ions, and formal function of the MRF MAP problem. Details on potential functions on nodes and edges are described in the Supplementary Note.
Phasing scores (excerpt):
(i) The LD score for the phasing $G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b$:
$$ LD(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \frac{N_{LD}(G_i^a, G_{i+1}^a) \times N_{LD}(G_i^b, G_{i+1}^b)}{N_{LD}(G_i^a, G_{i+1}^a) \times N_{LD}(G_i^b, G_{i+1}^b) + N_{LD}(G_i^a, G_{i+1}^b) \times N_{LD}(G_i^b, G_{i+1}^a)} $$
where $N_{LD}(G_i^a, G_{i+1}^a)$ is the number of phased haplotypes (total number 1092 × 2 in phase 1) in 1KGP with genotype $(G_i^a, G_{i+1}^a)$. Other genotype configurations can be similarly calculated.
(ii) Similarly, we define the read linkage score for the phasing $G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b$ as:
$$ RL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \frac{N_{RL}(G_i^a, G_{i+1}^a) + N_{RL}(G_i^b, G_{i+1}^b)}{N_{RL}(R_i, R_{i+1})} $$
where $N_{RL}(R_i, R_{i+1})$ is the total number of reads covering genomic regions $(R_i, R_{i+1})$ and $N_{RL}(G_i^a, G_{i+1}^a)$ is the total number of reads covering $(G_i^a, G_{i+1}^a)$. If there are no reads covering $(R_i, R_{i+1})$ ($N_{RL}(R_i, R_{i+1}) = 0$), RL = 0
Therefore, we define genotype linkage as
$$ GL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) = \log\big( LD(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) \cdot RL(G_i^a, G_{i+1}^a / G_i^b, G_{i+1}^b) \big) $$
In real data application, we have found that RL and LD correlate very well. For example, in the MCF-7 analysis when we chose SNP pairs with 100% RL support as gold standard, we found AUC = 0.9964 using LD scores.
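
The three phasing scores in this excerpt reduce to simple arithmetic on counts. A minimal sketch follows, assuming invented haplotype and read counts; the function names are hypothetical, and returning negative infinity when the product of LD and RL is zero, or an LD score of zero when no haplotype supports either configuration, are choices made for this example that the excerpt does not specify.

```python
import math

def ld_score(n_aa, n_bb, n_ab, n_ba):
    """LD score: support for phasing (a,a)/(b,b) against (a,b)/(b,a).

    Each argument is the number of reference-panel haplotypes carrying that
    allele pair at two adjacent SNPs.
    """
    cis = n_aa * n_bb
    trans = n_ab * n_ba
    if cis + trans == 0:
        return 0.0  # no haplotype evidence either way (assumption)
    return cis / (cis + trans)

def rl_score(reads_aa, reads_bb, reads_total):
    """Read linkage score: fraction of spanning reads supporting the phasing."""
    if reads_total == 0:
        return 0.0  # no reads cover both sites, as stated in the excerpt
    return (reads_aa + reads_bb) / reads_total

def genotype_linkage(ld, rl):
    """GL = log(LD * RL); a zero product maps to -inf (assumption)."""
    prod = ld * rl
    return math.log(prod) if prod > 0 else float("-inf")

# Made-up counts: 1092 * 2 = 2184 panel haplotypes split over four allele pairs.
ld = ld_score(n_aa=1000, n_bb=1100, n_ab=44, n_ba=40)
rl = rl_score(reads_aa=18, reads_bb=20, reads_total=40)
gl = genotype_linkage(ld, rl)
```

A GL value close to zero means both the reference panel and the reads agree on the phasing, while strongly negative values flag SNP pairs whose phase is poorly supported.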
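Markov random field model M
After we convert G into MRF M using steps in Supplementary Note 6, the MRF MAP problem is given by
$$ \hat{H} = \arg\max_{H} \Big\{ \sum_{i \in R} \Theta_R(O \mid H_i) + \sum_{c \in C} \Theta_C(O \mid H_c) + \sum_{i \in R} \Psi_R(O \mid H_i, H_{i+1}) + \sum_{c \in C} \sum_{i \in N(c)} \Psi_C(H_i, H_c) \Big\} $$
The four terms are, in order: the genome node potential function, the cancer node potential function, the genome edge potential function, and the cancer edge potential function. (The symbol for the two edge potentials is lost in the extraction; Ψ is used here.)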

ASCN and SVs in MCF-7
! 83% of SVs have copy number > 1
! 68% of the regions have imbalanced copy number
! We found 276 SVs that occurred after whole-chromosome duplication
! We used physical mapping to validate the results

ASCN and SVs in HeLa
! WGS reads obtained from Adey et al. Nature 2013
! ASCNG are 97% consistent with Adey et al. (Fosmid seq)
Excerpt from Adey et al.: Structural variants were identified by clustering discordantly mapped reads from 40-kb and 3-kb mate-pair libraries (Supplementary Fig. 8). Twenty interchromosomal links were identified, including links for marker chromosomes M11 (9q33-11p14) and M14 (13q21-19p13). In addition, 209 HeLa-specific deletions and 8 inversions were found (Supplementary Figs 9 and 11, and Supplementary Table 10). Only two genes that are impacted by HeLa-specific structural rearrangements (Supplementary Table 11) intersected with SCGC (STK11 (ref. 18), FHIT), both of which are recurrently deleted in cervical [...]
[Figure: (a) HeLa karyotype overview for chromosomes 1 to 22 and X, marking linked positions, marker chromosome names (for example M1, M2, M4, M7, M10, M11, M14, M18, M25), links supported by sequence data, suspected haplotype (A or B), tandem duplications, probable contiguity, LOH regions, and the HPV integration locus; (b) window ratios and copy-number calls along chr18 for the S3 and CCL-2 strains, with S3-specific differences.]
Adey et al. Nature, 2013

Application to TCGA Data
! Inter-chromosomal chromothripsis
! Breakage-fusion-bridge amplifications
Supplementary Figure 14: (A) Overview of the genomic landscape of a TCGA ovarian cancer sample (TCGA-36-1571). (B) Most of SVs linking chr4 and chr22 have copy number one. (C) Three high-coverage fold-back inversions are observed at boundaries of highly amplified region of chr14, indicating many rounds of BFB cycles happened. Gene FOXG1 is on the fold-back inversion boundary and highly amplified.
[Figure: coverage scale from 1X to 62X; panels show chr4, chr22, chr14, and chr6, with FOXG1 marked.]
