Jillian ms defense-4-14-14-ja-novideo

Bacterial Gene Neighborhood Investigation Environment:
A Scalable Genome Visualization for Big Displays

Jillian Aurisano
Master of Science Defense
April 16, 2014

Science has historically looked like this:
Up until very recently
“Observations!”
Expertise
Explore
Collect samples
Catch errors
“No one looks under a microscope anymore. It's all DNA.”

How do scientists make discoveries?

How do we bring experts into the loop?
•  From direct collection of data, direct observation of results, direct interpretation and analysis
•  To automated data collection, automated filtering and automated analysis
•  Need visualization to bring experts into the loop
•  But how do we handle big data?
•  What’s our Big Data microscope?

“Picard: Computer; scan everything, run diagnostics, and tell us the answer.”
“Computer: Results are inconclusive”

Can Big Displays help?
•  Evidence suggests that these environments can have a positive impact on perception and cognition
•  But how do we use them to effectively address big data problems?
•  Can existing visualizations simply be ‘scaled-up’ to fit or are new approaches needed?

In this thesis I will…
Examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics
I worked closely over several years with a team of computational biologists
This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays
BactoGeNIE (‘Bact(o)erial Gene Neighborhood Investigation Environment’)

Outline
1)  Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop
2)  Examine potential impact of Big Displays on Big Data visualization
3)  Evaluate scalability in existing comparative genomics visualizations
My work: BactoGeNIE
4/5/6)  Describe my design, implementation, results
7)  Think about the future
In the process, learn something about scaling up visual approaches to big data and big displays

Warning: Biology is used in this thesis!

Genome sequencing boom
•  Sequencing costs decreasing faster than Moore’s Law
•  So, we are able to produce massive volumes of sequence data
•  Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences
Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012 <www.genome.gov/sequencingcosts>

What is a genome? What is a gene?
•  Genomes consist of one or more long molecules of ‘DNA’
•  DNA consists of chained nucleotide molecules (A, C, T, G), also called ‘base pairs’
•  All the genes in an organism are in its ‘genome’
•  Genes determine traits in an organism
•  Genes ‘code’ for proteins, and proteins do the work to make traits happen

How are genomes sequenced?
•  Sequencing
•  Assembly
•  Annotation
•  Output:
    – Genome feature files
    – Raw sequence files
Michael Schatz, Cold Spring Harbor

Lots of genome sequences -> opportunity
Big challenge: Hard to figure out what a novel gene does
•  Traditionally: do wet-lab research to figure out
    –  but expensive, time-consuming
•  Sequence the gene, and use computational methods to predict the function of the protein
    –  If novel gene, may not provide answer
•  Can complete genome sequences help?
•  Comparative gene neighborhood analysis

From genome structure to gene-product function
•  In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome.
•  Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors
•  But, need to examine lots of genomes for statistical significance?
[Figure: neighboring genes gene1, gene2, gene3, gene4 and an unknown (‘?’) biological process]

Comparing gene neighborhoods across different genomes
•  Genes with similar sequences likely produce proteins with similar functions
•  Orthologs: similar genes from different genomes
•  Algorithms to compare genes between different genomes
DeMeo et al. BMC Molecular Biology 2008 9:2 doi: 10.1186/1471-2199-9-2

Role for visualization in this problem
•  Why not use automated methods to find common sets of genes around gene targets?
•  Why visualization?
•  3 E’s: Exploration, Expertise, Errors

Automated methods:
Target: gene B
Common subsequences:
Strains 1, 2, 3: {A, B, C, D}
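
The automated approach on this slide can be sketched in a few lines. This is a minimal Python illustration and not the method of any particular tool: it assumes each strain is given as an ordered list of ortholog labels, and the function name `common_neighborhood` and the `window` parameter are hypothetical.

```python
def common_neighborhood(strains, target, window=2):
    """Return the genes found within `window` positions of `target` in every strain.

    `strains` maps a strain name to an ordered list of ortholog labels
    (an assumed, simplified input format).
    """
    shared = None
    for genes in strains.values():
        if target not in genes:
            continue  # strains lacking the target are skipped
        i = genes.index(target)
        nearby = set(genes[max(0, i - window): i + window + 1])
        shared = nearby if shared is None else shared & nearby
    return shared or set()


strains = {
    "Strain 1": ["A", "B", "C", "D"],
    "Strain 2": ["A", "B", "C", "D"],
    "Strain 3": ["A", "B", "C", "D"],
}
result = common_neighborhood(strains, "B")
```

A set like `result` records which genes co-occur near the target, but not their order, orientation, or copy number, which is where a visual inspection can add information.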

Exploration
•  Patterns and anomalies without knowing in advance what you are looking for
[Figure: gene neighborhoods around target gene B in Strains 1, 2 and 3, for which automated methods report {A, B, C, D}; four panels show Duplication, Truncation, Deletion and Inversion]

Expertise
•  Experts make connections that will be missed by automated methods
    – Not just the anomaly, but significance of the anomaly
    – Knowledge about strains, protein families involved in finding significant anomalies
[Figure: StrainA, StrainB, StrainC]

Errors
•  Verify automated methods
•  Uncertainty and errors in data generation
[Figure: two panels (breaks in assembly; missed gene boundaries) comparing the data with the ground truth for Strains 1, 2 and 3. Automated methods report Strains 1 and 3: {A, B, C, D} in both panels, and Strain 2: {A, D} in one panel and {A, B} in the other]

To address this problem:
•  Visualization must help bring experts into the data mining loop
    1)  Helps experts identify sources of error
    2)  Allows experts to explore the data
    3)  Enables researchers to integrate expertise in data analysis
So: overview visualization not enough. Need gene-neighborhood details
•  Visualization must scale to enable comparisons between hundreds to thousands of genomes

Big displays: Opportunity for big data?
•  The question is: can these environments be used to visualize big data sets better?
•  Evidence suggests yes:
    –  Physical navigation over virtual navigation
        •  Reduced need to pan and zoom
        •  Reduced need for context switching
        •  Utilize embodied cognition
        •  Multiple levels of detail accessible through physical movement
    –  Externalize more information that can be accessed simultaneously
Lance Long

Porting from small to big displays
•  Maybe porting genome visualizations to these environments is sufficient?
•  Ruddle2013:
    –  Export high-resolution graphical output from existing genomics visualizations
    –  Display these large images on big display
    –  Evidence that this had a positive impact on researcher reasoning
•  However, effective visualization on big displays involves more than simply scaling up the representation

Pixel-Density Scalability
•  As pixel-density increases, does a visual approach take advantage of increased pixels-per-inch to show more entities, relationships or to show data at higher detail?
Evaluation:
•  High-Density Representation?
    •  Use increased pixels per inch to show more entities and relationships at higher detail?
•  Simultaneous detail and overview?
    •  With increased pixel density, representation shows details and overviews at the same time, without relying on Focus+Context

Display-Size Scalability
•  As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships?
Evaluation
•  Encode big data spatially
•  Cluster related elements:
    •  spatial memory
    •  direct, visual comparisons
•  Physical navigation over virtual navigation:
    •  Overviews at a distance, details up-close

Perceptual and Analytic Task Scalability
•  Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels?
•  Does perception suffer if you scale the approach up?
    •  Analytic tasks performed pre-attentively
    •  Analytic tasks aided by visual queries
    •  Aids to visual search for performing analytic tasks

Examining current genomic data visualizations
•  Does it address this problem?
    •  Show gene neighborhoods
    •  Comparative
•  Does this visualization allow comparison between more than a few gene neighborhoods?
•  If you scale the visual approach up, does it:
    •  Allow more comparisons of gene neighborhoods (Analytic Task Scalability)
    •  Take advantage of big displays in size and pixel-density (Display Resolution Scalability and Display Size Scalability)
    •  In the process, remain sensible to a human viewer (Perceptual scalability)

Line-based comparative approaches
•  On load, align 1-2 genes to a chosen gene in a reference genome
•  Draw a line or a band to connect orthologs
•  In many cases, repurpose genome browsers to be comparative by adding comparative track
•  Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve
Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England).
McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons

Line-based approaches expanded: Mauve
•  Like parallel coordinates
•  Draw lines between orthologs
•  Color genes by their block with that genome (not colored by orthology)
•  Example shows 9 genomes
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140

Line-based approaches: Critique
•  Pixel-density scalable?
    –  Not a high-density representation
    –  Need space for the ‘comparative track’
•  Display size scalable?
    –  Hard to follow lines across a display
    –  Hard to compare similar neighborhoods across the display
    –  No overview from a distance, details up close
•  Perceptual scalability for comparing gene neighborhoods?
    –  Lots of visual clutter
    –  Comparisons not pre-attentive
    –  No aid to visual search
•  Number of genomes
    –  Published up to 9
    –  Private groups have adapted frameworks for 10-50 genomes on big display
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140

PSAT: Color and alignment
•  PSAT
    – Orthologs encoded using color
    – Strand on which gene is positioned is encoded by orientation to the center line
    – Text is given by default
Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.

PSAT: Critique
•  Pixel-Density Scalability
    – Not high-density representation because of text labels
•  Perceptual scalability for comparing gene neighborhoods?
    – Can’t scale to large number of genes: not enough colors
Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.

GeneRiViT: Alignment and color
•  GeneRiViT
    –  Align against arbitrary gene
    –  Color by presence/absence
    –  Examples show 4 genomes
    –  Critique:
        •  No discussion of scalability
        •  Overview visualization
        •  Doesn’t address our problem
Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.

Dot plots
•  Coordinates of genes in two genomes are used as x and y axes
•  Orthologous genes in other genomes are plotted
•  Each genome given a unique color
•  Critique:
    –  Doesn’t provide ‘gene-neighborhood’ view
    –  Overview tool
    –  Hard to follow beyond a few genomes
Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
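
As a rough illustration of how a dot plot is built, the sketch below pairs the coordinates of genes that share an ortholog label in two genomes. It is a simplified Python example, not Gene-RiViT's implementation; the input format (ortholog label mapped to a start coordinate) and the name `dot_plot_points` are assumptions made for the example.

```python
def dot_plot_points(genome_x, genome_y):
    """Return (x, y) points for genes whose ortholog label occurs in both genomes.

    Each genome is an assumed mapping from ortholog label to gene start coordinate.
    """
    points = []
    for label, x in genome_x.items():
        if label in genome_y:
            points.append((x, genome_y[label]))
    return sorted(points)


genome_x = {"A": 100, "B": 900, "C": 2100, "D": 3000}
genome_y = {"A": 150, "C": 2500, "B": 1200, "E": 4000}
points = dot_plot_points(genome_x, genome_y)
```

Genes present in only one genome (`D` and `E` here) produce no point, and each additional genome adds another colored point set on the same axes, which is why the plot becomes hard to follow beyond a few genomes.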

Overview Visualization: Sequence Surveyor
•  Not this domain problem, but interesting approach
•  Each gene is drawn as a rectangle
•  Several possible variables for position: Ordinal position
•  Several possible variables for color:
    –  Position in one reference genome
    –  Use a color ramp, for wide range of colors
Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.

Overview Visualization: Sequence Surveyor
•  Pixel-density scalable
    –  High-density representation
    –  High-detail representation
•  Display size scalability
    –  May be difficult to compare patterns from one side of display to another
•  Perceptual Scalability
    –  Colors allow for pre-attentive identification of patterns
    –  Avoids visual clutter
Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.

Copy number variations on big displays
•  Orchestral:
    –  Visualization of a different data type
    –  Effective use of color to enable pre-attentive identification of similarities across genomes
    –  High-density representation
    –  Details-up-close, overview from a distance
Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.

BactoGeNIE Demo
•  Video at: https://www.youtube.com/watch?v=yrSyi1RWcUw

Program details
•  Implemented in C++ using Qt and the QGraphicsView framework
•  Upload:
    –  genome feature files
    –  FASTA files (raw gene sequences)
•  Cd-hit algorithm processes sequence files to compute ortholog ‘clusters’
•  MySQL database to store big datasets
    –  Loads 1000 contigs into memory, rest stored in database
•  Optimized for PubMed datasets
•  Prototyped on E. coli draft genomes
    –  Capable of displaying any contigs from thousands of E. coli draft genomes
•  On EVL Cyber-commons wall, around 400 contigs in view
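
The idea of holding a fixed number of contigs in memory and leaving the rest in the database can be sketched as a bounded cache. The slide does not say how BactoGeNIE decides which contigs stay in memory, so the least-recently-used eviction below is an assumption, as are the `ContigCache` class and its `fetch` callback; the sketch is in Python rather than the tool's C++/Qt.

```python
from collections import OrderedDict


class ContigCache:
    """Keep at most `capacity` contigs in memory; load others on demand.

    `fetch` is a hypothetical callback standing in for a database query.
    Eviction is least-recently-used, which is an assumption of this sketch.
    """

    def __init__(self, fetch, capacity=1000):
        self._fetch = fetch
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, contig_id):
        if contig_id in self._items:
            self._items.move_to_end(contig_id)  # mark as recently used
            return self._items[contig_id]
        contig = self._fetch(contig_id)  # cache miss: go to the backing store
        self._items[contig_id] = contig
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used
        return contig

    def __len__(self):
        return len(self._items)
```

In use, `fetch` would wrap whatever query returns one contig's genes, and `capacity` would be set to the number of contigs that fit comfortably in memory (1000 on this slide).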

BactoGeNIE: High density representation
•  Compressed genome encoding
•  No text labels, instead ‘on-demand’
•  No ‘comparative track’
•  Encode orthology using
    –  User applied color: pre-attentive orthology identification
    –  Coordinated highlighting: scalable visual query
    –  Alignment: use space to encode similarity

Use space to encode similarity
•  Goals:
    –  Make it easier to perform comparisons across many genomes (Analytic task scalability)
    –  Accommodate increased display size (Display Size Scalability)
    –  Make similarities and differences easy to see (Perceptual Scalability)
•  Sorting and Alignment
    –  Sort by contig length
    –  Sort by gene content
    –  Dynamically align against any gene
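
The three sorting and alignment operations can be illustrated with a small Python sketch. It is not BactoGeNIE's code: contigs are assumed to be ordered lists of ortholog labels, contig length is approximated by gene count rather than base pairs, and all function names are hypothetical.

```python
def sort_by_length(contigs):
    """Contig names ordered from longest to shortest (length = gene count here)."""
    return sorted(contigs, key=lambda name: len(contigs[name]), reverse=True)


def sort_by_gene_content(contigs, genes_of_interest):
    """Contig names ordered by how many of the genes of interest they contain."""
    wanted = set(genes_of_interest)
    return sorted(contigs, key=lambda name: len(wanted & set(contigs[name])),
                  reverse=True)


def alignment_offsets(contigs, gene):
    """Horizontal shift (in gene slots) that puts `gene` in the same column
    for every contig that contains it."""
    positions = {name: genes.index(gene)
                 for name, genes in contigs.items() if gene in genes}
    if not positions:
        return {}
    column = max(positions.values())
    return {name: column - pos for name, pos in positions.items()}


contigs = {
    "c1": ["A", "B", "C", "D"],
    "c2": ["B", "C"],
    "c3": ["X", "A", "B", "C", "D"],
}
by_length = sort_by_length(contigs)
by_content = sort_by_gene_content(contigs, ["A", "X"])
offsets = alignment_offsets(contigs, "B")
```

Shifting each contig by its offset lines up every copy of the chosen gene in one column, so similar neighborhoods sit directly above one another and can be compared by eye.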

Interactivity
•  On hovering, contig expands in height, so easier to select genes of interest in high-density view
•  ‘Pop-up’ menu for each gene that gives info and allows for:
    –  application of color:
        •  ‘tagging’ operation
        •  Scalable query
    –  “targeting” operation (described next)
•  User can sort genomes by:
    –  Gene target
    –  Contig length

‘Gene Targeting’ Function to create high resolution, comparative ‘maps’
•  User selects a gene of interest
•  This gene is given a base color
•  Two color ramps are applied to adjacent genes, one ‘upstream’ and one ‘downstream’
•  Orthologous genes in related genomes are given the same colors
•  Contigs containing this gene are brought to the top
•  The target gene is centered
•  Orthologs are aligned to the target
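
The coloring and reordering steps listed above can be sketched as follows. This Python example is a simplified illustration, not the BactoGeNIE implementation: contigs are assumed to be ordered lists of ortholog labels, strand orientation is ignored (lower index is treated as ‘upstream’), colors are represented as signed ramp indices, and the names `gene_targeting`, `reference` and `reach` are hypothetical. Centering and alignment are left out.

```python
def gene_targeting(contigs, reference, target, reach=3):
    """Return (contig order, ramp index per ortholog label).

    Index 0 is the target's base color; negative indices step along the
    'upstream' ramp and positive indices along the 'downstream' ramp, both
    defined on the `reference` contig where the gene was selected.
    """
    genes = contigs[reference]
    i = genes.index(target)
    colors = {target: 0}
    for step in range(1, reach + 1):
        if i - step >= 0:
            colors.setdefault(genes[i - step], -step)
        if i + step < len(genes):
            colors.setdefault(genes[i + step], step)
    # Contigs containing the target are brought to the top, others follow.
    hits = [name for name, g in contigs.items() if target in g]
    rest = [name for name, g in contigs.items() if target not in g]
    return hits + rest, colors


contigs = {
    "c1": ["X", "Y"],
    "c2": ["A", "B", "C", "D"],
    "c3": ["C", "B", "E"],
}
order, colors = gene_targeting(contigs, reference="c2", target="B", reach=2)
```

Because colors are keyed by ortholog label, gene `C` in `c3` gets the same ramp color as `C` in `c2` even though it sits on the other side of the target there, which is the kind of rearrangement the colored map is meant to expose. Gene `E` has no ortholog in the reference neighborhood and stays uncolored.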

Gene targeting function
•  Clustering to promote direct comparisons
•  Overviews at a distance
•  Details up close
•  Pre-attentive identification of similarities and differences between gene neighborhoods
Lance Long

Pixel-density Scalability
BactoGeNIE fits the pixel-density scalability criteria: High-density data display, identifier display and orthology encoding

Display Size Scalability
•  BactoGeNIE is the only approach to use clustering and show multiple levels of detail

Perceptual Scalability and Analytic Tasks
BactoGeNIE:
•  Similarity is pre-attentively accessible
•  Avoids visual clutter
•  Visual query for orthologs

Graphical Scalability: Display Resolution vs Number of Genomes
[Chart: axes labeled Pixels (480, 720, 1080, 1440, 2160, 2880, 3240, 4320) and Genomes (0 to 1000); series for BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco and Mauve]
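
The relationship between vertical resolution and the number of genome rows that fit on screen can be approximated with simple arithmetic. The Python sketch below is an illustration only: the row height and gap are hypothetical values, not measurements from BactoGeNIE or from the chart, and the function name `rows_in_view` is made up for the example.

```python
def rows_in_view(vertical_pixels, row_height, gap=0):
    """Upper bound on rows that fit: n rows need n*row_height + (n-1)*gap pixels."""
    if row_height <= 0 or gap < 0:
        raise ValueError("row_height must be positive and gap non-negative")
    return (vertical_pixels + gap) // (row_height + gap)


# Hypothetical 4-pixel rows separated by a 1-pixel gap.
rows_1080 = rows_in_view(1080, row_height=4, gap=1)
rows_4320 = rows_in_view(4320, row_height=4, gap=1)
```

Under these assumed sizes the row count grows in proportion to the vertical resolution, whereas a representation that needs text labels or a comparative track per genome has a much larger effective row height and so a much lower ceiling.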

Preliminary User Feedback
•  A version of BactoGeNIE was used by a computational biology team on a tiled display wall (NxN pixels, MxM inches)
•  “This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes”
•  “Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest.
•  This tool is really unique and it’s the only tool that I am aware of that can scale up to any number of genome comparisons.
•  The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns.
•  This tool has a potential to serve both for visualization as well as data mining needs.”
Usage of a version without the gene targeting approach.
Future study will concentrate on this feature with a wider community of users

Summary of contributions
•  A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view
•  First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color and color ramps
•  First to show overviews with gene neighborhood-details, that can be accessed through physical movement
•  Introduces a novel visualization approach ‘gene targeting’ that translates genomic data into high-resolution genomic maps

What’s next?
Design
•  Integration with different levels of detail
•  Multiple color ramps
•  Advanced ordering in y, based on similarity to target or strain phylogeny
Implementation
•  Scalability in rendering using parallelization on the GPU
•  Port to SAGE
Evaluation
•  User studies and evaluations of perceptual scalability

Scalable Design, Big Data, Big Displays
•  Need visualization to provide an interface between automated analysis and the expert
•  Porting existing visual approaches to big data and big displays will not always work
•  Need to design for increased
    – pixel-density
    – display size
    – volume of analytical tasks

Thanks!
•  Acknowledgements:
    – Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory
    – Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin and the rest of the computational biology team at Monsanto
•  Questions?
