The document discusses the use of visualizations to analyze large datasets of bacterial gene sequences and neighborhoods in order to help experts identify patterns and functions. It outlines the development of "BactoGeNIE", a new visualization approach created by the author to scale to big data and large displays for the purpose of comparative bacterial gene neighborhood analysis. The visualization is intended to bring experts into the process of exploring gene neighborhoods across many genomes to gain insights that could not be found through automated methods alone.
Jillian ms defense-4-14-14-ja-novideo
1. Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays
Jillian Aurisano
Master of Science Defense
April 16, 2014
3. Up until very recently
“Observations!”
Expertise
Explore
Collect samples
Catch errors
4. “No one looks under a microscope anymore. It's all DNA.”
How do scientists make discoveries?
5. How do we bring experts into the loop?
• From direct collection of data, direct observation of results, direct interpretation and analysis
• To automated data collection, automated filtering and automated analysis
• Need visualization to bring experts into the loop
• But how do we handle big data?
• What’s our Big Data microscope?
“Picard: Computer; scan everything, run diagnostics, and tell us the answer.”
“Computer: Results are inconclusive”
6. Can Big Displays help?
• Evidence suggests that these environments can have a positive impact on perception and cognition
• But how do we use them to effectively address big data problems?
• Can existing visualizations simply be ‘scaled-up’ to fit or are new approaches needed?
7. In this thesis I will…
Examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics
I worked closely over several years with a team of computational biologists
This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays
BactoGeNIE (‘Bact(o)erial Gene Neighborhood Investigation Environment’)
8. Outline
1) Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop
2) Examine potential impact of Big Displays on Big Data visualization
3) Evaluate scalability in existing comparative genomics visualizations
My work: BactoGeNIE
4/5/6) Describe my design, implementation, results
7) Think about the future
In the process, learn something about scaling up visual approaches to big data and big displays
10. Genome sequencing boom
• Sequencing costs decreasing faster than Moore’s Law
• So, we are able to produce massive volumes of sequence data
• Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences
Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012 <www.genome.gov/sequencingcosts>
11. What is a genome? What is a gene?
• Genomes consist of one or more long molecules of ‘DNA’
• DNA consists of chained nucleotide molecules (A, C, T, G) also called ‘base pairs’
• All the genes in an organism are in its ‘genome’
• Genes determine traits in an organism
• Genes ‘code’ for proteins, and proteins do the work to make traits happen
12. How are genomes sequenced?
• Sequencing
• Assembly
• Annotation
• Output:
– Genome feature files
– Raw sequence files
Michael Schatz, Cold Spring Harbor
13. Lots of genome sequences -> opportunity
Big challenge: Hard to figure out what a novel gene does
• Traditionally: do wet-lab research to figure out
– but expensive, time-consuming
• Sequence the gene, and use computational methods to predict the function of the protein
– If novel gene, may not provide answer
• Can complete genome sequences help?
• Comparative gene neighborhood analysis
14. From genome structure to gene-product function
• In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome.
• Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors
• But, need to examine lots of genomes for statistical significance?
[Figure: neighboring genes gene1, gene2, gene3, gene4 linked to a biological process; one link marked “?”]
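The idea of "commonly recurring gene neighbors" can be made concrete with a small sketch. It is only an illustration, not the method used in the thesis: the genome contents, the `window` parameter and the function name `neighbor_counts` are all hypothetical, and each genome is reduced to an ordered list of gene-family labels.

```python
from collections import Counter

def neighbor_counts(genomes, target, window=2):
    """Count, per gene family, how many genomes place it within
    `window` positions of the target gene."""
    counts = Counter()
    for genes in genomes.values():
        near = set()
        for i, gene in enumerate(genes):
            if gene == target:
                lo = max(0, i - window)
                # Genes just before and just after this copy of the target.
                near.update(genes[lo:i] + genes[i + 1:i + 1 + window])
        near.discard(target)
        counts.update(near)  # each genome contributes at most 1 per family
    return counts

# Hypothetical toy data: ordered gene-family labels per strain.
genomes = {
    "strain1": ["A", "B", "C", "D", "E"],
    "strain2": ["X", "A", "B", "C", "Y"],
    "strain3": ["B", "C", "Q", "A"],
}
counts = neighbor_counts(genomes, target="B", window=1)
```

With only three genomes a count like this says little, which is the slide's point: many genomes are needed before a recurring neighbor becomes convincing.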
15. Comparing gene neighborhoods across different genomes
• Genes with similar sequences likely produce proteins with similar functions
• Orthologs: similar genes from different genomes
• Algorithms to compare genes between different genomes
DeMeo et al. BMC Molecular Biology 2008 9:2 doi: 10.1186/1471-2199-9-2
16. Role for visualization in this problem
• Why not use automated methods to find common sets of genes around gene targets?
• Why visualization?
• 3 E’s: Exploration, Expertise, Errors
Automated methods:
Target: gene B
Common subsequences:
Strains 1, 2, 3: {A, B, C, D}
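A minimal sketch of why a summary such as "Strains 1, 2, 3: {A, B, C, D}" can hide structure. The strain contents below are hypothetical, and the set intersection stands in for whatever automated method is actually used; it is not taken from the thesis.

```python
# Hypothetical neighborhoods: ordered gene-family labels per strain.
strains = {
    "Strain 1": ["A", "B", "C", "D"],
    "Strain 2": ["A", "B", "B", "C", "D"],  # assumed duplication of B
    "Strain 3": ["A", "C", "B", "D"],       # assumed swap of B and C
}

def common_genes(strains):
    """Set-based summary: gene families present in every strain."""
    gene_sets = [set(genes) for genes in strains.values()]
    return set.intersection(*gene_sets)

summary = common_genes(strains)

# The ordered lists differ even though the set summary is identical.
reference = strains["Strain 1"]
flagged = [name for name, genes in strains.items() if genes != reference]
```

Here `summary` is the same set for all three strains, while `flagged` shows that two of them differ from Strain 1 in copy number or order, the kind of difference a viewer would see directly in a neighborhood view.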
17. Exploration
• Patterns and anomalies without knowing in advance what you are looking for
Automated methods:
Target: gene B
Common subsequences:
Strains 1, 2, 3: {A, B, C, D}
[Figure: gene neighborhoods (genes A, B, C, D) in Strains 1, 2 and 3, shown for four kinds of anomaly: Duplication, Truncation, Deletion, Inversion]
18. Expertise
• Experts make connections that will be missed by automated methods
– Not just the anomaly, but significance of the anomaly
– Knowledge about strains, protein families involved in finding significant anomalies
[Figure: StrainA, StrainB, StrainC, with one anomaly flagged “!”]
19. Errors
• Verify automated methods
• Uncertainty and errors in data generation
[Figure: two examples comparing the data with the ground truth for Strains 1, 2 and 3, captioned “Breaks in assembly” and “Missed gene boundaries”]
Automated methods:
Common subsequences:
Strains 1 and 3: {A, B, C, D}
Strain 2: {A, D}
Automated methods:
Common subsequences:
Strains 1 and 3: {A, B, C, D}
Strain 2: {A, B}
20. To address this problem:
• Visualization must help bring experts into the data mining loop
1) Help experts identify sources of error
2) Allow experts to explore the data
3) Enable researchers to integrate expertise in data analysis
So: overview visualization not enough. Need gene-neighborhood details
• Visualization must scale to enable comparisons between hundreds to thousands of genomes
21. Big displays: Opportunity for big data?
• The question is: can these environments be used to visualize big data sets better?
• Evidence suggests yes:
– Physical navigation over virtual navigation
• Reduced need to pan and zoom
• Reduced need for context switching
• Utilize embodied cognition
• Multiple levels of detail accessible through physical movement
– Externalize more information that can be accessed simultaneously
Lance Long
22. Porting from small to big displays
• Maybe porting genome visualizations to these environments is sufficient?
• Ruddle2013:
– Export high-resolution graphical output from existing genomics visualizations
– Display these large images on big display
– Evidence that this had a positive impact on researcher reasoning
• However, effective visualization on big displays involves more than simply scaling up the representation
23. Pixel-Density Scalability
• As pixel-density increases, does a visual approach take advantage of increased pixels-per-inch to show more entities, relationships or to show data at higher detail?
Evaluation:
• High-Density Representation?
– Use increased pixels per inch to show more entities and relationships at higher detail?
• Simultaneous detail and overview?
– With increased pixel density, representation shows details and overviews at the same time, without relying on Focus+Context
24. Display-Size Scalability
• As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships?
Evaluation
• Encode big data spatially
• Cluster related elements:
– spatial memory
– direct, visual comparisons
• Physical navigation over virtual navigation:
– Overviews at a distance, details up-close
25. Perceptual and Analytic Task Scalability
• Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels?
• Does perception suffer if you scale the approach up?
• Analytic tasks performed pre-attentively
• Analytic tasks aided by visual queries
• Aids to visual search for performing analytic tasks
26. Examining current genomic data visualizations
• Does it address this problem?
– Show gene neighborhoods
– Comparative
• Does this visualization allow comparison between more than a few gene neighborhoods?
• If you scale the visual approach up, does it:
– Allow more comparisons of gene neighborhoods (Analytic Task Scalability)
– Take advantage of big displays in size and pixel-density (Display Resolution Scalability and Display Size Scalability)
– In the process, remain sensible to a human viewer (Perceptual scalability)
27. Line-based comparative approaches
• On load, align 1-2 genes to a chosen gene in a reference genome
• Draw a line or a band to connect orthologs
• In many cases, repurpose genome browsers to be comparative by adding comparative track
• Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve
Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England).
McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons
28. Line-based approaches expanded: Mauve
• Like parallel coordinates
• Draw lines between orthologs
• Color genes by their block with that genome (not colored by orthology)
• Example shows 9 genomes
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
29. Line-based approaches: Critique
• Pixel-density scalable?
– Not a high-density representation
– Need space for the ‘comparative track’
• Display size scalable?
– Hard to follow lines across a display
– Hard to compare similar neighborhoods across the display
– No overview from a distance, details up close
• Perceptual scalability for comparing gene neighborhoods?
– Lots of visual clutter
– Comparisons not pre-attentive
– No aid to visual search
• Number of genomes
– Published up to 9
– Private groups have adapted frameworks for 10-50 genomes on big display
Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140
30. PSAT: Color and alignment
• PSAT
– Orthologs encoded using color
– Strand on which gene is positioned is encoded by orientation to the center line
– Text is given by default
Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
31. PSAT: Critique
• Pixel-Density Scalability
– Not high-density representation because of text labels
• Perceptual scalability for comparing gene neighborhoods?
– Can’t scale to large number of genes: not enough colors
Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.
32. GeneRiViT: Alignment and color
• GeneRiViT
– Align against arbitrary gene
– Color by presence/absence
– Examples show 4 genomes
– Critique:
• No discussion of scalability
• Overview visualization
• Doesn’t address our problem
Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
33. Dot plots
• Coordinates of genes in two genomes are used as x and y axis
• Orthologous genes in other genomes are plotted
• Each genome given a unique color
• Critique:
– Doesn’t provide ‘gene-neighborhood’ view
– Overview tool
– Hard to follow beyond a few genomes
Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.
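As a rough illustration of the dot-plot idea for a single pair of genomes, the sketch below places one point wherever a gene in the first genome has an ortholog in the second. It assumes each genome is an ordered list of ortholog-family labels and uses the ordinal position of a gene as its coordinate; real tools typically use base-pair coordinates. The function name `dot_plot_points` is hypothetical.

```python
def dot_plot_points(genome_x, genome_y):
    """Return (x, y) index pairs where the gene at position x in genome_x
    and the gene at position y in genome_y share an ortholog family."""
    positions = {}
    for j, family in enumerate(genome_y):
        positions.setdefault(family, []).append(j)
    return [
        (i, j)
        for i, family in enumerate(genome_x)
        for j in positions.get(family, [])
    ]

# Hypothetical toy genomes; the second swaps B and C.
points = dot_plot_points(["A", "B", "C", "D"], ["A", "C", "B", "D"])
```

Conserved order shows up as points along a diagonal and rearrangements as points off it, which gives an overview but, as the critique notes, no view of the neighborhood itself.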
34. Overview Visualization: Sequence Surveyor
• Not this domain problem, but interesting approach
• Each gene is drawn as a rectangle
• Several possible variables for position: Ordinal position
• Several possible variables for color:
– Position in one reference genome
– Use a color ramp, for wide range of colors
Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
35. Overview Visualization: Sequence Surveyor
• Pixel-density scalable
– High-density representation
– High-detail representation
• Display size scalability
– May be difficult to compare patterns from one side of display to another
• Perceptual Scalability
– Colors allow for pre-attentive identification of patterns
– Avoids visual clutter
Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.
36. Copy number variations on big displays
• Orchestral:
– Visualization of a different data type
– Effective use of color to enable pre-attentive identification of similarities across genomes
– High-density representation
– Details-up-close, overview from a distance
Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.
37. BactoGeNIE Demo
• Video at: https://www.youtube.com/watch?v=yrSyi1RWcUw
38. Program details
• Implemented in C++ using Qt and the QGraphicsView framework
• Upload:
– genome feature files
– Fasta files (raw gene sequences)
• Cd-hit algorithm processes sequence files to compute ortholog ‘clusters’
• MySQL database to store big datasets
– Loads 1000 contigs into memory, rest stored in database
• Optimized for PubMed datasets
• Prototyped on E. coli draft genomes
– Capable of displaying any contigs from thousands of E. coli draft genomes
• On EVL Cyber-commons wall, around 400 contigs in view
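To illustrate the ortholog-clustering step, here is a minimal sketch of reading CD-HIT cluster output into cluster membership lists. It is written in Python for brevity rather than the C++ of the actual tool, it assumes the usual `.clstr` text layout (a `>Cluster N` header followed by one member line per sequence), and the sequence names in the sample are hypothetical. How BactoGeNIE itself ingests these files is not described in the slides.

```python
import re

# A member line looks like: "0\t2799aa, >seq_name... *"
MEMBER = re.compile(r">(.+?)\.\.\.")

def parse_clstr(text):
    """Map cluster number -> list of sequence names from .clstr text."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        elif current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters

# Hypothetical sample in the assumed layout.
sample = """>Cluster 0
0\t310aa, >s1_gene1... *
1\t308aa, >s2_gene7... at 97.40%
>Cluster 1
0\t120aa, >s1_gene2... *
"""
clusters = parse_clstr(sample)

# Invert to gene -> cluster, the lookup a renderer needs for coloring.
gene_to_cluster = {g: c for c, members in clusters.items() for g in members}
```

Treating every cluster as an ortholog family follows the slide's wording; whether a given identity threshold really yields orthologs is a separate biological question.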
39. BactoGeNIE: High density representation
• Compressed genome encoding
• No text labels, instead ‘on-demand’
• No ‘comparative track’
• Encode orthology using
– User applied color: pre-attentive orthology identification
– Coordinated highlighting: scalable visual query
– Alignment: use space to encode similarity
40. Use space to encode similarity
• Goals:
– Make it easier to perform comparisons across many genomes (Analytic task scalability)
– Accommodate increased display size (Display Size Scalability)
– Make similarities and differences easy to see (Perceptual Scalability)
• Sorting and Alignment
– Sort by contig length
– Sort by gene content
– Dynamically align against any gene
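Sorting and alignment can be sketched in a few lines. This is an illustrative assumption about how such a layout might be computed, not BactoGeNIE's C++ implementation: contigs are ordered lists of ortholog-family labels, horizontal offsets are measured in gene slots rather than base pairs, and the function names are hypothetical.

```python
def align_offsets(contigs, target):
    """For contigs containing `target`, return the shift (in gene slots)
    that places the first copy of the target in one shared column."""
    index = {
        name: genes.index(target)
        for name, genes in contigs.items()
        if target in genes
    }
    if not index:
        return {}
    column = max(index.values())
    return {name: column - i for name, i in index.items()}

def sort_by_length(contigs):
    """Contig names ordered longest first; ties keep input order."""
    return sorted(contigs, key=lambda name: len(contigs[name]), reverse=True)

# Hypothetical toy contigs.
contigs = {
    "c1": ["A", "B", "C", "D"],
    "c2": ["X", "Y", "A", "B", "C"],
    "c3": ["B", "C"],
    "c4": ["Q", "R"],
}
offsets = align_offsets(contigs, "B")
order = sort_by_length(contigs)
```

After shifting each row by its offset, every copy of the target gene sits in the same column, so neighbors to its left and right can be compared by looking straight up and down the display.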
41. Interactivity
• On hovering, contig expands in height, so easier to select genes of interest in high-density view
• ‘Pop-up’ menu for each gene that gives info and allows for:
– application of color:
• ‘tagging’ operation
• Scalable query
– “targeting” operation (described next)
• User can sort genomes by:
– Gene target
– Contig length
42. ‘Gene Targeting’ Function to create high resolution, comparative ‘maps’
• User selects a gene of interest
• This gene is given a base color
• Two color ramps are applied to adjacent genes, one ‘upstream’ and one ‘downstream’
• Orthologous genes in related genomes are given the same colors
• Contigs containing this gene are brought to the top
• The target gene is centered
• Orthologs are aligned to the target
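The steps above can be sketched as follows. This is a hedged reading of the slide, not the tool's actual code: the specific colors, the `reach` parameter, the linear RGB interpolation and all function names are assumptions made for illustration.

```python
def ramp(start, end, steps):
    """`steps` RGB colors moving linearly from `start` toward `end`."""
    colors = []
    for k in range(1, steps + 1):
        t = k / steps
        colors.append(tuple(round(s + (e - s) * t) for s, e in zip(start, end)))
    return colors

def target_colors(reference, target_index, reach=3,
                  base=(255, 255, 0), up_end=(0, 0, 255), down_end=(255, 0, 0)):
    """Assign the base color to the target's family and one ramp each to
    the families upstream and downstream of it in the reference contig."""
    colors = {reference[target_index]: base}
    upstream = reference[max(0, target_index - reach):target_index][::-1]
    downstream = reference[target_index + 1:target_index + 1 + reach]
    for family, color in zip(upstream, ramp(base, up_end, reach)):
        colors.setdefault(family, color)
    for family, color in zip(downstream, ramp(base, down_end, reach)):
        colors.setdefault(family, color)
    return colors

def bring_to_top(contigs, target):
    """Contig names with those containing the target family first."""
    return sorted(contigs, key=lambda name: target not in contigs[name])

# Hypothetical reference contig; the user targets family "C".
colors = target_colors(["A", "B", "C", "D", "E"], target_index=2, reach=2)
contigs = {"c1": ["X", "Y"], "c2": ["B", "C", "D"], "c3": ["C"]}
order = bring_to_top(contigs, "C")
# Orthologs in other contigs pick up the same colors via their family label.
row_colors = [colors.get(family) for family in contigs["c2"]]
```

Because colors are keyed by ortholog family, a gene in any other contig that belongs to the same family is drawn in the same color, so a conserved neighborhood appears as the same color sequence repeated down the display.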
43. Gene targeting function
• Clustering to promote direct comparisons
• Overviews at a distance
• Details up close
• Pre-attentive identification of similarities and differences between gene neighborhoods
Lance Long
49. Preliminary User Feedback
• A version of BactoGeNIE used by computational biology team on NxN pixels and MxM inches resolution tiled display wall
• “This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes”
• “Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest.
• This tool is really unique and it’s the only tool that I am aware of that can scale up to any number of genome comparisons.
• The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns.
• This tool has a potential to serve both for visualization as well as data mining needs.”
Usage of a version without the gene targeting approach. Future study will concentrate on this feature with a wider community of users
50. Summary of contributions
• A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view
• First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color and color ramps
• First to show overviews with gene neighborhood details that can be accessed through physical movement
• Introduces a novel visualization approach, ‘gene targeting’, that translates genomic data into high-resolution genomic maps
51. What’s next?
Design
• Integration with different levels of detail
• Multiple color ramps
• Advanced ordering in y, based on similarity to target or strain phylogeny
Implementation
• Scalability in rendering using parallelization on the GPU
• Port to SAGE
Evaluation
• User studies and evaluations of perceptual scalability
52. Scalable Design, Big Data, Big Displays
• Need visualization to provide an interface between automated analysis and the expert
• Porting existing visual approaches to big data and big displays will not always work
• Need to design for increased
– pixel-density
– display size
– volume of analytical tasks
53. Thanks!
• Acknowledgements:
– Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory
– Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin and the rest of the computational biology team at Monsanto
• Questions?