This document discusses plans for fully characterizing a reference genome sample. It addresses determining library preparation protocols, sequencing platforms, difficult regions to sequence, error correction and verification approaches, whether to generate new or use existing data, and other considerations like dividing the genome into easy, medium and hard to sequence regions. The goal is to develop a consensus plan to experimentally characterize a reference material.
Measurements for Reference Material Characterization Working Group Summary, Aug 2012
1. Measurements for Reference Material Characterization
Develop a consensus plan for experimental characterization of Reference Materials
Genome in a Bottle Working Group
2. Determine library preparation protocols
• Fosmid and/or a BAC library?
• Need to keep some cells for platform-specific DNA purification needs (e.g., to get high-molecular-weight DNA)
• Include sequencing family members
3. Determine sequencing platforms
• Illumina
– 300-600bp inserts
– 3ish kb mate-pair
– 2x100bp reads on HiSeq
• PacBio
– >10kbish inserts
– 1x90m “movies”
– Who will do it and pay for it?
– PacBio may be able to create the libraries
– Chris Mason to sequence?
• Life Tech
– 5500 data at NIST (1, 6, and 10kb mate pairs)
– Ion Torrent/Proton data (who to generate?)
• Complete
– Standard libraries and LFR approach
– Start with cells
• 454?
– 700-800bp reads?
• Newer technologies – are they used for verification/validation only?
– Oxford Nanopore?
– GnuBio?
4. “Error Correction” and Verification (not validation)
• ArrayCGH and SNP Chip
• OpGen (or other optical mapping approaches)
• Targeted sequencing
5. Difficult to sequence regions
• Characterize MHC regions?
– Use 454?
• Other parts of the genome?
• Approved CLIA-certified specific tests?
– Highly multiplexed TaqMan/Sanger style assay
• CLIA-certified whole exome data
• Do we pick very specific well-characterized fosmids?
6. Old vs New Data
• What new data will we need to generate on the actual reference sample?
• Can we use data that currently exists?
• Or generate data now on vanilla Coriell samples?
7. Other Thoughts
• BAC by BAC approach?
• Divide the genome into the easy, medium, and hard bits
– Easy = same call all the time on every platform
• How do we account for technological/algorithmic improvements?
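The easy/medium/hard split on slide 7 could be sketched roughly as below. Only the "easy" rule (same call all the time on every platform) comes from the slides. The "medium" and "hard" rules, the function name `classify_site`, the platform labels, and the genotype strings are illustrative assumptions, not something the working group defined.

```python
from typing import Dict, List


def classify_site(calls_by_platform: Dict[str, List[str]]) -> str:
    """Bin one genomic site as 'easy', 'medium', or 'hard'.

    calls_by_platform maps a platform name to the genotype calls it
    produced for this site across repeated runs.
    """
    # Assumption: a site where any platform made no call at all is 'hard'.
    if not calls_by_platform or any(
        len(calls) == 0 for calls in calls_by_platform.values()
    ):
        return "hard"

    distinct_per_platform = [set(calls) for calls in calls_by_platform.values()]
    distinct_overall = set().union(*distinct_per_platform)

    # From the slides: easy = same call all the time on every platform.
    if len(distinct_overall) == 1:
        return "easy"

    # Assumption: each platform is self-consistent, but platforms disagree.
    if all(len(distinct) == 1 for distinct in distinct_per_platform):
        return "medium"

    # Assumption: at least one platform disagrees with itself across runs.
    return "hard"


if __name__ == "__main__":
    # Hypothetical calls for three sites; not real data.
    sites = {
        "site_a": {"Illumina": ["0/1", "0/1"], "PacBio": ["0/1"]},
        "site_b": {"Illumina": ["0/1", "0/1"], "PacBio": ["1/1"]},
        "site_c": {"Illumina": ["0/1", "1/1"], "PacBio": ["0/1"]},
    }
    for name, calls in sites.items():
        print(name, classify_site(calls))
```

Keeping the rule as a pure function of the per-platform calls means the bins can be recomputed whenever a platform or caller improves, which is one possible way to approach the last bullet on this slide.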