2. In the year 1992…
• Freetext Database = Document/NoSQL Database
• Massive Datasets
  – 19043 records!!!
  – Approx. 8k per record
3. The Drug
• Data Analysis was 'Exciting'
• 2-3 days to write the analysis program
• Processing would occur overnight
• Statistics required 'whole set' processing
4. The Hit
• Mornings were 'the hit'
• The joy of real data analysis is the output of a good report
• Get good stats
  – I know how many teachers teach Geography in Scotland!
  – I know 400 people have purchased our History software!
• The wait and the results kept us working
5. In the year 2002
• Grid computing was the drug
• Building 200-2000 node grid systems
• Analysis could happen the same day
• Datasets could be huge
  – They just took more hours
• Still working on entire datasets
  – Statistics still required whole-set processing
• Jobs became monotonous
• More about construction and technology than stats
6. In the year 2012
• Need info and statistics quicker than ever
• Database clusters provide the backbone
  – Grids without the headache
• Build a query in seconds; get the result in seconds
• Need statistics in different ways:
  – Live
  – Online (and sometimes user visible)
  – Whole of set and partial set, but based on Big Data
• Slice and dice in more ways without effort
7. Couchbase Background Stats
• Couchbase 1.8 already hits interesting numbers
• Draw Something (OMGPOP), within 6 weeks:
  – 15 million daily active users
  – 3000 drawings generated every two seconds
  – Over two billion stored drawings
  – 90 nodes, 3 clusters
  – No stops!
8. The New Drug
• Couchbase Server 2.0
• Cluster-based database
• Fast, Scalable, Predictable
• Map/Reduce-based querying
• JavaScript/Web-based interface
  – Type in your query, get your results
• Instant Gratification!
9. The Data End
• Store data however you want
• The Map will sort it out for us
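"The Map will sort it out" can be sketched as a view map function that only emits when the fields it needs are present, so documents of any shape can live in the same bucket. This is a minimal sketch, not a real Couchbase view: the document shape (`type`, `product`) is assumed, and the server-provided `emit()` is simulated locally so the function can be run standalone.

```javascript
// Sketch of a schemaless-tolerant map function. Documents that
// lack the expected fields are simply skipped, so heterogeneous
// data "sorts itself out" at map time.
function mapFn(doc) {
  var rows = [];
  // stand-in for the emit(key, value) the view engine provides
  function emit(key, value) { rows.push([key, value]); }

  if (doc.type === "purchase" && doc.product) {
    emit(doc.product, 1);   // one index row per purchase
  }
  // anything else contributes nothing to the index
  return rows;
}
```

In a real view the function would call the global `emit()` rather than return rows; the local wrapper is only there to make the sketch testable.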
11. Map/Reduce Creates Indexes
• Not Hadoop
• Map/Reduce creates an index
• Map *AND* Reduce output are stored
• Index is used for queries
• Makes queries faster (obviously!)
• Index is 'materialized' at query time
  – Updated, not recreated
• Incremental map/reduce
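A view that stores both outputs pairs a map with a reduce. The sketch below assumes a hypothetical `subject` field; the map rows and the reduced total are both kept in the index, so a query can return either raw rows or the summed value. The `emit()` in the map is the one the view engine supplies at index time.

```javascript
// Sketch of a view definition: map output and reduce output are
// both stored, and the reduce is written so it also works when fed
// its own previous output (rereduce), enabling incremental updates.
var view = {
  map: function (doc) {
    if (doc.subject) {
      emit(doc.subject, 1);   // one row per matching document
    }
  },
  reduce: function (keys, values, rereduce) {
    // A sum is rereduce-safe: on the second pass the values are
    // previous sums, and a sum of sums is still the right total.
    var total = 0;
    for (var i = 0; i < values.length; i++) total += values[i];
    return total;
  }
};
```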
15. Incremental Reduce
• Required at two levels
  – During cluster-based queries
  – During index updates
• Incremental reduce requires preparation
• Reduce functions must be able to consume their own output
• Roll-your-own only
  – No external libraries
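The "consume their own output" rule is easiest to see with a count. A sketch, using the standard `(keys, values, rereduce)` view signature: a naive `values.length` count is wrong on the rereduce pass, because by then the values are previous counts rather than raw rows.

```javascript
// Broken: on rereduce the values are prior counts, so taking the
// length of the array undercounts.
function brokenCount(keys, values, rereduce) {
  return values.length;
}

// Safe: branch on the rereduce flag so the function can consume
// its own output — counting rows on the first pass, summing prior
// counts on later passes.
function safeCount(keys, values, rereduce) {
  if (!rereduce) return values.length;   // first pass: raw rows
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;                          // rereduce: sum of counts
}
```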
16. Tips for Incremental
• Use simple values when possible
• Use complex (JSON) structures
  – Allows for more incremental structure
  – Store the 'current' result
  – Store the information needed for the incremental result
• Identify rereduce:
  – function(keys, values, rereduce) {}
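The complex-structure tip can be sketched with a running mean: the output JSON carries both the 'current' result (`mean`) and the information needed to update it incrementally (`sum` and `count`). The field names here are illustrative, not from the deck.

```javascript
// Sketch: a rereduce-safe mean. First pass consumes raw numbers;
// the rereduce pass consumes this function's own output objects.
function meanReduce(keys, values, rereduce) {
  var sum = 0, count = 0;
  for (var i = 0; i < values.length; i++) {
    if (!rereduce) {
      sum += values[i];          // raw numbers from the map
      count += 1;
    } else {
      sum += values[i].sum;      // consume our own prior output
      count += values[i].count;
    }
  }
  // keep both the answer and the state needed to extend it
  return { sum: sum, count: count, mean: sum / count };
}
```

Returning only `mean` would look simpler but loses the weighting information, making a correct rereduce impossible; that is exactly why the extra fields are stored.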
21. Why is the excitement back?
• Data in is easy; no schema, no formatting, no updates
• Data out is about the stats
  – Not how we are going to produce them
• Queries are live
• Tweaks and updates and extensions are live
• Multiple views, multiple queries
• Reduce is optional (raw data)
• Massive datasets are not a problem