6. Map Reduce
MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.
http://research.google.com/archive/mapreduce.html
Friday, April 26, 13
7. In detail
• Developer specifies 2 methods:
map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input data
• Produces key/value pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values
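The two signatures above can be sketched as a single-process runner. This is only an illustration of the programming model, not Google's or Couchbase's implementation; the names `mapReduce`, `wordCountMap` and `wordCountReduce`, and the word-count task itself, are assumptions chosen for the example.

```javascript
// Minimal single-process MapReduce runner (illustrative only).
function mapReduce(inputs, map, reduce) {
  const groups = new Map(); // out_key -> list(intermediate_value)
  for (const [inKey, inValue] of inputs) {
    for (const [outKey, value] of map(inKey, inValue)) {
      if (!groups.has(outKey)) groups.set(outKey, []);
      groups.get(outKey).push(value);
    }
  }
  const output = {};
  for (const [outKey, values] of groups) {
    output[outKey] = reduce(outKey, values);
  }
  return output;
}

// map(in_key, in_value) -> list(out_key, intermediate_value)
function wordCountMap(docId, text) {
  return text
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => [word.toLowerCase(), 1]);
}

// reduce(out_key, list(intermediate_value)) -> list(out_value)
function wordCountReduce(word, counts) {
  return [counts.reduce((a, b) => a + b, 0)];
}

const counts = mapReduce(
  [
    ["d1", "map reduce map"],
    ["d2", "reduce"],
  ],
  wordCountMap,
  wordCountReduce
);
console.log(counts); // "map" and "reduce" each occur twice
```

In a real cluster the grouping step runs across machines; here it is just an in-memory `Map`.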
11. Couchbase Open Source Project
• Leading NoSQL database project focused on distributed database technology and surrounding ecosystem
• Supports both key-value and document-oriented use cases
• All components are available under the Apache License 2.0
• Obtained as packaged software in both enterprise and community editions
12. Couchbase Server Core Principles
• Easy Scalability: grow the cluster without application changes and without downtime, with a single click
• Consistent High Performance: consistent sub-millisecond read and write response times with consistent high throughput
• Always On 24x365: no downtime for software upgrades, hardware maintenance, etc.
• Flexible Data Model: JSON document model with no fixed schema
13. Additional Couchbase Server Features
• Built-in clustering – all nodes equal
• Data replication with auto-failover
• Zero-downtime maintenance
• Built-in managed cache
• Append-only storage layer
• Online compaction
• Monitoring and admin API & UI
• SDK for a variety of languages
14. Couchbase Server 2.0 Architecture
[Figure: architecture diagram with two halves, Data Manager and Cluster Manager.]
• Data Manager
  – Query API and Query Engine (port 8092)
  – Moxi (port 11211, Memcapable 1.0)
  – Memcached with the Couchbase EP Engine behind the storage interface (port 11210, Memcapable 2.0)
  – New Persistence Layer
• Cluster Manager (Erlang/OTP)
  – HTTP REST management API / Web UI (port 8091)
  – Erlang port mapper (port 4369)
  – Distributed Erlang (ports 21100-21199)
  – Heartbeat, process monitor, global singleton supervisor, configuration manager, rebalance orchestrator, node health monitor, vBucket state and replication manager (labeled "on each node" or "one per cluster" in the diagram)
15. Couchbase Server 2.0 Architecture
[Figure: architecture diagram annotated with the implementation language of each layer.]
• RAM Cache, Indexing & Persistence Management (C & V8)
  – Query API and Query Engine (port 8092)
  – Moxi (port 11211, Memcapable 1.0)
  – Couchbase EP Engine behind the storage interface (port 11210, Memcapable 2.0)
  – Object-level cache
  – Disk persistence (New Persistence Layer)
• Server/Cluster Management & Communication (Erlang/OTP)
  – HTTP REST management API / Web UI (port 8091)
  – Erlang port mapper (port 4369)
  – Distributed Erlang (ports 21100-21199)
  – Heartbeat, process monitor, global singleton supervisor, configuration manager, rebalance orchestrator, node health monitor, vBucket state and replication manager (labeled "on each node" or "one per cluster" in the diagram)
See: "The Unreasonable Effectiveness of C" by Damien Katz
16. Basic Operation
• Docs distributed evenly across servers
• Each server stores both active and replica docs
  – For a given doc, only one server is active at a time
• Client library provides app with simple interface to database
• Cluster map provides map to which server doc is on
  – App never needs to know
• App reads, writes, updates docs
• Multiple app servers can access same document at same time
[Figure: Couchbase Server cluster of three servers, each holding active and replica docs. Two app servers, each with a Couchbase client library and a cluster map, send READ/WRITE/UPDATE requests to the active copy. User-configured replica count = 1.]
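The role of the cluster map can be sketched as a lookup from document key to partition to server. Everything below is a simplified assumption for illustration: the partition count, the `toyHash` function and the `clusterMap` layout are hypothetical and are not taken from the Couchbase client library.

```javascript
// Hypothetical cluster map: partition number -> server holding the active copy.
const NUM_PARTITIONS = 8; // illustrative value only
const clusterMap = {
  servers: ["server1", "server2", "server3"],
  // active[p] is the index (into servers) of the active copy of partition p
  active: [0, 1, 2, 0, 1, 2, 0, 1],
};

// Toy string hash (assumption: stands in for whatever hash a real client uses).
function toyHash(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  return h;
}

// The client library, not the app, decides which server to contact.
function serverFor(key, map) {
  const partition = toyHash(key) % NUM_PARTITIONS;
  return map.servers[map.active[partition]];
}

const target = serverFor("u::1", clusterMap);
console.log(target);
```

Because every client computes the same mapping from the same cluster map, several app servers reach the same active copy of a document without coordinating with each other.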
19. Look at a document
Key -> JSON OBJECT ("DOCUMENT")
{
  "string" : "string",
  "string" : value,
  "string" : {
    "string" : "string",
    "string" : value
  },
  "string" : [ array ]
}
• How to find a document based on its attributes?
  – get employee by email
  – get products by type
  – ...
• You need to look "into" the document/value
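Looking "into" the document is what a view's map function does. The sketch below indexes documents by their `email` attribute. The `(doc, meta)` signature and `emit` call follow the shape of Couchbase 2.0 view functions; the `emit` stub, the `currentId` bookkeeping and the sample documents are assumptions added so the example runs on its own.

```javascript
// Stand-in for the view engine: collects whatever the map function emits.
const rows = [];
let currentId = null;
function emit(key, value) {
  rows.push({ key: key, value: value, id: currentId });
}

// Map function: index every document that has an email attribute.
function byEmail(doc, meta) {
  if (doc.email) {
    emit(doc.email, null);
  }
}

// Hypothetical sample documents.
const docs = [
  { meta: { id: "u::5" }, doc: { type: "user", email: "math@couchbase.com" } },
  { meta: { id: "u::1" }, doc: { type: "user", email: "abba@couchbase.com" } },
  { meta: { id: "p::9" }, doc: { type: "product" } }, // no email: not indexed
];

for (const d of docs) {
  currentId = d.meta.id;
  byEmail(d.doc, d.meta);
}

// The index is kept sorted by key.
rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
console.log(rows);
```

"Get employee by email" then becomes a lookup on the emitted key rather than a scan over every document.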
24.
doc.email                 meta.id
abba@couchbase.com        u::1
beta@couchbase.com        u::7
jasdeep@couchbase.com     u::2
math@couchbase.com        u::5
matt@couchbase.com        u::6
yeti@couchbase.com        u::4
zorro@couchbase.com       u::3
?startkey="b1"&endkey="zz"
Pulls the index keys within the UTF-8 range specified by the startkey and endkey.
?startkey="bz"&endkey="zn"
Pulls the index keys within the UTF-8 range specified by the startkey and endkey.
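The two range queries can be checked against the table with a small sketch. It assumes plain JavaScript string comparison, which agrees with the slide for these lowercase ASCII keys (a real server may collate differently), and it treats `endkey` as inclusive. The `rangeQuery` helper is hypothetical.

```javascript
// The index from the table, already sorted by key.
const index = [
  { key: "abba@couchbase.com", id: "u::1" },
  { key: "beta@couchbase.com", id: "u::7" },
  { key: "jasdeep@couchbase.com", id: "u::2" },
  { key: "math@couchbase.com", id: "u::5" },
  { key: "matt@couchbase.com", id: "u::6" },
  { key: "yeti@couchbase.com", id: "u::4" },
  { key: "zorro@couchbase.com", id: "u::3" },
];

// Return every row whose key lies in [startkey, endkey].
function rangeQuery(rows, startkey, endkey) {
  return rows.filter((row) => row.key >= startkey && row.key <= endkey);
}

const first = rangeQuery(index, "b1", "zz").map((row) => row.id);
const second = rangeQuery(index, "bz", "zn").map((row) => row.id);
console.log(first); // beta through zorro
console.log(second); // jasdeep through yeti
```

The first range starts just before "beta" and ends after "zorro"; the second starts after "beta" and ends before "zorro", so both of those rows drop out.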
25.
doc.email                 meta.id
abba@couchbase.com        u::1
beta@couchbase.com        u::7
jasdeep@couchbase.com     u::2
math@couchbase.com        u::5
matt@couchbase.com        u::6
yeti@couchbase.com        u::4
zorro@couchbase.com       u::3
?key="math@couchbase.com"
Matches a single index key.
26.
doc.email                 meta.id
abba@couchbase.com        u::1
beta@couchbase.com        u::7
jasdeep@couchbase.com     u::2
math@couchbase.com        u::5
matt@couchbase.com        u::6
yeti@couchbase.com        u::4
zorro@couchbase.com       u::3
?keys=["math@couchbase.com", "yeti@couchbase.com"]
Queries multiple keys in the set (array notation).
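Exact-match lookups with `key` and `keys` can be sketched the same way. The `lookup` helper is hypothetical, and returning rows in the order of the requested keys is a choice made for this example rather than a statement about server behavior.

```javascript
// The index from the table, sorted by key.
const emailIndex = [
  { key: "abba@couchbase.com", id: "u::1" },
  { key: "beta@couchbase.com", id: "u::7" },
  { key: "jasdeep@couchbase.com", id: "u::2" },
  { key: "math@couchbase.com", id: "u::5" },
  { key: "matt@couchbase.com", id: "u::6" },
  { key: "yeti@couchbase.com", id: "u::4" },
  { key: "zorro@couchbase.com", id: "u::3" },
];

// Return the rows whose key equals one of the requested keys.
function lookup(rows, keys) {
  return keys.flatMap((wanted) => rows.filter((row) => row.key === wanted));
}

// ?key="math@couchbase.com"
const single = lookup(emailIndex, ["math@couchbase.com"]).map((row) => row.id);

// ?keys=["math@couchbase.com", "yeti@couchbase.com"]
const multiple = lookup(emailIndex, [
  "math@couchbase.com",
  "yeti@couchbase.com",
]).map((row) => row.id);

console.log(single, multiple);
```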
28. Indexing and Querying
• Indexing work is distributed amongst nodes
  – Large data set possible
  – Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes
[Figure: a query from an app server fans out to the three servers of the Couchbase Server cluster, each holding active and replica docs. User-configured replica count = 1.]
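The "each node indexes its own data, queries combine the results" idea can be sketched as a scatter-gather. The per-node contents, the shortened keys and the helper names are assumptions for illustration, and the merge step is a simple concatenate-and-sort rather than whatever merge the server actually performs.

```javascript
// Each node holds a sorted index over only the documents stored on it.
const nodeIndexes = {
  server1: [
    { key: "abba", id: "u::1" },
    { key: "math", id: "u::5" },
  ],
  server2: [
    { key: "beta", id: "u::7" },
    { key: "zorro", id: "u::3" },
  ],
  server3: [{ key: "jasdeep", id: "u::2" }],
};

// Runs on one node: range scan over the local index only.
function queryNode(rows, startkey, endkey) {
  return rows.filter((row) => row.key >= startkey && row.key <= endkey);
}

// Scatter the query to every node, then gather and order the partial results.
function queryCluster(indexes, startkey, endkey) {
  const partials = Object.values(indexes).map((rows) =>
    queryNode(rows, startkey, endkey)
  );
  return partials
    .flat()
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const merged = queryCluster(nodeIndexes, "b", "n").map((row) => row.key);
console.log(merged); // beta, jasdeep, math
```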
29. Couchbase Server 2.0: Views
• Views can cover a few different use cases
  – Primary index
  – Simple secondary indexes (the most common)
  – Complex secondary, tertiary and composite indexes
  – Aggregation functions (reduction)
    • Example: count the number of "North American Ales"
  – Organizing related data
• Built using Map/Reduce
  – Map function creates a matrix from document fields
  – Reduce function summarizes (reduces) information
30. Distributed Index Build Phase
• Optimized for lookups, in-order access and aggregations
• All view reads from disk (different performance profile)
• View builds against every document on every node
  – This is why you should group them in a design document
• Automatically kept up to date
  – "Incremental" Map Reduce
31. Dynamic Range Queries with Optional Aggregation
• Efficiently fetch a row or group of related rows
• Queries use cached values from B-tree inner nodes when possible
• Take advantage of in-order tree traversal with group_level queries
[Figure: three servers, each holding active docs and replica docs.]
?startkey="J"&endkey="K"
{"rows":[{"key":"Juneau","value":null}]}
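A `group_level` query can be sketched as one in-order pass over rows with array keys, counting rows that share the same key prefix. The sample rows and the `groupCount` helper are assumptions for illustration; a real index would use reductions cached in B-tree inner nodes instead of rescanning rows.

```javascript
// Rows as a map function might emit them, sorted by key: [category, style].
const beerRows = [
  { key: ["Belgian", "Dubbel"], value: 1 },
  { key: ["Belgian", "Dubbel"], value: 1 },
  { key: ["Belgian", "White"], value: 1 },
  { key: ["North American", "Pale Ale"], value: 1 },
];

// Count rows per key prefix of length groupLevel.
// Relies on the rows being sorted, so equal prefixes are adjacent.
function groupCount(sortedRows, groupLevel) {
  const out = [];
  for (const row of sortedRows) {
    const groupKey = row.key.slice(0, groupLevel);
    const last = out[out.length - 1];
    if (last && JSON.stringify(last.key) === JSON.stringify(groupKey)) {
      last.value += 1;
    } else {
      out.push({ key: groupKey, value: 1 });
    }
  }
  return out;
}

const byCategory = groupCount(beerRows, 1); // like group_level=1
const byStyle = groupCount(beerRows, 2); // like group_level=2
console.log(JSON.stringify(byCategory));
console.log(JSON.stringify(byStyle));
```

Because equal prefixes sit next to each other in key order, one forward traversal is enough to produce every group.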
32. Append Only Index
• Disk activity is slow
• Updating disk blocks is very slow
• Appending new data to the end of the current file is fast
• Overhead of reverse reading is small
• Because existing blocks are not re-used, can lead to fragmentation
  – Couchbase will compact the index automatically
[Figure: docs pass through the view processor to disk; changed documents are appended after the original data.]
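The trade-off in these bullets can be sketched with an in-memory log: writes only append, reads scan backwards from the end, and compaction drops superseded entries. The `AppendOnlyIndex` class is a hypothetical illustration and says nothing about the actual on-disk file format.

```javascript
class AppendOnlyIndex {
  constructor() {
    this.log = []; // stands in for the index file on disk
  }

  // Never overwrite: a changed entry is appended after the original.
  put(key, value) {
    this.log.push({ key, value });
  }

  // Read in reverse so the most recent entry for a key wins.
  get(key) {
    for (let i = this.log.length - 1; i >= 0; i--) {
      if (this.log[i].key === key) return this.log[i].value;
    }
    return undefined;
  }

  // Rewrite the log keeping only the latest entry per key.
  compact() {
    const latest = new Map();
    for (const entry of this.log) latest.set(entry.key, entry.value);
    this.log = Array.from(latest, ([key, value]) => ({ key, value }));
  }
}

const idx = new AppendOnlyIndex();
idx.put("a", 1);
idx.put("b", 2);
idx.put("a", 3); // the update is appended; the old "a" entry is now garbage
const sizeBefore = idx.log.length;
idx.compact();
const sizeAfter = idx.log.length;
console.log(sizeBefore, sizeAfter, idx.get("a"));
```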
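33. Adding a new Document
[Figure: B-tree index with a reduction (here a count) stored in every node.]
Leaf keys: A B C D F G H I K L N O Q R
Leaf nodes: A-C (3), D-F (2), G-H (2), I-L (3), N-R (4)
Inner nodes: A-H (7), I-R (7)
Root: A-R (14)
Adding the new key M creates new reductions along its path: M-R (5), I-R (8), and a new root A-R (15).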
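The figure's numbers can be reproduced with a small path-copying sketch: inserting a key builds new nodes only along the path from leaf to root, each with a recomputed count, while untouched subtrees are shared with the old tree. The helper names are hypothetical, and node splitting, which a real B-tree needs, is left out.

```javascript
// A leaf stores keys; an inner node stores children. Both cache a count.
function leaf(keys) {
  return { keys, count: keys.length };
}
function inner(children) {
  return { children, count: children.reduce((n, c) => n + c.count, 0) };
}
function lastKey(node) {
  return node.keys
    ? node.keys[node.keys.length - 1]
    : lastKey(node.children[node.children.length - 1]);
}

// Returns a NEW node; nodes off the insertion path are reused, never modified.
function insert(node, key) {
  if (node.keys) {
    return leaf([...node.keys, key].sort());
  }
  // Descend into the first child whose range can hold the key.
  let i = node.children.findIndex((child) => key <= lastKey(child));
  if (i === -1) i = node.children.length - 1;
  const children = node.children.slice();
  children[i] = insert(children[i], key);
  return inner(children); // new reduction for this level
}

const oldRoot = inner([
  inner([leaf(["A", "B", "C"]), leaf(["D", "F"]), leaf(["G", "H"])]),
  inner([leaf(["I", "K", "L"]), leaf(["N", "O", "Q", "R"])]),
]);
const newRoot = insert(oldRoot, "M");

console.log(oldRoot.count, newRoot.count); // 14 15
```

The old root still describes the tree as it was before the insert, which is what lets an append-only file keep earlier versions readable until compaction.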
34. What about Reduce?
• Out of the box functions:
  – _count()
  – _sum()
  – _stats()
• Create your own if needed
function(key, values, rereduce) {
  if (rereduce) {
    // values are partial counts from earlier reduce calls: add them up
    var result = 0;
    for (var i = 0; i < values.length; i++) {
      result += values[i];
    }
    return result;
  } else {
    // values are rows emitted by the map function: count them
    return values.length;
  }
}
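The `rereduce` flag is easiest to see by running the function in two stages. The sketch below gives the custom reduce a name, `countReduce`, so it can be called directly; the split of the rows into two batches is an assumption that stands in for rows held in different parts of the index.

```javascript
// The custom reduce shown on the slide, named so it can be called here.
function countReduce(key, values, rereduce) {
  if (rereduce) {
    var result = 0;
    for (var i = 0; i < values.length; i++) {
      result += values[i];
    }
    return result;
  } else {
    return values.length;
  }
}

// First pass: reduce is applied to separate batches of mapped values.
const partA = countReduce(null, [1, 1, 1], false);
const partB = countReduce(null, [1, 1], false);

// Second pass: the partial results are combined with rereduce = true.
const total = countReduce(null, [partA, partB], true);

console.log(partA, partB, total); // 3 2 5
```

Counting on the first pass and summing on the second is why the two branches differ: calling `values.length` again on the partial results would return 2 instead of 5.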
35. Reduce Function
• Takes a key and arrays of values as parameters
• Written in JavaScript
• Called after the map function
• Used to reduce the map results to single values
• Used with grouping
• Can be skipped when querying (reduce=false), reusing the same index
36. Reduce in Action
• Map() Result
Key                        Value
Belgian-Style Dubbel       1
Belgian-Style Dubbel       1
Belgian-Style Dubbel       1
Belgian-Style Pale Ale     1
Belgian-Style White        1
Belgian-Style White        1
...                        ...
• Reduce()
_count()
• Result
Key                        Value
Belgian-Style Dubbel       3
Belgian-Style Pale Ale     1
Belgian-Style White        2
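37. How to use it?
• Use client SDK to call the view:
import com.couchbase.client.protocol.views.*;

// Look up the "by_name" view in the "beer" design document.
View view = client.getView("beer", "by_name");
Query query = new Query();
query.setIncludeDocs(true)
     .setLimit(20)
     .setRangeStart(ComplexKey.of(startKey))
     // "\uefff" sorts after ordinary characters, so the range covers keys that start with startKey
     .setRangeEnd(ComplexKey.of(startKey + "\uefff"));
ViewResponse result = client.query(view, query);
for (ViewRow row : result) {
    // ...
}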
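Underneath the SDK call, a view query is an HTTP request to the query port (8092 in the architecture slides) with JSON-encoded parameters such as `startkey` and `endkey`. The sketch below only builds such a URL. The host, the bucket name `beer-sample` and the `viewUrl` helper are assumptions for illustration, and the path layout should be checked against the documentation for the server version in use.

```javascript
// Build a view query URL; parameter values are JSON-encoded, then URL-escaped.
function viewUrl(host, bucket, designDoc, view, params) {
  const query = Object.entries(params)
    .map(
      ([name, value]) =>
        `${name}=${encodeURIComponent(JSON.stringify(value))}`
    )
    .join("&");
  return `http://${host}:8092/${bucket}/_design/${designDoc}/_view/${view}?${query}`;
}

const startKey = "A";
const url = viewUrl("localhost", "beer-sample", "beer", "by_name", {
  startkey: startKey,
  endkey: startKey + "\uefff", // same prefix-range trick as in the Java code
  limit: 20,
});
console.log(url);
```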
39. Hadoop ≠ Couchbase
Hadoop
• Deals with "Big Data"
• "More" is better than "Faster"
• Batch oriented
• Usually used to "extract/transform" data
• Fully distributed Map, Shuffle, Reduce
Couchbase
• Distributed
• Executed where the document is
• Deals with "indexing" data
• As fast as possible
• Used to query the data in the database
40. Map Reduce in Couchbase
• As in many other NoSQL databases: used for queries!
• Indexes are distributed on each node of the cluster
• Indexes are updated incrementally
• Write your Map Reduce in JavaScript