1. NoSQL & BigData
Why Every NoSQL Deployment Should be Paired with Hadoop
Tugdual Grall
Couchbase
@tgrall
2. About Me
• Tugdual “Tug” Grall
• Couchbase: Technical Evangelist
• eXo: CTO
• Oracle: Developer/Product Manager
• Mainly Java/SOA Developer in consulting firms
• Web
• @tgrall
• http://blog.grallandco.com
• tgrall
• NantesJUG co-founder
• Pet Project: http://www.resultri.com
3. Big Data: High Data Variety and Velocity
[Chart: data volume in trillions of gigabytes (zettabytes), 0 to 2.00, for 2000, 2006, and 2011]
Source: IDC 2011 Digital Universe Study (http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm)
More Flexible Data Model Required
• Usually when people talk about Big Data, they talk about capturing huge amounts of data and analyzing it. This reference to Big Data is certainly a big trend.
• But Big Data affects operational databases in a big way as well, for a different set of reasons.
• There are two aspects of Big Data that are pushing people toward NoSQL technologies.
• The first is that the vast majority of the increase in data is in the form of unstructured or semi-structured data: user-generated content such as consumer recommendations, and machine-generated data such as log files and website click data. Relational databases aren’t well suited to storing this type of data, while NoSQL technologies such as document-oriented databases are ideally suited to it.
• The second is that application developers are finding new types of data they want to store all the time. It might be new information they want to store in a user’s account profile, new logging information, etc. The point is that what developers want to store is changing very rapidly, and the amount of data they want to store is increasing very rapidly. The result is that developers want a very flexible data model that they can evolve very quickly.
• Relational databases have fixed schemas that often take weeks or months to change. NoSQL databases, on the other hand, are schema-less. As a result, you can far more easily add new types of data and iterate quickly on your application.
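The flexible data model described in these notes can be illustrated with a small sketch. It is not tied to any particular product: a plain Python dict stands in for a document store, and the document keys and field names (`user::alice`, `twitter`, `last_logins`) are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical user-profile documents; field names are illustrative only.
profile_v1 = {"type": "user", "name": "Alice", "email": "alice@example.com"}

# A later release starts storing extra information. No ALTER TABLE or
# migration step is involved: the new fields simply appear in new documents.
profile_v2 = {
    "type": "user",
    "name": "Bob",
    "email": "bob@example.com",
    "twitter": "@bob",
    "last_logins": ["2013-01-10T09:00:00Z"],
}

# A plain dict stands in for a key-value/document store (assumption).
store = {}
store["user::alice"] = json.dumps(profile_v1)
store["user::bob"] = json.dumps(profile_v2)


def twitter_handle(store, key):
    """Read a document and tolerate fields that older documents lack."""
    doc = json.loads(store[key])
    # Returns None for documents written before the field existed.
    return doc.get("twitter")
```

The trade-off is that the application, rather than the database, has to cope with documents of different shapes, which is why the reader function uses `get` instead of assuming the field is present.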
5. Big Data: High Data Variety and Velocity
[Chart: data volume in trillions of gigabytes (zettabytes), 0 to 2.00, for 2000, 2006, and 2011; series: Structured Data]
6. Big Data: High Data Variety and Velocity
[Chart: data volume in trillions of gigabytes (zettabytes), 0 to 2.00, for 2000, 2006, and 2011; series: Structured Data; Unstructured and Semi-Structured Data (text, log files, click streams, blogs, tweets, audio, video, etc.)]
7. Operational vs. Analytic Databases
NoSQL
• Analytic Databases: get insights from data (Cloudera, Hortonworks)
• Real-time, Interactive Databases: fast access to data (Couchbase, Mongo)
• There are two types of databases. Each is focused on a very different problem.
• Analytic databases were referred to in the past as OLAP databases. They are focused on looking through every record in a huge database to answer a question or gain an insight about the data contained in it. These analyses are batch processes that access every piece of data in the database, are very “read” heavy, and produce results in seconds, minutes, or sometimes days. For analytic databases, “real time” means an analysis takes a few seconds to run.
• Real-time interactive databases are often referred to as operational databases. They store a lot of data, but usually much less than an analytic database.
• They must provide access to individual records in a database in milliseconds so that users of an application get good response times.
• Since the requirements of each type of database are very different, the architectures and capabilities of each are very different as well.
• When I refer to NoSQL in my presentation, I am referring to real-time, interactive databases. This is the type of NoSQL database Couchbase provides.
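The two access patterns contrasted in these notes can be sketched in a few lines. This is only an illustration under simple assumptions: a Python dict stands in for the database, and the record keys and fields (`order::1`, `customer`, `total`) are hypothetical.

```python
# Hypothetical order records keyed by ID; a dict stands in for the database.
orders = {
    "order::1": {"customer": "alice", "total": 30.0},
    "order::2": {"customer": "bob", "total": 12.5},
    "order::3": {"customer": "alice", "total": 7.5},
}


def get_order(key):
    """Operational access: fetch one record by its key."""
    return orders[key]


def revenue_by_customer():
    """Analytic access: scan every record to compute an aggregate."""
    totals = {}
    for order in orders.values():
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0.0) + order["total"]
    return totals
```

The lookup touches a single record regardless of how much data is stored, while the aggregate has to read all of it. That difference is what drives the separate architectures for the two kinds of system.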
8. Lack of flexibility/rigid schemas: 49%
Inability to scale out data: 35%
Performance challenges: 29%
Cost: 16%
All of these: 12%
Other: 11%
Source: Couchbase Survey, December 2011, n = 1351.
18. Hadoop is not a “NoSQL database” but rather a set of tools for working with BigData:
the ultimate Swiss Army knife for dealing with very large volumes of data
Oozie: Workflow, coordination
Sqoop : Data connector to import/export data
Hive : SQL-Like interface
Pig : High level programming language
Mahout : Machine learning library
Whirr : Hadoop management tools for cloud services
Flume : Aggregator
Map Reduce : Framework to process large volume of data
HBase : Key Value data store
Zookeeper : Centralized configuration management
HDFS : Distributed file system
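To make the MapReduce entry in this list more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It does not use Hadoop or any Hadoop API: the log format, the sample lines, and all function names are assumptions made for illustration, and a real job would run these phases in parallel across a cluster.

```python
from collections import defaultdict

# Hypothetical web-server log lines: "<ip> <http-status> <path>".
log_lines = [
    "10.0.0.1 200 /index.html",
    "10.0.0.2 404 /missing",
    "10.0.0.1 200 /about.html",
    "10.0.0.3 500 /checkout",
]


def map_phase(line):
    """Emit (key, value) pairs: one (status, 1) pair per log line."""
    _ip, status, _path = line.split()
    yield status, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `run_job(log_lines)` counts requests per HTTP status code. Because each map call sees one line and each reduce call sees one key, the same logic can be spread over many machines.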
25. Moving Parts
In order to keep up with changing needs for richer, more targeted content that is delivered to larger and larger audiences very quickly, the data behind content-driven sites is shifting to Couchbase.
Hadoop excels at complex analytics, which may involve multiple processing steps that incorporate a number of different data sources.
[Diagram components: Content Driven Web Site, Original RDBMS, Couchbase Server Cluster, Logs, Hadoop Cluster. Labeled flows: flume flow, sqoop import, sqoop export.]
27. What is Sqoop?
Sqoop is a tool designed to transfer data between Hadoop and relational
databases.
You can use Sqoop to import data from a relational database management
system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File
System (HDFS), transform the data in Hadoop MapReduce, and then
export the data back into an RDBMS.
sqoop.apache.org
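The import, transform, and export round trip described above can be mimicked in a small sketch. This is not Sqoop and uses none of its commands: an in-memory SQLite database stands in for the RDBMS, a list of delimited text lines stands in for files in HDFS, and the table and function names (`page_views`, `page_totals`, `import_rows`, and so on) are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the RDBMS (MySQL, Oracle, ...).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
db.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", 3), ("/about", 1), ("/home", 4)],
)
db.execute("CREATE TABLE page_totals (page TEXT PRIMARY KEY, total INTEGER)")


def import_rows(conn):
    """'Import' step: copy table rows out of the RDBMS as delimited text lines."""
    rows = conn.execute("SELECT page, views FROM page_views")
    return [f"{page},{views}" for page, views in rows]


def transform(lines):
    """'Transform' step: aggregate views per page (the job Hadoop would run)."""
    totals = {}
    for line in lines:
        page, views = line.split(",")
        totals[page] = totals.get(page, 0) + int(views)
    return totals


def export_rows(conn, totals):
    """'Export' step: write the results back into an RDBMS table."""
    conn.executemany("INSERT INTO page_totals VALUES (?, ?)", sorted(totals.items()))
    conn.commit()


export_rows(db, transform(import_rows(db)))
```

After the last line runs, `page_totals` holds one aggregated row per page. In a real deployment the import and export steps would be Sqoop jobs and the transform a MapReduce job on the cluster.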