Table of Contents

Executive Summary ............................................. 2
Big Data Classification ....................................... 3
Hadoop-based Architecture Approaches .......................... 5
  Data Lake ................................................... 5
  Lambda ...................................................... 5
  Choosing the Correct Architecture ........................... 5
Data Lake Architecture ........................................ 9
  Generic Data Lake Architecture ............................. 11
  Steps Involved ............................................. 12
Lambda Architecture .......................................... 13
  Batch Layer ................................................ 14
  Serving Layer .............................................. 14
  Speed Layer ................................................ 14
  Generic Lambda Architecture ................................ 16
References ................................................... 17
EXECUTIVE SUMMARY

Apache Hadoop didn't disrupt the datacenter; the data did. Soon after corporate IT functions within enterprises adopted large-scale systems to manage data, the Enterprise Data Warehouse (EDW) emerged as the logical home of all enterprise data. Today, every enterprise has a data warehouse that serves to model and capture the essence of the business from its enterprise systems.

The explosion of new types of data in recent years (inputs such as the web and connected devices, or just sheer volumes of records) has put tremendous pressure on the EDW. In response to this disruption, an increasing number of organizations have turned to Apache Hadoop to help manage the enormous increase in data while maintaining the coherence of the data warehouse.

This POV discusses Apache Hadoop and its capabilities as a data platform and data-processing engine: how the core of Hadoop and its surrounding ecosystem meet enterprise requirements to integrate alongside the data warehouse and other enterprise data systems as part of a modern data architecture. This is a step on the journey toward delivering an enterprise 'Data Lake' or a Lambda architecture (immutable data + views).

An enterprise data lake provides the following core benefits to an enterprise:
• New efficiencies for data architecture through a significantly lower cost of storage, and through optimization of data-processing workloads such as data transformation and integration.
• New opportunities for business through flexible 'schema-on-read' access to all enterprise data, and through multi-use, multi-workload data processing on the same sets of data, from batch to real time.

Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data sets across clusters of computers. MapReduce is a batch query processor targeted at long-running background processes. Hadoop can handle Volume. But to handle Velocity, we need real-time processing tools that can compensate for the high latency of batch systems and serve the most recent data continuously, as new data arrives and older data is progressively integrated into the batch framework. The answer to this problem is the Lambda architecture.
Big Data Classification

Big data workloads can be classified along the following dimensions:
• Processing type: batch; near real time; real time + batch
• Processing methodology: prescriptive; predictive; diagnostic; descriptive
• Data frequency: on demand; continuous; real time; batch
• Data type: transactional; historical; master data; metadata
• Content format: structured; unstructured (images, text, videos, documents, emails, etc.); semi-structured (XML, JSON)
• Data sources: machine generated; web & social media; IoT; human generated; transactional data; via other data providers
It's helpful to look at the characteristics of big data along certain lines, for example, how the data is collected, analyzed, and processed. Once the data and its processing are classified, they can be matched with the appropriate big data analysis architecture:
• Processing type - Whether the data is analyzed in real time or batched for later analysis. Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types ("near real time" or micro-batch) may also be required by the use case.
• Processing methodology - The type of technique to be applied for processing data (e.g., predictive, analytical, ad hoc query, and reporting). Business requirements determine the appropriate processing methodology, and a combination of techniques can be used. The choice of processing methodology helps identify the appropriate tools and techniques to be used in your big data solution.
• Data frequency and size - How much data is expected and at what frequency it arrives. Knowing frequency and size helps determine the storage mechanism, storage format, and the necessary preprocessing tools. Data frequency and size depend on data sources:
  • On demand, as with social media data
  • Continuous feed, real time (weather data, transactional data)
  • Time series (time-based data)
• Data type - The type of data to be processed: transactional, historical, master data, and others. Knowing the data type helps segregate the data in storage.
• Content format - The format of incoming data: structured (RDBMS, for example), unstructured (audio, video, and images, for example), or semi-structured. Format determines how the incoming data needs to be processed and is key to choosing tools and techniques and to defining a solution from a business perspective.
• Data source - Where the data is generated: web and social media, machine generated, human generated, etc. Identifying all the data sources helps determine the scope from a business perspective.
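The classification dimensions above lend themselves to a simple structured representation. The sketch below is purely illustrative: the class name, field names, and the crude routing heuristic are this document's assumptions, not part of any library, and a real decision would weigh far more factors.

```python
from dataclasses import dataclass

@dataclass
class BigDataWorkload:
    """Captures the classification dimensions described above (illustrative)."""
    processing_type: str   # "batch", "near-real-time", "real-time+batch"
    methodology: str       # "prescriptive", "predictive", "diagnostic", "descriptive"
    data_frequency: str    # "on-demand", "continuous", "real-time", "batch"
    data_type: str         # "transactional", "historical", "master", "metadata"
    content_format: str    # "structured", "semi-structured", "unstructured"
    source: str            # "machine", "web/social", "iot", "human", ...

def suggest_architecture(w: BigDataWorkload) -> str:
    """Rough heuristic only: workloads that need simultaneous real-time and
    batch views point toward Lambda; others can start as a data lake."""
    if w.processing_type == "real-time+batch":
        return "lambda"
    return "data-lake"

clickstream = BigDataWorkload("real-time+batch", "descriptive", "continuous",
                              "transactional", "semi-structured", "web/social")
print(suggest_architecture(clickstream))  # -> lambda
```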
Hadoop-based Architecture Approaches

Data Lake

A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.
Lambda

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of Lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of MapReduce.

Choosing the Correct Architecture
The comparison below evaluates the Data Lake and Lambda approaches parameter by parameter:
Simultaneous access to real-time and batch data
• Data Lake: A data lake can use real-time processing technologies like Storm to return real-time results, but in such a scenario historical results cannot be made available. If technologies like Spark are used to process real-time and historical data together on request, there can be significant delays in response time to clients compared to Lambda architecture.
• Lambda: The serving layer merges the output of the batch layer and the speed layer before returning the results of user queries. Because the data is already processed into views at both layers, the response time is significantly lower.
Latency
• Data Lake: Latency is high compared to Lambda, as real-time data needs to be processed together with historical data on demand or as part of a batch.
• Lambda: Low-latency real-time results are produced by the speed layer, and batch results are precomputed in the batch layer. On request, the two results are simply merged, resulting in low latency for real-time processing.
Ease of data governance
• Data Lake: The term conveys the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.
• Lambda: The serving layer gives access to processed and analyzed data. As users get access to processed data directly, this can lead to top-down data governance issues.
Updates in source data
• Data Lake: As the data lake stores only raw data, updates are simply appended to the raw data. This makes it difficult for business users to write business logic that ensures only the latest version of each record is considered in calculations.
• Lambda: Batch views are always computed from scratch in Lambda architecture. As a result, updates are easily incorporated into the calculated views in each reprocessing batch cycle.
Fault tolerance against human errors
• Data Lake: Data scientists or business users running business logic on the relevant raw data in the data lake might introduce human errors. Recovering from those errors is not difficult, since it is just a matter of re-running the logic; however, the reprocessing time for large datasets might cause some delays.
• Lambda: Lambda architecture assures fault tolerance not only against hardware failures but also against human errors. Recomputing the views from raw data in the batch layer every time ensures that human errors in business logic do not cascade to a level where they are unrecoverable.
Ease of use for business users
• Data Lake: Data is stored in raw format, and at times it is difficult for business users to use the data in its as-is condition.
• Lambda: Data is processed and made available with data definitions, and is sometimes groomed to make it digestible by data management tools. Consuming data from the serving layer makes life easy for business users.
Accuracy of real-time results
• Data Lake: In any scenario, users accessing data from the data lake have access to the immutable raw data, so they can do exact computations and always get accurate results.
• Lambda: In scenarios where real-time calculations need access to historical data, which the speed layer does not have, Lambda architecture returns estimated results. For example, a mean value cannot be calculated exactly until the whole of the historical data and the real-time data are referenced in one go; in such a scenario the serving layer returns estimated results.
Infrastructure cost
• Data Lake: A data lake processes the data as and when needed, so the cluster cost can be much lower than Lambda. Moreover, it persists only the raw data.
• Lambda: Lambda architecture persists not only the raw data but the processed data too, which leads to extra storage cost. Its data-processing life cycle is also designed so that as soon as one batch cycle finishes, a new batch cycle starts that includes the recently arrived data, while the speed layer is continuously processing real-time data.
OLAP
• Data Lake: Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be.
• Lambda: Because Lambda exposes processed views from the serving layer, not all attributes of the data may be available to a data scientist running analytical queries.
Historical data reference for processing
• Data Lake: OLAP and OLTP queries access the raw or groomed data directly from the data lake, making it feasible to access and reference historical data while processing data for a given time interval.
• Lambda: The speed layer has no reference to the historical data stored in the batch layer, making it difficult to run queries that reference historical data. For example, 'unique count'-type queries cannot return correct results from the speed layer alone. However, 'calculate average'-type queries can be done easily on the serving layer, by combining the results returned from the speed and batch layers on the fly.
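One way to make such an on-the-fly average exact is to have each layer keep a partial (sum, count) pair rather than a finished mean, since means cannot be averaged directly but partial sums and counts combine without loss. A minimal sketch, with names invented for illustration:

```python
def partial_stats(values):
    """Per-layer pre-aggregation: each layer exposes (sum, count), not a mean."""
    return (sum(values), len(values))

def merged_mean(batch_stats, speed_stats):
    """Serving-layer merge: combine partial sums and counts exactly."""
    total = batch_stats[0] + speed_stats[0]
    count = batch_stats[1] + speed_stats[1]
    return total / count if count else 0.0

batch = partial_stats([10, 20, 30, 40])   # historical data, batch layer
speed = partial_stats([50, 60])           # recent data, speed layer
print(merged_mean(batch, speed))          # -> 35.0
```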
Slowly changing dimensions
• Data Lake: Although the data lake has records of changed dimension attributes, extra business logic needs to be written by business users to cater for them.
• Lambda: Lambda architecture can easily cater for slowly changing dimensions by creating surrogate keys in parallel to natural keys whenever a change is detected in dimension attributes during the batch layer processing cycle.
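The surrogate-key approach described above can be sketched as a Type 2 slowly-changing-dimension update: when an attribute changes, the current row is closed out and a new row with a fresh surrogate key is issued alongside the unchanged natural key. All names here are hypothetical illustrations, not a real API.

```python
import itertools

_surrogate_seq = itertools.count(1)  # stand-in for a key generator

def apply_scd2(dimension_rows, natural_key, new_attrs, batch_id):
    """Type 2 SCD: close the old version and append a new surrogate-keyed row."""
    current = [r for r in dimension_rows
               if r["natural_key"] == natural_key and r["is_current"]]
    if current and current[0]["attrs"] == new_attrs:
        return dimension_rows                  # nothing changed this cycle
    for row in current:
        row["is_current"] = False              # close out the old version
        row["end_batch"] = batch_id
    dimension_rows.append({
        "surrogate_key": next(_surrogate_seq),
        "natural_key": natural_key,            # natural key stays the same
        "attrs": new_attrs,
        "start_batch": batch_id,
        "end_batch": None,
        "is_current": True,
    })
    return dimension_rows

dim = []
apply_scd2(dim, "CUST-1", {"city": "Pune"}, batch_id=1)
apply_scd2(dim, "CUST-1", {"city": "Mumbai"}, batch_id=2)
print(len(dim), dim[-1]["surrogate_key"])  # -> 2 2
```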
Slowly changing facts
• Data Lake: Both versions of a fact are available for users to look at, which leads to good analytical results if the fact life cycle is an attribute in the business logic for data analytics.
• Lambda: Although it is easy to change facts in Lambda architecture, doing so loses information about the fact's life cycle. As the previous state of a slowly changing fact is not available to a data scientist, analytical queries might not give the desired results on the views exposed by the serving layer.
Frequently changing business logic
• Data Lake: Changes to the processing code need to be made, but there is no clear solution for how the historically processed data should be handled.
• Lambda: As data is reprocessed from scratch, even if business logic changes frequently, the historical-data problem is resolved automatically.
Implementation lifecycle
• Data Lake: Fast to implement, as it eliminates the dependency on upfront data modeling.
• Lambda: Processing logic needs to be implemented at both the batch and speed layers, leading to significantly longer implementation time compared to a data lake.
Adding new data sources
• Data Lake: Very easy to add.
• Lambda: New sources need to be incorporated into the processing layers and require code changes.
"If you think of a datamart as a store of bottled water (cleansed and packaged and structured for easy consumption) the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
- James Dixon (Pentaho CTO)
Data Lake Architecture

Much of today's research and decision making is based on knowledge and insight that can be gained from analyzing and contextualizing the vast (and growing) amount of "open" or "raw" data. The underlying idea is very powerful: the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via "siloed" data maintained in warehouses.

The term "data lake" has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities. A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.

Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed. By delaying the categorization of data from the point of entry to the point of use, analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when structured), as well as via alternate means such as indexing and NoSQL derivatives.
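The schema-on-read idea can be sketched in a few lines: raw records land in the lake untouched, and a schema (types, defaults) is applied only at the moment of reading. The field names and the tiny schema format below are illustrative assumptions for this sketch only.

```python
import json

# Raw events stored as-is in the lake; the second record is missing a field,
# which schema-on-read tolerates.
raw_lines = [
    '{"user": "a", "ts": "2016-01-01T10:00:00", "bytes": "512"}',
    '{"user": "b", "ts": "2016-01-01T10:05:00"}',
]

# The "schema" lives with the reader, not with the storage.
read_schema = {"user": str, "ts": str, "bytes": int}

def read_with_schema(lines, schema):
    """Apply types and defaults at read time, not at ingest time."""
    for line in lines:
        record = json.loads(line)
        yield {field: caster(record[field]) if field in record else None
               for field, caster in schema.items()}

for rec in read_with_schema(raw_lines, read_schema):
    print(rec)
```

A different consumer could read the very same raw lines with a different schema, which is exactly the flexibility the paragraph above describes.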
Key Features
• Stores raw data: a single source of truth
• Data accessible to anyone authorized
• Polyglot persistence
• Supports multiple applications and workloads
• Low-cost, high-performance storage
• Flexible, easy-to-use data organization
• Self-service for end users
• More flexible in answering new questions
• Easy to add new data sources
• Loosely coupled architecture, enabling flexibility of analysis
• Eliminates the dependency on upfront data modeling, and is thereby fast to implement
• Storage is highly optimized, as raw data is stored

Disadvantages
• High latency for a composite analysis view of both real-time and historical data
• Raw data provides no relational structure, which is unfriendly for business analytics on the fly
In a practical sense, a data lake is characterized by three key attributes:
• Collect everything: A data lake contains all data, both raw sources retained over extended periods of time and any processed data.
• Dive in anywhere: A data lake enables users across multiple business units to refine, explore, and enrich data on their own terms.
• Flexible access: A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory, and other processing engines.
Generic Data Lake Architecture

[Diagram] Data sources (desktop and mobile, social media and cloud, operational systems, Internet of Things) feed an ingestion tier in real time, micro batches, or mega batches. A unified data management tier (data management, data access, workflow management, schematic metadata, grooming) sits over a processing tier that holds raw and processed, structured and unstructured data in HDFS, with in-memory and MapReduce/Hive/MPP engines, external storage, and a SQL/NoSQL query interface. A centralized management layer provides system monitoring and system management, and the platform delivers real-time, interactive, and batch insights that drive flexible actions.
Steps Involved

• Procuring data - The process of obtaining data and metadata and preparing them for eventual inclusion in a data lake.
• Obtaining data - Physically transferring the data from its source to the data lake.
• Describing data - A data scientist searching a data lake for useful data must be able to find the data relevant to his or her need, and this requires metadata for the data. Schematic metadata for a data set includes information about how the data is formatted and about its schema.
• Grooming data - Raw data is made consumable by analytics applications. In some scenarios, the grooming process uses schematic metadata to transform raw data into data that can be processed by standard data management tools.
• Provisioning data - The authentication and authorization policies by which consumers take data out of the data lake.
• Preserving data - Managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissioning, and renewals.
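The grooming step above can be sketched as metadata-driven parsing: the schematic metadata records how a raw file is delimited and typed, and the groomer uses it to emit records that standard tools can consume. The metadata keys and sample data here are invented for illustration.

```python
import csv
import io

# Hypothetical schematic metadata describing one raw feed in the lake.
schematic_metadata = {
    "delimiter": "|",
    "columns": ["sensor_id", "reading", "unit"],
    "types": {"reading": float},
}

raw_file = "s1|21.5|C\ns2|19.0|C\n"

def groom(raw_text, meta):
    """Use schematic metadata to turn raw lines into typed records."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=meta["delimiter"])
    for row in reader:
        record = dict(zip(meta["columns"], row))
        for col, caster in meta["types"].items():
            record[col] = caster(record[col])   # apply declared types
        yield record

print(list(groom(raw_file, schematic_metadata))[0])
# -> {'sensor_id': 's1', 'reading': 21.5, 'unit': 'C'}
```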
Lambda Architecture

The Lambda architecture is split into three layers: the batch layer, the serving layer, and the speed layer.
1. Batch layer (Apache Hadoop)
2. Serving layer (Cloudera Impala, Spark)
3. Speed layer (Storm, Spark, Apache HBase, Cassandra)
Key Features
• Low-latency simultaneous analysis of the (near) real-time information extracted from a continuous inflow of data, alongside persistent analysis of a massive volume of data
• Fault tolerant not only against hardware failure but against human error too
• Mistakes are corrected by re-computation
• Storage is highly optimized, as raw data is stored
Batch Layer

The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation: when new data arrives, it will be aggregated into the views when they are recomputed during the next MapReduce iteration. The views should be computed from the entire dataset, so the batch layer is not expected to update them frequently. Depending on the size of your dataset and cluster, each iteration could take hours.
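The batch layer's contract can be sketched in a few lines: an append-only master dataset, and views recomputed from scratch over all of it on every iteration. Here a simple page-view count stands in for a MapReduce job; the names are illustrative only.

```python
from collections import Counter

master_dataset = []   # append-only master dataset: (timestamp, url) events

def ingest(event):
    """Data is only ever appended, never updated in place (immutability)."""
    master_dataset.append(event)

def recompute_batch_view():
    """Recomputed from scratch over ALL data, like one MapReduce iteration."""
    return Counter(url for _ts, url in master_dataset)

ingest((1, "/home")); ingest((2, "/home")); ingest((3, "/about"))
print(recompute_batch_view()["/home"])  # -> 2
```

Because the view is a pure function of the whole dataset, a buggy view can always be fixed by correcting the logic and recomputing, which is the fault-tolerance property claimed above.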
Serving Layer

The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. The batch and serving layers alone do not satisfy any real-time requirement, however, because MapReduce is (by design) high latency and it could take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.
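A serving-layer query is conceptually just the merge of a precomputed batch view with the realtime delta view. A minimal sketch, with invented view contents:

```python
from collections import Counter

batch_view = Counter({"/home": 120, "/about": 30})    # precomputed, hours old
realtime_view = Counter({"/home": 7, "/pricing": 2})  # delta since last batch run

def query(url):
    """Merge the two views at query time; Counter returns 0 for unknown keys."""
    return batch_view[url] + realtime_view[url]

print(query("/home"))     # -> 127
print(query("/pricing"))  # -> 2
```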
Speed Layer

In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer, and it does this by computing realtime views in Storm. The realtime views contain only the delta results to supplement the batch views. While the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received.

What's clever about the speed layer is that the realtime views are intended to be transient: as soon as the data propagates through the batch and serving layers, the corresponding results in the realtime views can be discarded. This is referred to as "complexity isolation", meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary. Realtime views are discarded once the data they contain is represented in a batch view.

[Diagram] A timeline in which successive batch views cover the historical data, and realtime views cover only the window between the last batch run and now.

Disadvantages
• Maintaining copies of code that must produce the same result in two complex distributed systems
• Could return estimated or approximate results
• Expensive full recomputation is required for fault tolerance
• Requires high cluster uptime, as batch data needs to be processed continuously
• Requires more implementation time, as duplicate code needs to be written in separate technologies to process real-time and batch data
• Time taken to process a batch grows linearly with the data
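The incremental model and "complexity isolation" described above can be sketched as follows: each arriving event increments the realtime view, and the whole view is simply discarded once a new batch view has absorbed that data. Names are illustrative only.

```python
from collections import Counter

realtime_view = Counter()   # transient delta view maintained by the speed layer

def on_event(url):
    """Incremental, per-event update (contrast with batch recompute-from-scratch)."""
    realtime_view[url] += 1

def on_batch_view_published():
    """Complexity isolation: transient results are simply thrown away once the
    batch/serving layers represent the same data."""
    realtime_view.clear()

on_event("/home"); on_event("/home"); on_event("/docs")
print(realtime_view["/home"])        # -> 2
on_batch_view_published()
print(sum(realtime_view.values()))   # -> 0
```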
Generic Lambda Architecture

[Diagram] Incoming data streams feed both layers. In the batch layer, all data lands in HDFS and batch precompute jobs (MapReduce, Hive, Pig) build the batch views. The serving layer holds the precomputed views and summarized data and answers queries. In parallel, the speed layer (Storm or Spark) processes the stream and increments near-real-time views through stream summarization. A data management and access layer spans all three.