Hadoop Architecture Approaches

Miraj Godha
6/5/2015
  
Table of Contents

EXECUTIVE SUMMARY
Big data Classification
Hadoop-based architecture approaches
    Data Lake
    Lambda
    Choosing the correct architecture
Data Lake Architecture
    Generic Data lake Architecture
    Steps Involved
Lambda Architecture
    Batch Layer
    Serving layer
    Speed layer
    Generic Lambda Architecture
References
EXECUTIVE SUMMARY

Apache Hadoop didn't disrupt the datacenter; the data did. Shortly after corporate IT functions within enterprises adopted large-scale systems to manage data, the Enterprise Data Warehouse (EDW) emerged as the logical home of all enterprise data. Today, every enterprise has a data warehouse that serves to model and capture the essence of the business from its enterprise systems. The explosion of new types of data in recent years – from inputs such as the web and connected devices, or just sheer volumes of records – has put tremendous pressure on the EDW. In response to this disruption, an increasing number of organizations have turned to Apache Hadoop to help manage the enormous increase in data whilst maintaining coherence of the data warehouse. This POV discusses Apache Hadoop and its capabilities as a data platform and data-processing system, and how the core of Hadoop and its surrounding ecosystem meets the enterprise requirements to integrate alongside the data warehouse and other enterprise data systems as part of a modern data architecture: a step on the journey toward delivering an enterprise 'Data Lake' or Lambda architecture (immutable data + views).

An enterprise data lake provides the following core benefits to an enterprise: new efficiencies for data architecture through a significantly lower cost of storage and through optimization of data-processing workloads such as data transformation and integration; and new opportunities for business through flexible 'schema-on-read' access to all enterprise data, and through multi-use and multi-workload data processing on the same sets of data, from batch to real-time.

Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data sets across clusters of computers. MapReduce is a batch query processor targeted at long-running background processes. Hadoop can handle Volume. But to handle Velocity, we need real-time processing tools that can compensate for the high latency of batch systems, and serve the most recent data continuously, as new data arrives and older data is progressively integrated into the batch framework. The answer to that problem is the Lambda architecture.
Big data Classification

[Figure: big data classified along six dimensions]
• Processing Type: Batch, Near Real Time, Real Time + Batch
• Processing Methodology: Prescriptive, Predictive, Diagnostic, Descriptive
• Data Frequency: On Demand, Continuous, Real Time, Batch
• Data Type: Transactional, Historical, Master Data, Metadata
• Content Format: Structured; Unstructured (images, text, videos, documents, emails, etc.); Semi-Structured (XML, JSON)
• Data Sources: Machine Generated, Web & Social Media, IoT, Human Generated, Transactional Data, Via Other Data Providers
It's helpful to look at the characteristics of big data along certain lines – for example, how the data is collected, analyzed, and processed. Once the data and its processing are classified, they can be matched with the appropriate big data analysis architecture:
• Processing type – Whether the data is analyzed in real time or batched for later analysis. Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types, 'near real time' or 'micro batch', may also be required by the use case.
• Processing methodology – The type of technique to be applied for processing data (e.g., predictive, analytical, ad hoc query, and reporting). Business requirements determine the appropriate processing methodology, and a combination of techniques can be used. The choice of processing methodology helps identify the appropriate tools and techniques to be used in your big data solution.
• Data frequency and size – How much data is expected, and at what frequency it arrives. Knowing frequency and size helps determine the storage mechanism, storage format, and the necessary preprocessing tools. Data frequency and size depend on data sources:
    • On demand, as with social media data
    • Continuous feed, real time (weather data, transactional data)
    • Time series (time-based data)
• Data type – The type of data to be processed: transactional, historical, master data, and others. Knowing the data type helps segregate the data in storage.
• Content format – The format of incoming data: structured (RDBMS, for example), unstructured (audio, video, and images, for example), or semi-structured. Format determines how the incoming data needs to be processed, and is key to choosing tools and techniques and defining a solution from a business perspective.
• Data source – The sources of the data (where it is generated): web and social media, machine-generated, human-generated, etc. Identifying all the data sources helps determine the scope from a business perspective.
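The classification above exists to drive an architecture choice. As a toy illustration of how such a classification might be encoded and matched to an architecture, here is a minimal Python sketch; the field names and the decision rule are assumptions made for illustration, not part of any standard:

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    """Classification of a big data workload along the lines above."""
    processing_type: str      # "batch", "near-real-time", or "real-time+batch"
    needs_low_latency: bool   # must recent data be served with low latency?
    needs_history: bool       # do queries also reference historical data?

def suggest_architecture(profile: WorkloadProfile) -> str:
    """Illustrative rule of thumb: Lambda when low-latency views over recent
    data must coexist with batch views over history; otherwise a data lake
    with on-demand processing is usually sufficient."""
    if profile.needs_low_latency and profile.needs_history:
        return "Lambda"
    if profile.processing_type == "real-time+batch":
        return "Lambda"
    return "Data Lake"

# A dashboard that must show both live and historical counts:
print(suggest_architecture(WorkloadProfile("real-time+batch", True, True)))  # Lambda
# A monthly reporting workload over raw archived records:
print(suggest_architecture(WorkloadProfile("batch", False, True)))           # Data Lake
```

The point is not the specific rule but that each classification dimension narrows the architecture decision mechanically.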
Hadoop-based architecture approaches

Data Lake
A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.
Lambda
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of Lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of MapReduce.
Choosing the correct architecture
Simultaneous access to real-time and batch data
• Data Lake: A data lake can use real-time processing technologies such as Storm to return real-time results, but in that scenario historical results cannot be made available. If technologies like Spark are used to process both real-time and historical data on request, there can be significant delays in response time to clients compared to the Lambda architecture.
• Lambda: The Lambda architecture's serving layer merges the output of the batch layer and the speed layer before returning the results of user queries. As the data is already processed into views at both layers, the response time is significantly lower.

Latency
• Data Lake: Latency is high compared to Lambda, as real-time data must be processed together with historical data on demand or as part of a batch.
• Lambda: Low-latency real-time results are produced by the speed layer, and batch results are pre-processed in the batch layer. On request the two result sets are simply merged, yielding low latency for real-time processing.

Ease of data governance
• Data Lake: The data lake is a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.
• Lambda: The Lambda architecture's serving layer gives access to processed and analyzed data. As users get access to processed data directly, this can lead to top-down data governance issues.

Updates in source data
• Data Lake: As a data lake stores only raw data, updates are simply appended to the raw data. This makes it difficult for business users to write business logic in such a way that the latest updated records are considered in calculations.
• Lambda: Batch views are always computed from scratch in the Lambda architecture. As a result, updates can easily be incorporated into the computed views in each batch reprocessing cycle.

Fault tolerance against human errors
• Data Lake: Data scientists or business users running business logic on raw data in the data lake might introduce human errors. Recovering from those errors is not difficult, as it is just a matter of re-running the logic, but the reprocessing time for large datasets might lead to some delays.
• Lambda: The Lambda architecture assures fault tolerance not only against hardware failures but also against human errors. Recomputing the views from raw data in the batch layer every cycle ensures that any human error in business logic is not cascaded to a level where it is unrecoverable.

Ease of use for business users
• Data Lake: Data is stored in raw format with data definitions, and is only sometimes groomed to make it digestible by data management tools. At times it is difficult for business users to use the data as-is.
• Lambda: Data is processed and available from the serving layer, which makes life easy for business users.

Accuracy of real-time results
• Data Lake: In every scenario, users accessing data from the data lake have access to the immutable raw data, so they can perform exact computations and always get accurate results.
• Lambda: In scenarios where real-time calculations need access to historical data, which is not possible in the speed layer, the Lambda architecture returns estimated results. For example, a mean value cannot be calculated until the whole of the historical and real-time data is referenced in one go; in such a scenario, the serving layer returns estimated results.

Infrastructure cost
• Data Lake: The data lake architecture processes data as and when needed, so the cluster cost can be much lower than Lambda. Moreover, it persists only the raw data, whereas the Lambda architecture persists not only the raw data but the processed data too, leading to extra storage cost.
• Lambda: The Lambda data-processing lifecycle is designed so that as soon as one batch processing cycle finishes, a new cycle starts that includes the recently inserted data. Simultaneously, the speed layer is always processing the real-time data.

OLAP
• Data Lake: Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be.
• Lambda: As Lambda exposes the processed views from the serving layer, all the attributes of the data may at times not be available to data scientists running analytical queries.

Historical data reference for processing
• Data Lake: OLAP and OLTP queries access the raw or groomed data directly from the data lake, making it feasible to access and refer to historical data while processing data for a given time interval.
• Lambda: The speed layer has no reference to the historical data stored in the batch layer, making it difficult to run queries that refer to historical data. For example, 'unique count' queries cannot return correct results from the speed layer; 'calculating average' queries, however, can easily be done in the serving layer by generating the average from the results returned by the speed and batch layers on the fly.

Slowly changing dimensions
• Data Lake: Although the data lake has records of changed dimension attributes, extra business logic needs to be written by business users to cater for them.
• Lambda: The Lambda architecture can easily cater for slowly changing dimensions by creating surrogate keys parallel to natural keys whenever a change is detected in dimension attributes during the batch layer processing cycle.

Slowly changing facts
• Data Lake: In a data lake both versions of a fact are available for users to look at, which leads to good analytical results when the fact lifecycle is an attribute in the business logic for data analytics.
• Lambda: Although it is easy to change facts in the Lambda architecture, this leads to a loss of the fact lifecycle information. As the previous state of a slowly changing fact is not available to data scientists, analytical queries might not give the desired results on the views exposed by the serving layer.

Frequently changing business logic
• Data Lake: Changes to the processing code need to be made, but there is no clear solution for how the historically processed data should be handled.
• Lambda: As data is reprocessed from scratch, even if the business logic changes frequently, the historical-data problem is resolved automatically.

Implementation lifecycle
• Data Lake: A data lake is fast to implement, as it eliminates the dependency on upfront data modeling.
• Lambda: Processing logic needs to be implemented at both the batch and speed layers, leading to significantly more implementation time compared to a data lake.

Adding new data sources
• Data Lake: Very easy to add.
• Lambda: New sources need to be incorporated into the processing layers and require code changes.
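The 'average' versus 'unique count' distinction in the comparison above can be made concrete. If each layer keeps a (sum, count) pair, the serving layer can merge the two into an exact average on the fly, whereas per-layer distinct counts do not compose. A minimal in-memory sketch, with view structures assumed purely for illustration:

```python
# Each layer maintains a partial view as (sum, count) over its own data.
batch_view = {"sum": 9000.0, "count": 3000}   # precomputed over historical data
speed_view = {"sum": 110.0, "count": 20}      # incremented over recent data

def merged_average(batch, speed):
    """Serving-layer merge: (sum, count) pairs compose, so the merged
    average is exact even though neither layer saw all of the data."""
    total = batch["sum"] + speed["sum"]
    n = batch["count"] + speed["count"]
    return total / n

print(merged_average(batch_view, speed_view))  # exact average over all data

# Distinct counts do NOT compose this way: if the batch layer saw 500
# unique users and the speed layer saw 40, the true number of unique
# users is anywhere between 500 and 540; the per-layer counts alone
# cannot say how many of the 40 are repeats.
batch_unique, speed_unique = 500, 40
print(batch_unique + speed_unique)  # an upper bound, not an exact answer
```

This is why the serving layer can return exact averages but only estimated unique counts without a full pass over the historical data.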
"If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
– James Dixon (Pentaho CTO)
Data Lake Architecture

Much of today's research and decision making is based on knowledge and insight that can be gained from analyzing and contextualizing the vast (and growing) amount of "open" or "raw" data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via "siloed" data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.

A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop. Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed. By delaying the categorization of data from the point of entry to the point of use, analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when the data is structured), as well as via alternate means such as indexing and NoSQL derivatives.
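The schema-on-read idea described above can be sketched in a few lines of Python: raw records are stored untyped, and a schema is applied only at the point of use. The record contents, field names, and types here are assumptions made for illustration:

```python
import json

# Raw data lands in the lake as-is; no schema is declared at write time.
raw_records = [
    '{"user": "a", "amount": "12.50", "ts": "2015-06-01"}',
    '{"user": "b", "amount": "3.50",  "ts": "2015-06-02"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at the point of use, not the point of entry."""
    for line in lines:
        rec = json.loads(line)
        yield {field: cast(rec[field]) for field, cast in schema.items()}

# One analysis casts amount to float; another analysis could read the
# very same raw bytes with a completely different schema.
schema = {"user": str, "amount": float}
rows = list(read_with_schema(raw_records, schema))
print(sum(r["amount"] for r in rows))  # 16.0
```

Because the raw bytes are never rewritten, a later analysis that needs the `ts` field, or a different type for `amount`, simply reads with a different schema.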
Key Features
• Stores raw data – single source of truth
• Data accessible to anyone authorized
• Polyglot persistence
• Supports multiple applications & workloads
• Low-cost, high-performance storage
• Flexible, easy-to-use data organization
• Self-service for end users
• More flexible to answer new questions
• Easy to add new data sources
• Loosely coupled architecture – enables flexibility of analysis
• Eliminates the dependency on upfront data modeling – thereby fast to implement
• Storage is highly optimized, as raw data is stored

Disadvantages
• High latency for a composite analysis view of both real-time and historical data
• Raw data lacks relational structure, which is not friendly for business analytics on the fly
In a practical sense, a data lake is characterized by three key attributes:
• Collect everything: A data lake contains all data, both raw sources over extended periods of time as well as any processed data.
• Dive in anywhere: A data lake enables users across multiple business units to refine, explore and enrich data on their terms.
• Flexible access: A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
Generic Data lake Architecture

[Figure: generic data lake architecture. Data sources (desktop & mobile, social media and cloud, operational systems, Internet of Things (IOT)) arrive in real time, micro batch, and mega batch through an ingestion tier into HDFS storage, which holds raw, groomed, and processed data (unstructured and structured) alongside schematic metadata. A unified data management tier (data management, data access) and a processing tier (MapReduce/Hive/MPP, in-memory, workflow management) run under centralized management (system monitoring, system management), and a query interface (SQL, NoSQL, external storage) delivers real-time, interactive, and batch insights and flexible actions.]
Steps Involved
• Procuring data – The process of obtaining data and metadata and preparing them for eventual inclusion in a data lake.
• Obtaining data – Physically transferring the data from the source to the data lake.
• Describing data – A data scientist searching a data lake for useful data must be able to find the data relevant to his or her need, and for that they require metadata for the data. Schematic metadata for a data set includes information about how the data is formatted and information about its schema.
• Grooming data – Raw data is made consumable by analytics applications. In some scenarios the grooming process uses schematic metadata to transform raw data into data that can be processed by standard data management tools.
• Provisioning data – The authentication and authorization policies by which consumers take data out of the data lake.
• Preserving data – Managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissioning and renewal.
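The describing and grooming steps can be sketched together: a data set enters the lake with schematic metadata attached, and a grooming pass uses that metadata to turn raw delimited text into records a standard tool can process. The field names and formats below are assumptions made for illustration:

```python
import csv
import io

# Schematic metadata registered when the data set is procured: it tells a
# later consumer how the raw file is formatted, without altering the file.
schematic_metadata = {
    "dataset": "web_clicks_raw",
    "format": "csv",
    "delimiter": "|",
    "columns": ["ts", "user", "url"],
}

raw_blob = "2015-06-05T10:00|a|/home\n2015-06-05T10:01|b|/cart\n"

def groom(blob, meta):
    """Grooming: use the schematic metadata to transform raw text into
    records that standard data management tools can consume."""
    reader = csv.reader(io.StringIO(blob), delimiter=meta["delimiter"])
    return [dict(zip(meta["columns"], row)) for row in reader]

for rec in groom(raw_blob, schematic_metadata):
    print(rec["user"], rec["url"])
```

The raw blob stays in the lake untouched; grooming produces a derived, tool-friendly copy driven entirely by the registered metadata.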
Lambda Architecture

The Lambda architecture is split into three layers: the batch layer, the serving layer, and the speed layer.
1. Batch layer (Apache Hadoop)
2. Serving layer (Cloudera Impala, Spark)
3. Speed layer (Storm, Spark, Apache HBase, Cassandra)
Key Features
• Low-latency simultaneous analysis of (near) real-time information extracted from a continuous inflow of data, alongside persistent analysis of a massive volume of data
• Fault tolerant not only against hardware failure but against human error too
• Mistakes are corrected by re-computation
• Storage is highly optimized, as raw data is stored
Batch Layer

The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation, so when new data arrives it will be aggregated into the views when they are recomputed during the next MapReduce iteration.

The views should be computed from the entire dataset, and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours.
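The batch layer's contract, an append-only master dataset with views recomputed from the entire dataset, can be shown in miniature. This is a toy in-memory stand-in for HDFS and MapReduce, purely illustrative:

```python
from collections import Counter

# The master dataset is immutable and append-only: records are never
# updated in place, new facts are simply appended.
master_dataset = []

def append(record):
    master_dataset.append(record)

def recompute_batch_view():
    """Recompute the view from the ENTIRE dataset, as the batch layer does
    on each MapReduce iteration. Slow, but any earlier mistake in the view
    is wiped out by the next full recomputation."""
    return Counter(r["page"] for r in master_dataset)

append({"page": "/home"})
append({"page": "/cart"})
append({"page": "/home"})
print(recompute_batch_view())  # page-view counts over all records so far
```

Because the view is always derived from scratch, fixing a bug in `recompute_batch_view` and re-running it repairs every past result, which is the human-fault-tolerance property described earlier.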
Serving layer

The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. The batch and serving layers alone do not satisfy any real-time requirement, because MapReduce is (by design) high latency and it could take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.
Speed layer

In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer, and it does this by computing realtime views in Storm. The realtime views contain only the delta results that supplement the batch views.

Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received. What's clever about the speed layer is that the realtime views are intended to be transient: as soon as the data propagates through the batch and serving layers, the corresponding results in the realtime views can be discarded. This is referred to as "complexity isolation", meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary.

Disadvantages

• Maintaining two copies of code that need to produce the same result in two complex distributed systems
• Could return estimated or approximate results
• Expensive full recomputation is required for fault tolerance
• Requires high cluster up-time, as batch data needs to be processed continuously
• Requires more implementation time, as duplicate code needs to be written in separate technologies to process real-time and batch data
• Time taken to process a batch grows linearly with the size of the dataset
  
	
  
	
  
	
  
	
  
	
  
	
   	
  
[Figure: timeline of batch and realtime views. Realtime views are discarded once the data they contain is represented in a batch view.]
	
  
Generic Lambda Architecture

[Figure: generic Lambda architecture. Incoming data streams feed both the batch layer and the speed layer. The batch layer stores all data (HDFS) and precomputes batch views with MR / Hive / Pig; the speed layer processes the streams with Storm or Spark, incrementing near-real-time views through stream summarization. The serving layer holds the pre-computed views and summarized data, and queries are answered by merging batch views with real-time views via the data management & access layer.]
	
  
References

http://www.ibm.com/developerworks/library/bd-archpatterns1/
http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf
https://en.wikipedia.org/wiki/Lambda_architecture
http://voltdb.com/blog/simplifying-complex-lambda-architecture
http://en.wiktionary.org/wiki/data_lake
  
	
   	
  
	
  

Weitere ähnliche Inhalte

Was ist angesagt?

IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...IRJET Journal
 
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd Iaetsd
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
 
Data Ware Housing And Data Mining
Data Ware Housing And Data MiningData Ware Housing And Data Mining
Data Ware Housing And Data Miningcpjcollege
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousingSunny Gandhi
 
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...Gihan Wikramanayake
 
Seminar datawarehousing
Seminar datawarehousingSeminar datawarehousing
Seminar datawarehousingKavisha Uniyal
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
Data ware house architecture
Data ware house architectureData ware house architecture
Data ware house architectureDeepak Chaurasia
 
Data warehouse
Data warehouseData warehouse
Data warehouseRajThakuri
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data miningEr. Nawaraj Bhandari
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Data Warehousing & Basic Architectural Framework
Data Warehousing & Basic Architectural FrameworkData Warehousing & Basic Architectural Framework
Data Warehousing & Basic Architectural FrameworkDr. Sunil Kr. Pandey
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsCognizant
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dmsumit621
 
142230 633685297550892500
142230 633685297550892500142230 633685297550892500
142230 633685297550892500sumit621
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageIRJET Journal
 

Was ist angesagt? (20)

IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
 
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasets
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
Data Ware Housing And Data Mining
Data Ware Housing And Data MiningData Ware Housing And Data Mining
Data Ware Housing And Data Mining
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
Application of Data Warehousing & Data Mining to Exploitation for Supporting ...
 
Seminar datawarehousing
Seminar datawarehousingSeminar datawarehousing
Seminar datawarehousing
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digital
 
Data ware house architecture
Data ware house architectureData ware house architecture
Data ware house architecture
 
Data warehouse
Data warehouseData warehouse
Data warehouse
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data mining
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Data Warehousing & Basic Architectural Framework
Data Warehousing & Basic Architectural FrameworkData Warehousing & Basic Architectural Framework
Data Warehousing & Basic Architectural Framework
 
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
[IJET-V1I5P5] Authors: T.Jalaja, M.Shailaja
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
 
142230 633685297550892500
142230 633685297550892500142230 633685297550892500
142230 633685297550892500
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
The Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their UsageThe Big Data Importance – Tools and their Usage
The Big Data Importance – Tools and their Usage
 

Ähnlich wie Hadoop-based architecture approaches

About Streaming Data Solutions for Hadoop
About Streaming Data Solutions for HadoopAbout Streaming Data Solutions for Hadoop
About Streaming Data Solutions for HadoopLynn Langit
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewIRJET Journal
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdferamfatima43
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfrajsharma159890
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lakesambiswal
 
Gdpr ccpa automated compliance - spark java application features and functi...
Gdpr   ccpa automated compliance - spark java application features and functi...Gdpr   ccpa automated compliance - spark java application features and functi...
Gdpr ccpa automated compliance - spark java application features and functi...Steven Meister
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond Hill
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond HillDOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond Hill
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond HillClaraZara1
 
DOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEDOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEijsptm
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Become Data Driven With Hadoop as-a-Service
Become Data Driven With Hadoop as-a-ServiceBecome Data Driven With Hadoop as-a-Service
Become Data Driven With Hadoop as-a-ServiceMammoth Data
 
Stream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White PaperStream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White PaperImpetus Technologies
 
BI Masterclass slides (Reference Architecture v3)
BI Masterclass slides (Reference Architecture v3)BI Masterclass slides (Reference Architecture v3)
BI Masterclass slides (Reference Architecture v3)Syaifuddin Ismail
 

Ähnlich wie Hadoop-based architecture approaches (20)

About Streaming Data Solutions for Hadoop
About Streaming Data Solutions for HadoopAbout Streaming Data Solutions for Hadoop
About Streaming Data Solutions for Hadoop
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
Big data and oracle
Big data and oracleBig data and oracle
Big data and oracle
 
BD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdfBD_Architecture and Charateristics.pptx.pdf
BD_Architecture and Charateristics.pptx.pdf
 
Big Data
Big DataBig Data
Big Data
 
Big Data
Big DataBig Data
Big Data
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Big-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdfBig-Data-Analytics.8592259.powerpoint.pdf
Big-Data-Analytics.8592259.powerpoint.pdf
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
 
Gdpr ccpa automated compliance - spark java application features and functi...
Gdpr   ccpa automated compliance - spark java application features and functi...Gdpr   ccpa automated compliance - spark java application features and functi...
Gdpr ccpa automated compliance - spark java application features and functi...
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond Hill
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond HillDOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond Hill
DOCUMENT SELECTION USING MAPREDUCE Yenumula B Reddy and Desmond Hill
 
DOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCEDOCUMENT SELECTION USING MAPREDUCE
DOCUMENT SELECTION USING MAPREDUCE
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Become Data Driven With Hadoop as-a-Service
Become Data Driven With Hadoop as-a-ServiceBecome Data Driven With Hadoop as-a-Service
Become Data Driven With Hadoop as-a-Service
 
Stream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White PaperStream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White Paper
 
BI Masterclass slides (Reference Architecture v3)
BI Masterclass slides (Reference Architecture v3)BI Masterclass slides (Reference Architecture v3)
BI Masterclass slides (Reference Architecture v3)
 

Hadoop-based architecture approaches

  • 1.       2015   Miraj  Godha   6/5/2015   Hadoop  Architecture  Approaches  
  • 2. 1         Table  of  Contents   EXECUTIVE  SUMMARY  .................................................................................................................................  2   Big  data  Classification  ...................................................................................................................................  3   Hadoop-­‐based  architecture  approaches  ......................................................................................................  5   Data  Lake  ..................................................................................................................................................  5   Lambda  .....................................................................................................................................................  5   Choosing  the  correct  architecture  ............................................................................................................  5   Data  Lake  Architecture  .................................................................................................................................  9   Generic  Data  lake  Architecture  ..............................................................................................................  11   Steps  Involved  ....................................................................................................................................  12   Lambda  Architecture  ..................................................................................................................................  13   Batch  Layer  .............................................................................................................................................  14   Serving  layer  ...........................................................................................................................................  
14   Speed  layer  .............................................................................................................................................  14   Generic  Lambda  Architecture  ................................................................................................................  16   References  ..................................................................................................................................................  17            
  • 3. 2     EXECUTIVE  SUMMARY     Apache  Hadoop  didn’t  disrupt  the  datacenter,  the  data  did.  Shortly  after  Corporate  IT  functions  within   enterprises  adopted  large  scale  systems  to  manage  data  then  the  Enterprise  Data  Warehouse  (EDW)   emerged  as  the  logical  home  of  all  enterprise  data.  Today,  every  enterprise  has  a  Data  Warehouse  that   serves  to  model  and  capture  the  essence  of  the  business  from  their  enterprise  systems.  The  explosion  of   new  types  of  data  in  recent  years  –  from  inputs  such  as  the  web  and  connected  devices,  or  just  sheer   volumes  of  records  –  has  put  tremendous  pressure  on  the  EDW.  In  response  to  this  disruption,  an   increasing  number  of  organizations  have  turned  to  Apache  Hadoop  to  help  manage  the  enormous   increase  in  data  whilst  maintaining  coherence  of  the  Data  Warehouse.  This  POV  discusses  Apache   Hadoop,  its  capabilities  as  a  data  platform  and  data  processing.  How  the  core  of  Hadoop  and  its   surrounding  ecosystems  provides  the  enterprise  requirements  to  integrate  alongside  the  Data   Warehouse  and  other  enterprise  data  systems  as  part  of  a  modern  data  architecture.  A  step  on  the   journey  toward  delivering  an  enterprise  ‘Data  Lake’  or  Lambda  Architecture  (Immutable  data  +  views).     An  enterprise  data  lake  provides  the  following  core  benefits  to  an  enterprise:  New  efficiencies  for  data   architecture  through  a  significantly  lower  cost  of  storage,  and  through  optimization  of  data  processing   workloads  such  as  data  transformation  and  integration.  New  opportunities  for  business  through  flexible   ‘schema-­‐on-­‐read’  access  to  all  enterprise  data,  and  through  multi-­‐use  and  multi-­‐workload  data   processing  on  the  same  sets  of  data:  from  batch  to  real-­‐time.     
Apache  Hadoop  provides  both  reliable  storage  (HDFS)  and  a  processing  system  (MapReduce)  for  large   data  sets  across  clusters  of  computers.  MapReduce  is  a  batch  query  processor  that  is  targeted  at  long-­‐ running  background  processes.  Hadoop  can  handle  Volume.  But  to  handle  Velocity,  we  need  real-­‐time   processing  tools  that  can  compensate  for  the  high-­‐latency  of  batch  systems,  and  serve  the  most  recent   data  continuously,  as  new  data  arrives  and  older  data  is  progressively  integrated  into  the  batch   framework.  And  the  answer  to  the  problem  is  Lambda  Architecture.              
  • 4. 3     Big  data  Classification                                                 Processing  Type   Batch   Processing   Methodology   Near  Real  time   Real  Time  +  Batch   Prescriptive   Predictive   Diagnostic   Descriptive   Data  Frequency   On  demand   Continuous   Real  Time   Batch   Data  Type   Transactional   Historical   Master  data   Meta  data   Content  Format   Structured   Unstructured:-­‐Images,   Text,  Videos,  Documents,   emails  etc.   Semi-­‐Structured:  -­‐   XML,  JSON   Data  Sources   Machine   generated   Web  &  Social   media   IOT   Human   Generated   Transactional   data  Via  other  data  providers  
  • 5. 4     It's  helpful  to  look  at  the  characteristics  of  the  big  data  along  certain  lines  —  for  example,  how  the  data   is  collected,  analyzed,  and  processed.  Once  the  data  and  its  processing  are  classified,  it  can  be  matched   with  the  appropriate  big  data  analysis  architecture:     • Processing  type  -­‐  Whether  the  data  is  analyzed  in  real  time  or  batched  for  later  analysis.  Give  careful   consideration  to  choosing  the  analysis  type,  since  it  affects  several  other  decisions  about  products,  tools,   hardware,  data  sources,  and  expected  data  frequency.  A  mix  of  both  types  ‘Near  real  time  or  micro   batch”  may  also  be  required  by  the  use  case.   • Processing  methodology  -­‐  The  type  of  technique  to  be  applied  for  processing  data  (e.g.,  predictive,   analytical,  ad-­‐hoc  query,  and  reporting).  Business  requirements  determine  the  appropriate  processing   methodology.  A  combination  of  techniques  can  be  used.  The  choice  of  processing  methodology  helps   identify  the  appropriate  tools  and  techniques  to  be  used  in  your  big  data  solution.   • Data  frequency  and  size  -­‐  How  much  data  is  expected  and  at  what  frequency  does  it  arrive.  Knowing   frequency  and  size  helps  determine  the  storage  mechanism,  storage  format,  and  the  necessary   preprocessing  tools.  Data  frequency  and  size  depend  on  data  sources:   • On  demand,  as  with  social  media  data   • Continuous  feed,  real-­‐time  (weather  data,  transactional  data)   • Time  series  (time-­‐based  data)   • Data  type  -­‐  Type  of  data  to  be  processed  —  transactional,  historical,  master  data,  and  others.  Knowing   the  data  type  helps  segregate  the  data  in  storage.   
• Content  format  -­‐  Format  of  incoming  data  —  structured  (RDMBS,  for  example),  unstructured  (audio,   video,  and  images,  for  example),  or  semi-­‐structured.  Format  determines  how  the  incoming  data  needs   to  be  processed  and  is  key  to  choosing  tools  and  techniques  and  defining  a  solution  from  a  business   perspective.   • Data  source  -­‐  Sources  of  data  (where  the  data  is  generated)  —  web  and  social  media,  machine-­‐ generated,  human-­‐generated,  etc.  Identifying  all  the  data  sources  helps  determine  the  scope  from  a   business  perspective.          
  • 6. 5     Hadoop-­‐based  architecture  approaches     Data  Lake   A  data  lake  is  a  set  of  centralized  repositories  containing  vast  amounts  of  raw  data  (either  structured  or   unstructured),  described  by  metadata,  organized  into  identifiable  data  sets,  and  available  on  demand.   Data  in  the  lake  supports  discovery,  analytics,  and  reporting,  usually  by  deploying  cluster  tools  like   Hadoop.     Lambda   Lambda  architecture  is  a  data-­‐processing  architecture  designed  to  handle  massive  quantities  of  data  by   taking  advantage  of  both  batch-­‐  and  stream-­‐processing  methods.  This  approach  to  architecture   attempts  to  balance  latency,  throughput,  and  fault-­‐tolerance  by  using  batch  processing  to  provide   comprehensive  and  accurate  views  of  batch  data,  while  simultaneously  using  real-­‐time  stream   processing  to  provide  views  of  online  data.  The  two  view  outputs  may  be  joined  before  presentation.   The  rise  of  lambda  architecture  is  correlated  with  the  growth  of  big  data,  real-­‐time  analytics,  and  the   drive  to  mitigate  the  latencies  of  map-­‐reduce.   Choosing  the  correct  architecture    
  • 7. 6     Parameter   Data  Lake   Lambda   Simultaneous  access  to  Real   time  and  Batch  data     Data  Lake  can  use  real  time   processing  technologies  like   Storm  to  return  real  time   results,  however  in  such  a   scenario  historical  results   cannot  be  made  available.  If  we   use  technologies  like  Spark  to   process  data,  real  time  data  and   historical  data,  on  request  there   can  be  significant  delays  in   response  time  to  clients  as   compared  to  Lambda   architecture.     Lambda  Architecture’s  Serving   Layer  merges  the  output  of   Batch  Layer  and  Speed  Layer,   before  sending  the  results  of   user  queries.  As  data  is  already   processed  into  views  at  both   the  layers,  the  response  time  is   significantly  less.     Latency   Latency  is  high  as  compared  to   Lambda,  as  real  time  data  need   to  be  processed  with  historical   data  on-­‐demand  or  as  a  part  of   batch.   Low-­‐latency  real  time  results   are  processed  by  Speed  layer   and  Batch  results  are  pre-­‐ processed  in  Batch  layer.  On   request,  both  the  results  are   just  merged,  there  by  resulting   low  latency  time  for  real  time   processing.   Ease  of  Data  Governance   Data  lake  is  coined  to  convey   the  concept  of  centralized   repository  containing  virtually   inexhaustible  amounts  of  raw   data  (or  minimally  curated)  data   that  is  readily  made  available   anytime  to  anyone  authorized   to  perform  analytical  activities.   Lambda  architecture’s  serving   layer  gives  access  to  processed   and  analyzed  data.  As  uses  get   access  to  processed  data   directly,  it  can  lead  to  top  down   data  governance  issues.   
Updates  in  source  data   As  data  lake  stores  only  raw   data,  updates  are  just  appended   to  raw  data,  thereby  makes  life   of  business  users  difficult  to   write  business  logic,  in  such  a   way  that  latest  updated  records   are  considered  in  calculations.   Batch  Views  are  always   computed  from  starch  in   Lambda  Architecture.  As  a   result,  updates  can  be  easily   incorporated  in  calculated   Views  in  each  reprocess  batch   cycle.   Fault  tolerance  against  human   errors     Data  Scientist  or  business  users,   running  business  logic  on   relevant  raw  data  in  Data  Lake   might  lead  to  human  errors.   Although,  re-­‐covering  from   those  errors  is  not  difficult  as   it’s  just  a  matter  of  re-­‐running   the  logic.  However,  the   reprocessing  time  for  large   datasets  might  lead  to  some   delays.     Lambda  architecture  assures   fault  tolerance  not  only  against   hardware  failures  but  against   human  errors.  Re-­‐computation   of  views  every  time  from  raw   data  in  batch  layer,  insures  that   any  human  errors  in  business   logic  would  not  be  cascaded  to  a   level  where  it’s  unrecoverable.     Ease  of  business  users   Data  is  stored  in  raw  format,   Data  is  processed  and  available  
  • 8. 7     with  data  definitions  and   sometime  groomed  to  make   digestible  by  data  management   tools.  At  times,  it  difficult  for   business  users  to  use  data  in  as-­‐ is  conditions.   from  Serving  makes  life  easy  for   business  users.   Accuracy  for  real  time  results   Irrespective  of  any  scenario,   users  accessing  data  from  Data   Lake  has  access  to  immutable   raw  data,  they  can  do  exact   computations,  thereby  always   get  the  accurate  results.   In  scenarios,  where  real  time   calculations  need  to  access   historical  data,  which  is  not   possible,  Lambda  architecture   would  return  you  estimated   results.  For  example,  calculation   of  mean  value,  cannot  be   achieved  until  whole  historical   data  and  real  time  data  is   referenced  at  one  go.  In  such  a   scenario,  serving  layer  would   return  estimated  results.   Infrastructure  Cost   Data  lake  architecture  process   the  data  as  and  when  need  and   thereby  the  cluster  cost  can  be   much  less  as  compared  to   Lambda.  Moreover,  it  only   persist  the  raw  data  however   Lambda  architecture  not  only   persist  the  raw  data  but   processed  data  too.  This  leads   to  extra  storage  cost  in  Lambda   architecture.   Lambda  architecture  data   processing  life  cycle  is  designed   in  such  a  fashion  that  as  soon   the  one  cycle  of  batch  process  is   finished,  it  starts  a  new  cycle  of   batch  processing  which  includes   the  recently  inserted  data.   Simultaneously,  the  speed  layer   is  always  processing  the  real   time  data.   OLAP   Unlike  data  marts,  which  are   optimized  for  data  analysis  by   storing  only  some  attributes  and   dropping  data  below  the  level   aggregation,  a  data  lake  is   designed  to  retain  all  attributes,   especially  so  when  you  do  not   yet  know  what  the  scope  of   data  or  its  use  will  be.  
Lambda: As Lambda exposes only the processed views from the serving layer, not all attributes of the data are available to the Data Scientist for running ad-hoc analytical queries.

Historical data reference for processing
Data Lake: OLAP and OLTP queries access the raw or groomed data directly from the data lake, making it feasible to access and refer to historical data while processing data for a given time interval.
Lambda: The speed layer has no reference to the historical data stored in the batch layer, making it difficult to run queries that refer to historical data. For example, 'unique count' queries cannot return correct results from the speed layer alone. However, 'calculating average' queries can be computed easily on the serving layer, by combining the results returned from the speed and batch layers on the fly.

Slowly changing dimensions
Data Lake: Although the data lake has records of changed dimension attributes, extra business logic needs to be written by business users to cater for them.
Lambda: Lambda architecture can easily cater for slowly changing dimensions by creating surrogate keys parallel to natural keys whenever a change in dimension attributes is detected during the batch layer processing cycle.

Slowly changing facts
Data Lake: Both versions of a fact are available for users to look at; this leads to good analytical results when the fact life cycle is an attribute in the business logic for data analytics.
Lambda: Although it is easy to change facts in the Lambda architecture, doing so loses the fact life-cycle information. As the previous state of a slowly changing fact is not available to the Data Scientist, analytical queries might not give the desired results on the views exposed by the serving layer.

Frequently changing business logic
Data Lake: Changes to the processing code need to be made, but there is no clear solution for how the historically processed data should be handled.
Lambda: As data is reprocessed from scratch, the historical-data problem is resolved automatically even if the business logic changes frequently.

Implementation lifecycle
Data Lake: Fast to implement, as it eliminates the dependency on upfront data modeling.
Lambda: Processing logic needs to be implemented at both the batch and speed layers, leading to significant implementation time compared to a data lake.

Adding new data sources
Data Lake: Very easy to add.
Lambda: New sources need to be incorporated into the processing layers and require code changes.
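The 'calculating average' point can be made concrete. If both layers expose partial aggregates (a sum and a count) rather than the average itself, the serving layer can merge them exactly on the fly. A minimal sketch, in which the view structure and function name are illustrative rather than taken from any particular framework:

```python
def merge_average(batch_view, speed_view):
    """Merge partial aggregates from the batch and speed layers.

    Each view carries a sum and a count rather than a precomputed
    average, so the average over the full dataset is exact."""
    total = batch_view["sum"] + speed_view["sum"]
    count = batch_view["count"] + speed_view["count"]
    return total / count if count else 0.0

# The batch view covers historical data; the speed view covers only
# the delta that has arrived since the last batch recomputation.
batch = {"sum": 9000.0, "count": 100}
speed = {"sum": 1100.0, "count": 10}
print(merge_average(batch, speed))  # (9000 + 1100) / (100 + 10)
```

By contrast, a 'unique count' cannot be merged this way: distinct counts from the two layers may overlap, so a correct merge would need the underlying value sets (or a mergeable sketch such as HyperLogLog), which is why the speed layer alone cannot answer such queries correctly.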
"If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
By: James Dixon (Pentaho CTO)

Data Lake Architecture

Much of today's research and decision making is based on knowledge and insight gained from analyzing and contextualizing the vast (and growing) amount of "open" or "raw" data. The idea that the large number of data sources available today enables analyses on combinations of heterogeneous information that would not be achievable via "siloed" data maintained in warehouses is very powerful. The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.

A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop. Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed.
By delaying the categorization of data from the point of entry to the point of use, analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when structured), as well as via alternate means such as indexing and NoSQL derivatives.

Key Features
• Stores raw data – single source of truth
• Data accessible to anyone authorized
• Polyglot persistence
• Supports multiple applications and workloads
• Low-cost, high-performance storage
• Flexible, easy-to-use data organization
• Self-service for end users
• More flexible in answering new questions
• Easy to add new data sources
• Loosely coupled architecture – enables flexibility of analysis
• Eliminates the dependency on upfront data modeling – thereby fast to implement
• Storage is highly optimized, as raw data is stored

Disadvantages
• High latency for a composite analysis view combining real-time and historical data
• Raw data lacks relational structure, which makes on-the-fly business analysis unfriendly
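The schema-on-read idea described above (deferring categorization from the point of entry to the point of use) can be sketched in a few lines. The records, field names, and schemas here are hypothetical:

```python
import json

# Raw events land in the lake exactly as received; no schema is
# enforced at write time (schema-on-read).
raw_records = [
    '{"user": "u1", "ts": "2015-06-05T10:00:00", "amount": "42.50", "city": "Pune"}',
    '{"user": "u2", "ts": "2015-06-05T10:05:00", "amount": "13.00"}',
]

def read_with_schema(records, schema):
    """Apply a schema only at query time: project fields and cast types.
    Fields missing from a record become None instead of failing ingestion."""
    for line in records:
        rec = json.loads(line)
        yield {field: (cast(rec[field]) if field in rec else None)
               for field, cast in schema.items()}

# Two consumers project different schemas onto the same raw data.
billing_schema = {"user": str, "amount": float}
geo_schema = {"user": str, "city": str}

print(list(read_with_schema(raw_records, billing_schema)))
```

The same raw records serve both consumers; each decides, at the point of use, which attributes matter and how they are typed.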
In a practical sense, a data lake is characterized by three key attributes:

• Collect everything: A data lake contains all data, both raw sources retained over extended periods of time and any processed data.
• Dive in anywhere: A data lake enables users across multiple business units to refine, explore and enrich data on their own terms.
• Flexible access: A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
Generic Data lake Architecture

[Figure: Generic data lake architecture. Data sources (desktop and mobile, social media and cloud, operational systems, Internet of Things) feed an ingestion tier through real-time, micro-batch, and mega-batch paths. Raw data lands in HDFS storage (unstructured and structured); grooming, driven by schematic metadata, produces processed data. A unified data management tier (data management, data access) and a processing tier (in-memory, MapReduce/Hive/MPP) with workflow management sit on top of storage, alongside centralized management (system monitoring, system management). A query interface (SQL, NoSQL, external storage) exposes real-time, interactive, and batch insights as well as flexible actions.]
Steps Involved
• Procuring data – the overall process of obtaining data and metadata and preparing them for eventual inclusion in a data lake.
• Obtaining data – physically transferring the data from the source into the data lake.
• Describing data – a data scientist searching a data lake must be able to find the data relevant to his or her need, and for that they require metadata about the data. Schematic metadata for a data set includes information about how the data is formatted and about its schema.
• Grooming data – raw data is usually made consumable by analytics applications as-is; in some scenarios, however, a grooming process uses the schematic metadata to transform raw data into data that can be processed by standard data management tools.
• Provisioning data – the authentication and authorization policies by which consumers take data out of the data lake.
• Preserving data – managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissioning and renewal.
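The describing and grooming steps fit together roughly as follows: the schematic metadata records a data set's format and schema, and grooming uses it to turn raw lines into typed records. All data set names, fields, and formats below are illustrative, not from any specific tool:

```python
# Hypothetical schematic metadata for one data set in the lake: it
# records how the raw files are formatted and what the schema is.
schematic_metadata = {
    "dataset": "web_clicks_2015_06",
    "format": {"type": "delimited", "delimiter": "|"},
    "schema": [("user_id", int), ("page", str), ("duration_ms", int)],
}

def groom(raw_lines, meta):
    """Grooming: use the schematic metadata to transform raw text into
    records that standard data-management tools can process."""
    delim = meta["format"]["delimiter"]
    out = []
    for line in raw_lines:
        parts = line.rstrip("\n").split(delim)
        out.append({name: cast(value)
                    for (name, cast), value in zip(meta["schema"], parts)})
    return out

raw = ["101|/home|250", "102|/pricing|1300"]
print(groom(raw, schematic_metadata))
```

In this sketch the raw files stay in the lake untouched; grooming produces a parallel, typed copy, matching the raw/groomed split shown in the architecture diagram.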
"Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation."

Lambda Architecture

The Lambda architecture is split into three layers: the batch layer, the serving layer, and the speed layer.

1. Batch layer (Apache Hadoop)
2. Serving layer (Cloudera Impala, Spark)
3. Speed layer (Storm, Spark, Apache HBase, Cassandra)

Key Features
• Low-latency simultaneous analysis of the (near) real-time information extracted from a continuous inflow of data, combined with persistent analysis of a massive volume of historical data
• Fault tolerant not only against hardware failure but against human error too
• Mistakes are corrected by re-computation
• Storage is highly optimized, as raw data is stored
Batch Layer

The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation, so when new data arrives it will be aggregated into the views when they are recomputed during the next MapReduce iteration.

The views should be computed from the entire dataset, and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours.

Serving layer

The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. However, the batch and serving layers alone do not satisfy any realtime requirement, because MapReduce is (by design) high latency and it could take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.

Speed layer

In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer, and it does this by computing realtime views in Storm. The realtime views contain only the delta results needed to supplement the batch views.

Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received.
Disadvantages
• Maintaining two copies of code that must produce the same result in two complex distributed systems
• Could return estimated or approximate results
• Expensive full recomputation is required for fault tolerance
• Requires high cluster uptime, as batch data needs to be processed continuously
• Requires more implementation time, as duplicate code needs to be written in separate technologies to process real-time and batch data
• Time taken to process the batch grows linearly with the size of the master dataset

What's clever about the speed layer is that the realtime views are intended to be transient: as soon as the data propagates through the batch and serving layers, the corresponding results in the
realtime views can be discarded. This is referred to as "complexity isolation", meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary.

[Figure: timeline of views. Successive batch views cover historical intervals up to the last batch run; realtime views cover only the window from the last batch run to now, and are discarded once the data they contain is represented in a batch view.]
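The batch/speed split and the transient nature of the realtime views can be sketched as follows. The view here is a simple per-page event count, and all names are illustrative; a real deployment would use HDFS/MapReduce and Storm rather than in-process lists and dictionaries:

```python
# The master dataset is immutable and append-only; the batch view is
# always recomputed from scratch, while the realtime view is
# incremented per event and discarded after each batch run.
master_dataset = []   # immutable, constantly growing log
realtime_view = {}    # transient delta since the last batch run

def batch_recompute(events):
    """Batch layer: recompute the whole view from the entire dataset."""
    view = {}
    for e in events:
        view[e["page"]] = view.get(e["page"], 0) + 1
    return view

def speed_ingest(event):
    """Speed layer: incremental update as each event arrives."""
    master_dataset.append(event)
    realtime_view[event["page"]] = realtime_view.get(event["page"], 0) + 1

def query(page, batch_view):
    """Serving layer: merge the batch view with the realtime delta."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

for p in ["/home", "/home", "/pricing"]:
    speed_ingest({"page": p})
batch_view = batch_recompute(master_dataset)
realtime_view.clear()   # complexity isolation: the delta is discarded
print(query("/home", batch_view))  # 2
```

Note that clearing the realtime view loses nothing: the same events are already in the master dataset and therefore represented in the recomputed batch view.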
Generic Lambda Architecture

[Figure: Generic Lambda architecture. Incoming data streams are delivered both to the batch layer, where all data is stored in HDFS and batch precomputation (MapReduce/Hive/Pig) produces batch views and summarized data, and to the speed layer, where stream processing (Storm or Spark) increments near-real-time views through stream summarization. The serving layer, with its data management and access component, holds the precomputed views; queries are answered by merging batch views with realtime views.]
References

http://www.ibm.com/developerworks/library/bd-archpatterns1/
http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf
https://en.wikipedia.org/wiki/Lambda_architecture
http://voltdb.com/blog/simplifying-complex-lambda-architecture
http://en.wiktionary.org/wiki/data_lake