SlideShare a Scribd company logo
1 of 40
Download to read offline
1	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Real-­‐&me	
  Learning	
  with	
  Hadoop	
  
The	
  “λ	
  +	
  ε”	
  architecture	
  
2	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
§  Contact:	
  
–  tdunning@maprtech.com	
  
–  @ted_dunning	
  
§  Slides	
  and	
  such	
  (available	
  late	
  tonight):	
  
–  hEp://slideshare.net/tdunning	
  
§  Hash	
  tags:	
  #mapr	
  #storm	
  
	
  	
  
3	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
The	
  Challenge	
  
§  Hadoop	
  is	
  great	
  of	
  processing	
  vats	
  of	
  data	
  
–  But	
  sucks	
  for	
  real-­‐6me	
  (by	
  design!)	
  
	
  
§  Storm	
  is	
  great	
  for	
  real-­‐6me	
  processing	
  
–  But	
  lacks	
  any	
  way	
  to	
  deal	
  with	
  batch	
  processing	
  
§  It	
  sounds	
  like	
  there	
  isn’t	
  a	
  solu6on	
  
–  Neither	
  fashionable	
  solu6on	
  handles	
  everything	
  
4	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
This	
  is	
  not	
  a	
  problem.	
  
	
  It’s	
  an	
  opportunity!	
  
5	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
t	
  
now	
  
Hadoop	
  is	
  Not	
  Very	
  Real-­‐&me	
  
Unprocessed
Data	
  
Fully	
  
processed	
  
Latest	
  full	
  
period	
  
Hadoop	
  job	
  
takes	
  this	
  
long	
  for	
  this	
  
data	
  
6	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
t	
  
now	
  
Hadoop	
  works	
  
great	
  back	
  here	
  
Storm	
  
works	
  
here	
  
Real-­‐&me	
  and	
  Long-­‐&me	
  together	
  
Blended	
  
view	
  
Blended	
  
view	
  
Blended	
  
View	
  
7	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
An	
  Example	
  
8	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
The	
  Same	
  Problem	
  
9	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
What	
  Does	
  the	
  Lambda	
  Architecture	
  Do?	
  
§  The	
  idea	
  is	
  that	
  we	
  want	
  to	
  compute	
  a	
  func6on	
  of	
  a	
  all	
  history	
  up	
  
to	
  6me	
  n	
  
§  In	
  order	
  to	
  get	
  real-­‐6me	
  response,	
  we	
  divide	
  this	
  into	
  two	
  parts	
  
§  Where	
  the	
  addi6on	
  may	
  not	
  really	
  be	
  addi6on	
  
§  The	
  idea	
  is	
  that	
  if	
  we	
  lose	
  the	
  history	
  from	
  m+1	
  un6l	
  n,	
  things	
  get	
  
beEer	
  soon	
  enough	
  
f x1...xn( )
f x1...xm...xn( )= f x1...xm( )+ f xm+1...xn( )
10	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Can	
  We	
  Do	
  BeLer?	
  
§  Can	
  we	
  minimize	
  or	
  avoid	
  failure	
  transients?	
  
§  Can	
  we	
  guarantee	
  precise	
  boundaries?	
  
§  Can	
  we	
  synchronize	
  computa6ons	
  accurately?	
  
11	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Alterna&ve	
  without	
  Lambda	
  
Search	
  
Engine	
  
NoSql	
  
de	
  Jour	
  
Consumer	
  
Real-­‐6me	
   Long-­‐6me	
  
?	
  
12	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Problems	
  
§  Simply	
  dumping	
  into	
  noSql	
  engine	
  doesn’t	
  quite	
  work	
  
§  Insert	
  rate	
  is	
  limited	
  
§  No	
  load	
  isola6on	
  
–  Big	
  retrospec6ve	
  jobs	
  kill	
  real-­‐6me	
  
§  Low	
  scan	
  performance	
  
–  Hbase	
  preEy	
  good,	
  but	
  not	
  stellar	
  
§  Difficult	
  to	
  set	
  boundaries	
  
–  where	
  does	
  real-­‐6me	
  end	
  and	
  long-­‐6me	
  begin?	
  
13	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Almost	
  a	
  Solu&on	
  
§  Lambda	
  architecture	
  talks	
  about	
  func6on	
  of	
  long-­‐6me	
  state	
  
–  Real-­‐6me	
  approximate	
  accelerator	
  adjusts	
  previous	
  result	
  to	
  current	
  state	
  
§  Sounds	
  good,	
  but	
  …	
  
–  How	
  does	
  the	
  real-­‐6me	
  accelerator	
  combine	
  with	
  long-­‐6me?	
  
–  What	
  algorithms	
  can	
  do	
  this?	
  
–  How	
  can	
  we	
  avoid	
  gaps	
  and	
  overlaps	
  and	
  other	
  errors?	
  
§  Needs	
  more	
  work	
  
§  We	
  need	
  a	
  “λ	
  +	
  ε”	
  architecture	
  !	
  
14	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
A	
  Simple	
  Example	
  
§  Let’s	
  start	
  with	
  the	
  simplest	
  case	
  …	
  coun6ng	
  
§  Coun6ng	
  =	
  addi6on	
  
–  Addi6on	
  is	
  associa6ve	
  
–  Addi6on	
  is	
  on-­‐line	
  
–  We	
  can	
  generalize	
  these	
  results	
  to	
  all	
  associa6ve,	
  on-­‐line	
  func6ons	
  
–  But	
  let’s	
  start	
  simple	
  
15	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Data	
  
Sources	
  
Catcher	
  
Cluster	
  
Rough	
  Design	
  –	
  Data	
  Flow	
  
Catcher	
  
Cluster	
  
Query	
  Event	
  
Spout	
  
Logger	
  
Bolt	
  
Counter	
  
Bolt	
  
Raw	
  
Logs	
  
Logger	
  
Bolt	
  
Semi	
  
Agg	
  
Hadoop	
  
Aggregator	
  
Snap	
  
Long	
  
agg	
  
ProtoSpout	
  
Counter	
  
Bolt	
  
Logger	
  
Bolt	
  
Data	
  
Sources	
  
16	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Closer	
  Look	
  –	
  Catcher	
  Protocol	
  
Data	
  
Sources	
  
Catcher	
  
Cluster	
  
Catcher	
  
Cluster	
  
Data	
  
Sources	
  
The	
  data	
  sources	
  and	
  catchers	
  
communicate	
  with	
  a	
  very	
  simple	
  
protocol.	
  
	
  
Hello()	
  =>	
  list	
  of	
  catchers	
  
Log(topic,message)	
  =>	
  	
  
	
  	
  	
  	
  (OK|FAIL,	
  redirect-­‐to-­‐catcher)	
  
17	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Closer	
  Look	
  –	
  Catcher	
  Queues	
  
Catcher	
  
Cluster	
  
Catcher	
  
Cluster	
  
The	
  catchers	
  forward	
  log	
  requests	
  
to	
  the	
  correct	
  catcher	
  and	
  return	
  
that	
  host	
  in	
  the	
  reply	
  to	
  allow	
  the	
  
client	
  to	
  avoid	
  the	
  extra	
  hop.	
  
	
  
Each	
  topic	
  file	
  is	
  appended	
  by	
  
exactly	
  one	
  catcher.	
  
	
  
Topic	
  files	
  are	
  kept	
  in	
  shared	
  file	
  
storage.	
  
Topic	
  
File	
  
Topic	
  
File	
  
18	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Closer	
  Look	
  –	
  ProtoSpout	
  
The	
  ProtoSpout	
  tails	
  the	
  topic	
  files,	
  
parses	
  log	
  records	
  into	
  tuples	
  and	
  
injects	
  them	
  into	
  the	
  Storm	
  
topology.	
  
	
  
Last	
  fully	
  acked	
  posi6on	
  stored	
  in	
  
shared,	
  transac6onally	
  correct	
  file	
  
system.	
  
Topic	
  
File	
  
Topic	
  
File	
  
ProtoSpout	
  
19	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Closer	
  Look	
  –	
  Counter	
  Bolt	
  
§  Cri6cal	
  design	
  goals:	
  
–  fast	
  ack	
  for	
  all	
  tuples	
  
–  fast	
  restart	
  of	
  counter	
  
§  Ack	
  happens	
  when	
  tuple	
  hits	
  the	
  replay	
  log	
  (10’s	
  of	
  milliseconds,	
  
group	
  commit)	
  
§  Restart	
  involves	
  replaying	
  semi-­‐agg’s	
  +	
  replay	
  log	
  (very	
  fast)	
  
§  Replay	
  log	
  only	
  lasts	
  un6l	
  next	
  semi-­‐aggregate	
  goes	
  out	
  
Counter	
  
Bolt	
  
Replay	
  
Log	
  
Semi-­‐
aggregated	
  
records	
  
Incoming	
  
records	
  
Real-­‐6me	
   Long-­‐6me	
  
20	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
A	
  Frozen	
  Moment	
  in	
  Time	
  
§  Snapshot	
  defines	
  the	
  dividing	
  line	
  
§  All	
  data	
  in	
  the	
  snap	
  is	
  long-­‐6me,	
  all	
  
awer	
  is	
  real-­‐6me	
  
§  Semi-­‐agg	
  strategy	
  allows	
  clean	
  
combina6on	
  of	
  both	
  kinds	
  of	
  data	
  
§  Data	
  synchronized	
  snap	
  not	
  
needed	
  (if	
  the	
  snap	
  is	
  really	
  a	
  snap)	
  
Semi	
  
Agg	
  
Hadoop	
  
Aggregator	
  
Snap	
  
Long	
  
agg	
  
21	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Guarantees	
  
§  Counter	
  output	
  volume	
  is	
  small-­‐ish	
  
–  the	
  greater	
  of	
  k	
  tuples	
  per	
  100K	
  inputs	
  or	
  k	
  tuple/s	
  
–  1	
  tuple/s/label/bolt	
  for	
  this	
  exercise	
  
§  Persistence	
  layer	
  must	
  provide	
  guarantees	
  
–  distributed	
  against	
  node	
  failure	
  
–  must	
  have	
  either	
  readable	
  flush	
  or	
  closed-­‐append	
  
§  HDFS	
  is	
  distributed,	
  but	
  provides	
  no	
  guarantees	
  and	
  strange	
  
seman6cs	
  
§  MapRfs	
  is	
  distributed,	
  provides	
  all	
  necessary	
  guarantees	
  
22	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Presenta&on	
  Layer	
  
§  Presenta6on	
  must	
  
–  read	
  recent	
  output	
  of	
  Logger	
  bolt	
  
–  read	
  relevant	
  output	
  of	
  Hadoop	
  jobs	
  
–  combine	
  semi-­‐aggregated	
  records	
  
§  User	
  will	
  see	
  
–  counts	
  that	
  increment	
  within	
  0-­‐2	
  s	
  of	
  events	
  
–  seamless	
  and	
  accurate	
  meld	
  of	
  short	
  and	
  long-­‐term	
  data	
  
23	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
The	
  Basic	
  Idea	
  
§  Online	
  algorithms	
  generally	
  have	
  rela6vely	
  small	
  state	
  (like	
  
coun6ng)	
  
§  Online	
  algorithms	
  generally	
  have	
  a	
  simple	
  update	
  (like	
  coun6ng)	
  
§  If	
  we	
  can	
  do	
  this	
  with	
  coun6ng,	
  we	
  can	
  do	
  it	
  with	
  all	
  kinds	
  of	
  
algorithms	
  
24	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Summary	
  –	
  Part	
  1	
  
§  Semi-­‐agg	
  strategy	
  +	
  snapshots	
  allows	
  correct	
  real-­‐6me	
  counts	
  
–  because	
  addi6on	
  is	
  on-­‐line	
  and	
  associa6ve	
  
§  Other	
  on-­‐line	
  associa6ve	
  opera6ons	
  include:	
  
–  k-­‐means	
  clustering	
  (see	
  Dan	
  Filimon’s	
  talk	
  at	
  16.)	
  
–  count	
  dis6nct	
  (see	
  hyper-­‐log-­‐log	
  counters	
  from	
  streamlib	
  or	
  kmv	
  from	
  
Brickhouse)	
  
–  top-­‐k	
  values	
  
–  top-­‐k	
  (count(*))	
  (see	
  streamlib)	
  
–  contextual	
  Bayesian	
  bandits	
  (see	
  part	
  2	
  of	
  this	
  talk)	
  
25	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Example	
  2	
  –	
  AB	
  tes&ng	
  in	
  real-­‐&me	
  
§  I	
  have	
  15	
  versions	
  of	
  my	
  landing	
  page	
  
§  Each	
  visitor	
  is	
  assigned	
  to	
  a	
  version	
  
–  Which	
  version?	
  
§  A	
  conversion	
  or	
  sale	
  or	
  whatever	
  can	
  happen	
  
–  How	
  long	
  to	
  wait?	
  
§  Some	
  versions	
  of	
  the	
  landing	
  page	
  are	
  horrible	
  
–  Don’t	
  want	
  to	
  give	
  them	
  traffic	
  
26	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
A	
  Quick	
  Diversion	
  
§  You	
  see	
  a	
  coin	
  
–  What	
  is	
  the	
  probability	
  of	
  heads?	
  
–  Could	
  it	
  be	
  larger	
  or	
  smaller	
  than	
  that?	
  
§  I	
  flip	
  the	
  coin	
  and	
  while	
  it	
  is	
  in	
  the	
  air	
  ask	
  again	
  
§  I	
  catch	
  the	
  coin	
  and	
  ask	
  again	
  
§  I	
  look	
  at	
  the	
  coin	
  (and	
  you	
  don’t)	
  and	
  ask	
  again	
  
§  Why	
  does	
  the	
  answer	
  change?	
  
–  And	
  did	
  it	
  ever	
  have	
  a	
  single	
  value?	
  
27	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
A	
  Philosophical	
  Conclusion	
  
§  Probability	
  as	
  expressed	
  by	
  humans	
  is	
  subjec6ve	
  and	
  depends	
  on	
  
informa6on	
  and	
  experience	
  
28	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
A	
  Prac&cal	
  Applica&on	
  
29	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
I	
  Dunno	
  
0 0.2 0.4 0.6 0.8 1
p
Prob(p)
30	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
5	
  heads	
  out	
  of	
  10	
  throws	
  
0 0.2 0.4 0.6 0.8 1
p
Prob(p)
31	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
2	
  heads	
  out	
  of	
  12	
  throws	
  
0 0.2 0.4 0.6 0.8 1
p
Prob(p)
Mean	
  
Using	
  any	
  single	
  number	
  as	
  a	
  “best”	
  
es6mate	
  denies	
  the	
  uncertain	
  nature	
  of	
  
a	
  distribu6on	
  
Adding	
  confidence	
  bounds	
  s6ll	
  loses	
  most	
  of	
  
the	
  informa6on	
  in	
  the	
  distribu6on	
  and	
  
prevents	
  good	
  modeling	
  of	
  the	
  tails	
  
32	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Bayesian	
  Bandit	
  
§  Compute	
  distribu6ons	
  based	
  on	
  data	
  
§  Sample	
  p1	
  and	
  p2	
  from	
  these	
  distribu6ons	
  
§  Put	
  a	
  coin	
  in	
  bandit	
  1	
  if	
  p1	
  >	
  p2	
  
§  Else,	
  put	
  the	
  coin	
  in	
  bandit	
  2	
  
33	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
And	
  it	
  works!	
  
11000 100 200 300 400 500 600 700 800 900 1000
0.12
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0.11
n
regret
ε-greedy, ε = 0.05
Bayesian Bandit with Gamma-Normal
34	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Video	
  Demo	
  
35	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
The	
  Code	
  
§  Select	
  an	
  alterna6ve	
  
§  Select	
  and	
  learn	
  
§  But	
  we	
  already	
  know	
  how	
  to	
  count!	
  
n = dim(k)[1]!
p0 = rep(0, length.out=n)!
for (i in 1:n) {!
p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)!
}!
return (which(p0 == max(p0)))!
for (z in 1:steps) {!
i = select(k)!
j = test(i)!
k[i,j] = k[i,j]+1!
}!
return (k)!
36	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
The	
  Basic	
  Idea	
  
§  We	
  can	
  encode	
  a	
  distribu6on	
  by	
  sampling	
  
§  Sampling	
  allows	
  unifica6on	
  of	
  explora6on	
  and	
  exploita6on	
  
§  Can	
  be	
  extended	
  to	
  more	
  general	
  response	
  models	
  
§  Note	
  that	
  learning	
  here	
  =	
  coun6ng	
  =	
  on-­‐line	
  algorithm	
  
37	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Generalized	
  Banditry	
  
§  Suppose	
  we	
  have	
  an	
  infinite	
  number	
  of	
  bandits	
  
–  suppose	
  they	
  are	
  each	
  labeled	
  by	
  two	
  real	
  numbers	
  x	
  and	
  y	
  in	
  [0,1]	
  
–  also	
  that	
  expected	
  payoff	
  is	
  a	
  parameterized	
  func6on	
  of	
  x	
  and	
  y	
  
–  now	
  assume	
  a	
  distribu6on	
  for	
  θ	
  that	
  we	
  can	
  learn	
  online	
  
§  Selec6on	
  works	
  by	
  sampling	
  θ,	
  then	
  compu6ng	
  f	
  
§  Learning	
  works	
  by	
  propaga6ng	
  updates	
  back	
  to	
  θ	
  
–  If	
  f	
  is	
  linear,	
  this	
  is	
  very	
  easy	
  
§  Don’t	
  just	
  have	
  to	
  have	
  two	
  labels,	
  could	
  have	
  labels	
  and	
  context	
  
	
  
E z[ ]= f (x, y |θ)
38	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Caveats	
  
§  Original	
  Bayesian	
  Bandit	
  only	
  requires	
  real-­‐6me	
  
§  Generalized	
  Bandit	
  may	
  require	
  access	
  to	
  long	
  history	
  for	
  learning	
  
–  Pseudo	
  online	
  learning	
  may	
  be	
  easier	
  than	
  true	
  online	
  
§  Bandit	
  variables	
  can	
  include	
  content,	
  6me	
  of	
  day,	
  day	
  of	
  week	
  
§  Context	
  variables	
  can	
  include	
  user	
  id,	
  user	
  features	
  
§  Bandit	
  ×	
  context	
  variables	
  provide	
  the	
  real	
  power	
  
39	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
§  Contact:	
  
–  tdunning@maprtech.com	
  
–  @ted_dunning	
  
§  Slides	
  and	
  such	
  (just	
  don’t	
  believe	
  the	
  metrics):	
  
–  hEp://slideshare.net/tdunning	
  
§  Hash	
  tags:	
  #mapr	
  #storm	
  
	
  	
  
40	
  ©MapR	
  Technologies	
  -­‐	
  Confiden6al	
  
Thank	
  You	
  

More Related Content

What's hot

What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012Ted Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesTed Dunning
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsDatabricks
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07Ted Dunning
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache MahoutTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Mahout and Recommendations
Mahout and RecommendationsMahout and Recommendations
Mahout and RecommendationsTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Codemotion
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsMapR Technologies
 

What's hot (19)

What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Strata new-york-2012
Strata new-york-2012Strata new-york-2012
Strata new-york-2012
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Cognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approachesCognitive computing with big data, high tech and low tech approaches
Cognitive computing with big data, high tech and low tech approaches
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 
What's new in Apache Mahout
What's new in Apache MahoutWhat's new in Apache Mahout
What's new in Apache Mahout
 
GoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 SkinnedGoTo Amsterdam 2013 Skinned
GoTo Amsterdam 2013 Skinned
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Mahout and Recommendations
Mahout and RecommendationsMahout and Recommendations
Mahout and Recommendations
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for Genomics
 

Similar to Real-Time Learning with Hadoop's Lambda Architecture

Buzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningBuzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningMapR Technologies
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time togetherTed Dunning
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceTed Dunning
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012MapR Technologies
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceMapR Technologies
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Missioninside-BigData.com
 
Slices Of Performance in Java - Oleksandr Bodnar
Slices Of Performance in Java - Oleksandr BodnarSlices Of Performance in Java - Oleksandr Bodnar
Slices Of Performance in Java - Oleksandr BodnarGlobalLogic Ukraine
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O Sri Ambati
 
C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Compu...
C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Compu...C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Compu...
C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Compu...DataStax Academy
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning ClusteringMapR Technologies
 
Deploying Large Spark Models to production and model scoring in near real time
Deploying Large Spark Models to production and model scoring in near real timeDeploying Large Spark Models to production and model scoring in near real time
Deploying Large Spark Models to production and model scoring in near real timesubhojit banerjee
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaJosef Niedermeier
 
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...DataStax
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Flink Forward
 

Similar to Real-Time Learning with Hadoop's Lambda Architecture (20)

Buzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time LearningBuzz Words Dunning Real-Time Learning
Buzz Words Dunning Real-Time Learning
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Cmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop PerformanceCmu Lecture on Hadoop Performance
Cmu Lecture on Hadoop Performance
 
New directions for mahout
New directions for mahoutNew directions for mahout
New directions for mahout
 
Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012Boston Hug by Ted Dunning 2012
Boston Hug by Ted Dunning 2012
 
CMU Lecture on Hadoop Performance
CMU Lecture on Hadoop PerformanceCMU Lecture on Hadoop Performance
CMU Lecture on Hadoop Performance
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
The Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a MissionThe Sierra Supercomputer: Science and Technology on a Mission
The Sierra Supercomputer: Science and Technology on a Mission
 
Slices Of Performance in Java - Oleksandr Bodnar
Slices Of Performance in Java - Oleksandr BodnarSlices Of Performance in Java - Oleksandr Bodnar
Slices Of Performance in Java - Oleksandr Bodnar
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Compu...
C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Compu...C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Compu...
C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Compu...
 
Graphlab Ted Dunning Clustering
Graphlab Ted Dunning  ClusteringGraphlab Ted Dunning  Clustering
Graphlab Ted Dunning Clustering
 
Deploying Large Spark Models to production and model scoring in near real time
Deploying Large Spark Models to production and model scoring in near real timeDeploying Large Spark Models to production and model scoring in near real time
Deploying Large Spark Models to production and model scoring in near real time
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and Vertica
 
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
 
London hug
London hugLondon hug
London hug
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
 

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation TechnTed Dunning
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Ted Dunning
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveTed Dunning
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really MatterTed Dunning
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesTed Dunning
 

More from Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
Recommendation Techn
Recommendation TechnRecommendation Techn
Recommendation Techn
 
Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0Possible Visions for Mahout 1.0
Possible Visions for Mahout 1.0
 
My talk about recommendation and search to the Hive
My talk about recommendation and search to the HiveMy talk about recommendation and search to the Hive
My talk about recommendation and search to the Hive
 
Which Algorithms Really Matter
Which Algorithms Really MatterWhich Algorithms Really Matter
Which Algorithms Really Matter
 
Building multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search enginesBuilding multi-modal recommendation engines using search engines
Building multi-modal recommendation engines using search engines
 

Real-Time Learning with Hadoop's Lambda Architecture

  • 1. 1  ©MapR  Technologies  -­‐  Confiden6al   Real-­‐&me  Learning  with  Hadoop   The  “λ  +  ε”  architecture  
  • 2. 2  ©MapR  Technologies  -­‐  Confiden6al   §  Contact:   –  tdunning@maprtech.com   –  @ted_dunning   §  Slides  and  such  (available  late  tonight):   –  hEp://slideshare.net/tdunning   §  Hash  tags:  #mapr  #storm      
  • 3. 3  ©MapR  Technologies  -­‐  Confiden6al   The  Challenge   §  Hadoop  is  great  of  processing  vats  of  data   –  But  sucks  for  real-­‐6me  (by  design!)     §  Storm  is  great  for  real-­‐6me  processing   –  But  lacks  any  way  to  deal  with  batch  processing   §  It  sounds  like  there  isn’t  a  solu6on   –  Neither  fashionable  solu6on  handles  everything  
  • 4. 4  ©MapR  Technologies  -­‐  Confiden6al   This  is  not  a  problem.    It’s  an  opportunity!  
  • 5. 5  ©MapR  Technologies  -­‐  Confiden6al   t   now   Hadoop  is  Not  Very  Real-­‐&me   Unprocessed Data   Fully   processed   Latest  full   period   Hadoop  job   takes  this   long  for  this   data  
  • 6. 6  ©MapR  Technologies  -­‐  Confiden6al   t   now   Hadoop  works   great  back  here   Storm   works   here   Real-­‐&me  and  Long-­‐&me  together   Blended   view   Blended   view   Blended   View  
  • 7. 7  ©MapR  Technologies  -­‐  Confiden6al   An  Example  
  • 8. 8  ©MapR  Technologies  -­‐  Confiden6al   The  Same  Problem  
  • 9. 9  ©MapR  Technologies  -­‐  Confiden6al   What  Does  the  Lambda  Architecture  Do?   §  The  idea  is  that  we  want  to  compute  a  func6on  of  a  all  history  up   to  6me  n   §  In  order  to  get  real-­‐6me  response,  we  divide  this  into  two  parts   §  Where  the  addi6on  may  not  really  be  addi6on   §  The  idea  is  that  if  we  lose  the  history  from  m+1  un6l  n,  things  get   beEer  soon  enough   f x1...xn( ) f x1...xm...xn( )= f x1...xm( )+ f xm+1...xn( )
  • 10. 10  ©MapR  Technologies  -­‐  Confiden6al   Can  We  Do  BeLer?   §  Can  we  minimize  or  avoid  failure  transients?   §  Can  we  guarantee  precise  boundaries?   §  Can  we  synchronize  computa6ons  accurately?  
  • 11. 11  ©MapR  Technologies  -­‐  Confiden6al   Alterna&ve  without  Lambda   Search   Engine   NoSql   de  Jour   Consumer   Real-­‐6me   Long-­‐6me   ?  
  • 12. 12  ©MapR  Technologies  -­‐  Confiden6al   Problems   §  Simply  dumping  into  noSql  engine  doesn’t  quite  work   §  Insert  rate  is  limited   §  No  load  isola6on   –  Big  retrospec6ve  jobs  kill  real-­‐6me   §  Low  scan  performance   –  Hbase  preEy  good,  but  not  stellar   §  Difficult  to  set  boundaries   –  where  does  real-­‐6me  end  and  long-­‐6me  begin?  
  • 13. 13  ©MapR  Technologies  -­‐  Confiden6al   Almost  a  Solu&on   §  Lambda  architecture  talks  about  func6on  of  long-­‐6me  state   –  Real-­‐6me  approximate  accelerator  adjusts  previous  result  to  current  state   §  Sounds  good,  but  …   –  How  does  the  real-­‐6me  accelerator  combine  with  long-­‐6me?   –  What  algorithms  can  do  this?   –  How  can  we  avoid  gaps  and  overlaps  and  other  errors?   §  Needs  more  work   §  We  need  a  “λ  +  ε”  architecture  !  
  • 14. 14  ©MapR  Technologies  -­‐  Confiden6al   A  Simple  Example   §  Let’s  start  with  the  simplest  case  …  coun6ng   §  Coun6ng  =  addi6on   –  Addi6on  is  associa6ve   –  Addi6on  is  on-­‐line   –  We  can  generalize  these  results  to  all  associa6ve,  on-­‐line  func6ons   –  But  let’s  start  simple  
  • 15. 15  ©MapR  Technologies  -­‐  Confiden6al   Data   Sources   Catcher   Cluster   Rough  Design  –  Data  Flow   Catcher   Cluster   Query  Event   Spout   Logger   Bolt   Counter   Bolt   Raw   Logs   Logger   Bolt   Semi   Agg   Hadoop   Aggregator   Snap   Long   agg   ProtoSpout   Counter   Bolt   Logger   Bolt   Data   Sources  
  • 16. 16  ©MapR  Technologies  -­‐  Confiden6al   Closer  Look  –  Catcher  Protocol   Data   Sources   Catcher   Cluster   Catcher   Cluster   Data   Sources   The  data  sources  and  catchers   communicate  with  a  very  simple   protocol.     Hello()  =>  list  of  catchers   Log(topic,message)  =>            (OK|FAIL,  redirect-­‐to-­‐catcher)  
  • 17. 17  ©MapR  Technologies  -­‐  Confiden6al   Closer  Look  –  Catcher  Queues   Catcher   Cluster   Catcher   Cluster   The  catchers  forward  log  requests   to  the  correct  catcher  and  return   that  host  in  the  reply  to  allow  the   client  to  avoid  the  extra  hop.     Each  topic  file  is  appended  by   exactly  one  catcher.     Topic  files  are  kept  in  shared  file   storage.   Topic   File   Topic   File  
  • 18. 18  ©MapR  Technologies  -­‐  Confiden6al   Closer  Look  –  ProtoSpout   The  ProtoSpout  tails  the  topic  files,   parses  log  records  into  tuples  and   injects  them  into  the  Storm   topology.     Last  fully  acked  posi6on  stored  in   shared,  transac6onally  correct  file   system.   Topic   File   Topic   File   ProtoSpout  
  • 19. 19  ©MapR  Technologies  -­‐  Confiden6al   Closer  Look  –  Counter  Bolt   §  Cri6cal  design  goals:   –  fast  ack  for  all  tuples   –  fast  restart  of  counter   §  Ack  happens  when  tuple  hits  the  replay  log  (10’s  of  milliseconds,   group  commit)   §  Restart  involves  replaying  semi-­‐agg’s  +  replay  log  (very  fast)   §  Replay  log  only  lasts  un6l  next  semi-­‐aggregate  goes  out   Counter   Bolt   Replay   Log   Semi-­‐ aggregated   records   Incoming   records   Real-­‐6me   Long-­‐6me  
  • 20. 20  ©MapR  Technologies  -­‐  Confiden6al   A  Frozen  Moment  in  Time   §  Snapshot  defines  the  dividing  line   §  All  data  in  the  snap  is  long-­‐6me,  all   awer  is  real-­‐6me   §  Semi-­‐agg  strategy  allows  clean   combina6on  of  both  kinds  of  data   §  Data  synchronized  snap  not   needed  (if  the  snap  is  really  a  snap)   Semi   Agg   Hadoop   Aggregator   Snap   Long   agg  
  • 21. 21  ©MapR  Technologies  -­‐  Confiden6al   Guarantees   §  Counter  output  volume  is  small-­‐ish   –  the  greater  of  k  tuples  per  100K  inputs  or  k  tuple/s   –  1  tuple/s/label/bolt  for  this  exercise   §  Persistence  layer  must  provide  guarantees   –  distributed  against  node  failure   –  must  have  either  readable  flush  or  closed-­‐append   §  HDFS  is  distributed,  but  provides  no  guarantees  and  strange   seman6cs   §  MapRfs  is  distributed,  provides  all  necessary  guarantees  
  • 22. 22  ©MapR  Technologies  -­‐  Confiden6al   Presenta&on  Layer   §  Presenta6on  must   –  read  recent  output  of  Logger  bolt   –  read  relevant  output  of  Hadoop  jobs   –  combine  semi-­‐aggregated  records   §  User  will  see   –  counts  that  increment  within  0-­‐2  s  of  events   –  seamless  and  accurate  meld  of  short  and  long-­‐term  data  
  • 23. 23  ©MapR  Technologies  -­‐  Confiden6al   The  Basic  Idea   §  Online  algorithms  generally  have  rela6vely  small  state  (like   coun6ng)   §  Online  algorithms  generally  have  a  simple  update  (like  coun6ng)   §  If  we  can  do  this  with  coun6ng,  we  can  do  it  with  all  kinds  of   algorithms  
  • 24. 24  ©MapR  Technologies  -­‐  Confiden6al   Summary  –  Part  1   §  Semi-­‐agg  strategy  +  snapshots  allows  correct  real-­‐6me  counts   –  because  addi6on  is  on-­‐line  and  associa6ve   §  Other  on-­‐line  associa6ve  opera6ons  include:   –  k-­‐means  clustering  (see  Dan  Filimon’s  talk  at  16.)   –  count  dis6nct  (see  hyper-­‐log-­‐log  counters  from  streamlib  or  kmv  from   Brickhouse)   –  top-­‐k  values   –  top-­‐k  (count(*))  (see  streamlib)   –  contextual  Bayesian  bandits  (see  part  2  of  this  talk)  
  • 25. 25  ©MapR  Technologies  -­‐  Confiden6al   Example  2  –  AB  tes&ng  in  real-­‐&me   §  I  have  15  versions  of  my  landing  page   §  Each  visitor  is  assigned  to  a  version   –  Which  version?   §  A  conversion  or  sale  or  whatever  can  happen   –  How  long  to  wait?   §  Some  versions  of  the  landing  page  are  horrible   –  Don’t  want  to  give  them  traffic  
  • 26. 26  ©MapR  Technologies  -­‐  Confiden6al   A  Quick  Diversion   §  You  see  a  coin   –  What  is  the  probability  of  heads?   –  Could  it  be  larger  or  smaller  than  that?   §  I  flip  the  coin  and  while  it  is  in  the  air  ask  again   §  I  catch  the  coin  and  ask  again   §  I  look  at  the  coin  (and  you  don’t)  and  ask  again   §  Why  does  the  answer  change?   –  And  did  it  ever  have  a  single  value?  
  • 27. 27  ©MapR  Technologies  -­‐  Confiden6al   A  Philosophical  Conclusion   §  Probability  as  expressed  by  humans  is  subjec6ve  and  depends  on   informa6on  and  experience  
  • 28. 28  ©MapR  Technologies  -­‐  Confiden6al   A  Prac&cal  Applica&on  
  • 29. 29  ©MapR  Technologies  -­‐  Confiden6al   I  Dunno   0 0.2 0.4 0.6 0.8 1 p Prob(p)
  • 30. 30  ©MapR  Technologies  -­‐  Confiden6al   5  heads  out  of  10  throws   0 0.2 0.4 0.6 0.8 1 p Prob(p)
  • 31. 31  ©MapR  Technologies  -­‐  Confiden6al   2  heads  out  of  12  throws   0 0.2 0.4 0.6 0.8 1 p Prob(p) Mean   Using  any  single  number  as  a  “best”   es6mate  denies  the  uncertain  nature  of   a  distribu6on   Adding  confidence  bounds  s6ll  loses  most  of   the  informa6on  in  the  distribu6on  and   prevents  good  modeling  of  the  tails  
  • 32. 32  ©MapR  Technologies  -­‐  Confiden6al   Bayesian  Bandit   §  Compute  distribu6ons  based  on  data   §  Sample  p1  and  p2  from  these  distribu6ons   §  Put  a  coin  in  bandit  1  if  p1  >  p2   §  Else,  put  the  coin  in  bandit  2  
  • 33. 33  ©MapR  Technologies  -­‐  Confiden6al   And  it  works!   11000 100 200 300 400 500 600 700 800 900 1000 0.12 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 n regret ε-greedy, ε = 0.05 Bayesian Bandit with Gamma-Normal
  • 34. 34  ©MapR  Technologies  -­‐  Confiden6al   Video  Demo  
  • 35. 35  ©MapR  Technologies  -­‐  Confiden6al   The  Code   §  Select  an  alterna6ve   §  Select  and  learn   §  But  we  already  know  how  to  count!   n = dim(k)[1]! p0 = rep(0, length.out=n)! for (i in 1:n) {! p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)! }! return (which(p0 == max(p0)))! for (z in 1:steps) {! i = select(k)! j = test(i)! k[i,j] = k[i,j]+1! }! return (k)!
  • 36. 36  ©MapR  Technologies  -­‐  Confiden6al   The  Basic  Idea   §  We  can  encode  a  distribu6on  by  sampling   §  Sampling  allows  unifica6on  of  explora6on  and  exploita6on   §  Can  be  extended  to  more  general  response  models   §  Note  that  learning  here  =  coun6ng  =  on-­‐line  algorithm  
  • 37. 37  ©MapR  Technologies  -­‐  Confiden6al   Generalized  Banditry   §  Suppose  we  have  an  infinite  number  of  bandits   –  suppose  they  are  each  labeled  by  two  real  numbers  x  and  y  in  [0,1]   –  also  that  expected  payoff  is  a  parameterized  func6on  of  x  and  y   –  now  assume  a  distribu6on  for  θ  that  we  can  learn  online   §  Selec6on  works  by  sampling  θ,  then  compu6ng  f   §  Learning  works  by  propaga6ng  updates  back  to  θ   –  If  f  is  linear,  this  is  very  easy   §  Don’t  just  have  to  have  two  labels,  could  have  labels  and  context     E z[ ]= f (x, y |θ)
  • 38. 38  ©MapR  Technologies  -­‐  Confiden6al   Caveats   §  Original  Bayesian  Bandit  only  requires  real-­‐6me   §  Generalized  Bandit  may  require  access  to  long  history  for  learning   –  Pseudo  online  learning  may  be  easier  than  true  online   §  Bandit  variables  can  include  content,  6me  of  day,  day  of  week   §  Context  variables  can  include  user  id,  user  features   §  Bandit  ×  context  variables  provide  the  real  power  
  • 39. 39  ©MapR  Technologies  -­‐  Confiden6al   §  Contact:   –  tdunning@maprtech.com   –  @ted_dunning   §  Slides  and  such  (just  don’t  believe  the  metrics):   –  hEp://slideshare.net/tdunning   §  Hash  tags:  #mapr  #storm      
  • 40. 40  ©MapR  Technologies  -­‐  Confiden6al   Thank  You