Diese Präsentation wurde erfolgreich gemeldet.
Die SlideShare-Präsentation wird heruntergeladen. ×

ENAR short course

Nächste SlideShare
Enar short course
Enar short course
Wird geladen in …3

Hier ansehen

1 von 309 Anzeige

Weitere Verwandte Inhalte

Diashows für Sie (20)


Ähnlich wie ENAR short course (20)

Aktuellste (20)


ENAR short course

  1. 1. Sta$s$cal  Compu$ng     For  Big  Data   Deepak  Agarwal   LinkedIn  Applied  Relevance  Science   dagarwal@linkedin.com   ENAR  2014,  Bal$more,  USA  
  2. 2. Main  Collaborators:  several  others  at  both  Y!   and  LinkedIn   •  I  won’t  be  here  without  them,  extremely  lucky  to  work  with  such  talented   individuals   Bee-Chung Chen Liang Zhang Bo Long Jonathan Traupman Paul Ogilvie
  3. 3. Structure  of  This  Tutorial   •  Part  I:  Introduc$on  to  Map-­‐Reduce  and  the   Hadoop  System   –  Overview  of  Distributed  Compu$ng   –  Introduc$on  to  Map-­‐Reduce   –  Some  sta$s$cal  computa$ons  using  Map-­‐Reduce   •  Bootstrap,  Logis$c  Regression   •  Part  II:  Recommender  Systems  for  Web   Applica$ons   –  Introduc$on   –  Content  Recommenda$on   –  Online  Adver$sing  
  4. 4. Big  Data  becoming  Ubiquitous   •  Bioinforma$cs   •  Astronomy   •  Internet   •  Telecommunica$ons   •  Climatology   •  …    
  5. 5. Big  Data:  Some  size  es$mates   •  1000  human  genomes:  >  100TB  of  data  (1000   genomes  project)   •  Sloan  Digital  Sky  Survey:  200GB  data  per  night   (>140TB  aggregated)   •  Facebook:  A  billion  monthly  ac$ve  users   •  LinkedIn:      roughly  >  280M  members  worldwide   •  Twiaer:  >  500  million  tweets  a  day   •  Over  6  billion  mobile  phones  in  the  world   genera$ng  data  everyday  
  6. 6. Big  Data:  Paradigm  shid   •  Classical  Sta$s$cs   –  Generalize  using  small  data     •  Paradigm  Shid  with  Big  Data   –  We  now  have  an  almost  infinite  supply  of  data   –  Easy  Sta$s$cs  ?  Just  appeal  to  asympto$c  theory?   •  So  the  issue  is  mostly  computa$onal?   –  Not  quite   •  More  data  comes  with  more  heterogeneity   •  Need  to  change  our  sta$s$cal  thinking  to  adapt   –  Classical  sta$s$cs  s$ll  invaluable  to  think  about  big  data  analy$cs    
  7. 7. Some  Sta$s$cal  Challenges   •  Exploratory  Analysis  (EDA),  Visualiza$on   – Retrospec$ve  (on  Terabytes)   – More  Real  Time    (streaming  computa$ons  every   few  minutes/hours)   •  Sta$s$cal  Modeling   – Scale  (computa$onal  challenge)   – Curse  of  dimensionality     •  Millions  of  predictors,  heterogeneity   – Temporal  and  Spa$al  correla$ons  
  8. 8. Sta$s$cal  Challenges  con$nued   •  Experiments   – To  test  new  methods,  test  hypothesis  from   randomized  experiments   – Adap$ve  experiments     •  Forecas$ng     – Planning,  adver$sing   •  Many  more  I  are  not  fully  well  versed  in        
  9. 9. Defining  Big  Data     •  How  to  know  you  have  the  big  data  problem?   – Is  it  only  the  number  of  terabytes  ?   – What  about  dimensionality,  structured/ unstructured,  computa$ons  required,…   •  No  clear  defini$on,  different  point  of  views   – When  desired  computa$on  cannot  be  completed   in  the  s$pulated  $me  with  current  best  algorithm   using  cores  available  on  a  commodity  PC      
  10. 10.  Distributed  Compu$ng  for  Big  Data       •  Distributed  compu$ng  invaluable  tool  to  scale   computa$ons  for  big  data   •  Some  distributed  compu$ng  models   – Mul$-­‐threading   – Graphics  Processing  Units  (GPU)   – Message  Passing  Interface  (MPI)   – Map-­‐Reduce  
  11. 11. Evalua$ng  a  method  for  a  problem   •  Scalability   –  Process  X  GB  in  Y  hours   •  Ease  of  use  for  a  sta$s$cian     •  Reliability  (fault  tolerance)   –  Especially  in  an  industrial  environment   •  Cost   –  Hardware  and  cost  of  maintaining   •  Good  for  the  computa$ons  required?   –  E.g.,  Itera$ve  versus  one  pass   •  Resource  sharing  
  12. 12. Mul$threading   •  Mul$ple  threads  take  advantage  of  mul$ple   CPUs   •  Shared  memory   •  Threads  can  execute  independently  and   concurrently   •  Can  only  handle  Gigabytes  of  data   •  Reliable  
  13. 13. Graphics  Processing  Units  (GPU)   •  Number  of  cores:   –  CPU:  Order  of  10   –  GPU:  smaller  cores   •  Order  of  1000   •  Can  be  >100x  faster  than  CPU   –  Parallel  computa$onally  intensive  tasks  off-­‐loaded  to  GPU   •  Good  for  certain  computa$onally-­‐intensive  tasks   •  Can  only  handle  Gigabytes  of  data   •  Not  trivial  to  use,  requires  good  understanding  of  low-­‐level  architecture   for  efficient  use   –  But  things  changing,  it  is  gemng  more  user  friendly  
  14. 14. Message  Passing  Interface  (MPI)   •  Language  independent  communica$on   protocol  among  processes  (e.g.  computers)   •  Most  suitable  for  master/slave  model   •  Can  handle  Terabytes  of  data   •  Good  for  itera$ve  processing   •  Fault  tolerance  is  low  
  15. 15. Map-­‐Reduce  (Dean  &  Ghemawat,   2004)   Mappers   Reducers   Data   Output   •  Computa$on  split  to  Map   (scaaer)  and  Reduce  (gather)   stages   •  Easy  to  Use:     –  User  needs  to  implement  two   func$ons:  Mapper  and   Reducer   •  Easily  handles  Terabytes  of   data   •  Very  good  fault  tolerance   (failed  tasks  automa$cally   get  restarted)  
  16. 16. Comparison  of  Distributed  Compu$ng  Methods   Mul$threading   GPU   MPI   Map-­‐Reduce   Scalability  (data   size)   Gigabytes   Gigabytes   Terabytes   Terabytes   Fault  Tolerance   High   High   Low   High   Maintenance  Cost   Low   Medium   Medium   Medium-­‐High   Itera$ve  Process   Complexity   Cheap   Cheap   Cheap   Usually   expensive   Resource  Sharing   Hard   Hard   Easy   Easy   Easy  to  Implement?   Easy   Needs   understanding   of  low-­‐level  GPU   architecture   Easy   Easy  
  17. 17. Example  Problem   •  Tabula$ng  word  counts  in  corpus  of   documents   •  Similar  to  table  func$on  in  R  
  18. 18. Word  Count  Through  Map-­‐Reduce   Hello  World   Bye  World     Hello  Hadoop   Goodbye  Hadoop   Mapper  1   <Hello,  1>   <Hadoop,  1>   <Goodbye,  1>   <Hadoop,1>   <Hello,  1>   <World,  1>   <Bye,  1>   <World,1>   Mapper  2   Reducer  1   Words  from  A-­‐G   Reducer  2   Words  from  H-­‐Z   <Bye,  1>   <Goodbye,  1>   <Hello,  2>   <World,  2>   <Hadoop,  2>  
  19. 19. Key  Ideas  about  Map-­‐Reduce   Big  Data   Par$$on  1   Par$$on  2   …   Par$$on  N   Mapper  1   Mapper  2   …   Mapper  N   <Key,  Value>   <Key,  Value>  <Key,  Value>  <Key,  Value>   Reducer  1   Reducer  2   Reducer  M  …   Output  1   Output  1  Output  1  Output  1  
  20. 20. Key  Ideas  about  Map-­‐Reduce   •  Data  are  split  into  par$$ons  and  stored  in  many   different  machines  on  disk  (distributed  storage)   •  Mappers  process  data  chunks  independently    and   emit  <Key,  Value>  pairs   •  Data  with  the  same  key  are  sent  to  the  same   reducer.  One  reducer  can  receive  mul$ple  keys   •  Every  reducer  sorts  its  data  by  key   •  For  each  key,  the  reducer  processes  the  values   corresponding  to  the  key  according  to  the   customized  reducer  func$on  and  output  
  21. 21. Compute  Mean  for  Each  Group   ID   Group  No.   Score   1   1   0.5   2   3   1.0   3   1   0.8   4   2   0.7   5   2   1.5   6   3   1.2   7   1   0.8   8   2   0.9   9   4   1.3   …   …   …  
  22. 22. Key  Ideas  about  Map-­‐Reduce   •  Data  are  split  into  par$$ons  and  stored  in  many  different  machines  on   disk  (distributed  storage)   •  Mappers  process  data  chunks  independently    and  emit  <Key,  Value>  pairs   –  For  each  row:   •  Key  =  Group  No.   •  Value  =  Score   •  Data  with  the  same  key  are  sent  to  the  same  reducer.  One  reducer  can   receive  mul$ple  keys   –  E.g.  2  reducers   –  Reducer  1  receives  data  with  key  =  1,  2   –  Reducer  2  receives  data  with  key  =  3,  4   •  Every  reducer  sorts  its  data  by  key   –  E.g.  Reducer  1:  <key  =  1,  values=[0.5,  0.8,  0.8]>,  <key=2,  values=<0.7,  1.5,  0.9>   •  For  each  key,  the  reducer  processes  the  values  corresponding  to  the  key   according  to  the  customized  reducer  func$on  and  output   –  E.g.  Reducer  1  output:  <1,  mean(0.5,  0.8,  0.8)>,  <2,  mean(0.7,  1.5,  0.9)>  
  23. 23. Key  Ideas  about  Map-­‐Reduce   •  Data  are  split  into  par$$ons  and  stored  in  many  different  machines  on   disk  (distributed  storage)   •  Mappers  process  data  chunks  independently    and  emit  <Key,  Value>  pairs   –  For  each  row:   •  Key  =  Group  No.   •  Value  =  Score   •  Data  with  the  same  key  are  sent  to  the  same  reducer.  One  reducer  can   receive  mul$ple  keys   –  E.g.  2  reducers   –  Reducer  1  receives  data  with  key  =  1,  2   –  Reducer  2  receives  data  with  key  =  3,  4   •  Every  reducer  sorts  its  data  by  key   –  E.g.  Reducer  1:  <key  =  1,  values=[0.5,  0.8,  0.8]>,  <key=2,  values=<0.7,  1.5,  0.9>   •  For  each  key,  the  reducer  processes  the  values  corresponding  to  the  key   according  to  the  customized  reducer  func$on  and  output   –  E.g.  Reducer  1  output:  <1,  mean(0.5,  0.8,  0.8)>,  <2,  mean(0.7,  1.5,  0.9)>   What  you  need   to  implement  
  24. 24. Mapper:   Input:  Data   for  (row  in  Data)   {    groupNo  =  row$groupNo;    score  =  row$score;    Output(c(groupNo,  score));   }   Reducer:   Input:  Key  (groupNo),  List  Value  (a  list  of  scores  that  belong  to  the  Key)   count  =  0;   sum  =  0;   for  (v  in  Value)   {    sum  +=  v;    count++;   }   Output(c(Key,  sum/count));   Pseudo  Code  (in  R)  
  25. 25. Exercise  1   •  Problem:  Average  height  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   •  What  should  be  the  mapper  output  value?   •  What  are  the  reducer  input?   •  What  are  the  reducer  output?   •  Write  mapper  and  reducer  for  this?   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  
  26. 26. •  Problem:  Average  height  per  Grade  and  Gender?   •  What  should  be  the  mapper  output  key?   –  {Grade,  Gender}   •  What  should  be  the  mapper  output  value?   –  Height   •  What  are  the  reducer  input?   –  Key:  {Grade,  Gender},  Value:  List  of  Heights     •  What  are  the  reducer  output?   –  {Grade,  Gender,  mean(Heights)}   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  
  27. 27. Exercise  2   •  Problem:  Number  of  students  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   •  What  should  be  the  mapper  output  value?   •  What  are  the  reducer  input?   •  What  are  the  reducer  output?   •  Write  mapper  and  reducer  for  this?   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  
  28. 28. •  Problem:  Number  of  students  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   –  {Grade,  Gender}   •  What  should  be  the  mapper  output  value?   –  1   •  What  are  the  reducer  input?   –  Key:  {Grade,  Gender},  Value:  List  of  1’s   •  What  are  the  reducer  output?   –  {Grade,  Gender,  sum(value  list)}   –  OR:  {Grade,  Gender,  length(value  list)}   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  
  29. 29. More  on  Map-­‐Reduce   •  Depends  on  distributed  file  systems   •  Typically  mappers  are  the  data  storage  nodes   •  Map/Reduce  tasks  automa$cally  get  restarted   when  they  fail  (good  fault  tolerance)   •  Map  and  Reduce  I/O  are  all  on  disk   –  Data  transmission  from  mappers  to  reducers  are   through  disk  copy   •  Itera$ve  process  through  Map-­‐Reduce   –  Each  itera$on  becomes  a  map-­‐reduce  job   –  Can  be  expensive  since  map-­‐reduce  overhead  is  high  
  30. 30. The  Apache  Hadoop  System   •  An  open-­‐source  sodware  for  reliable,  scalable,   distributed  compu$ng     •  The  most  popular  distributed  compu$ng   system  in  the  world   •  Key  modules:   – Hadoop  Distributed  File  System  (HDFS)   – Hadoop  YARN  (job  scheduling  and  cluster   resource  management)   – Hadoop  MapReduce  
  31. 31. Major  Tools  on  Hadoop   •  Pig   –  A  high-­‐level  language  for  Map-­‐Reduce  computa$on   •  Hive   –  A  SQL-­‐like  query  language  for  data  querying  via  Map-­‐Reduce   •  Hbase   –  A  distributed  &  scalable  database  on  Hadoop   –  Allows  random,  real  $me  read/write  access  to  big  data   –  Voldemort  is  similar  to  Hbase   •  Mahout   –  A  scalable  machine  learning  library   •  …  
  32. 32. Hadoop  Installa$on   •  Semng  up  Hadoop  on  your  desktop/laptop:   – hap://hadoop.apache.org/docs/stable/ single_node_setup.html   •  Semng  up  Hadoop  on  a  cluster  of  machines   – hap://hadoop.apache.org/docs/stable/ cluster_setup.html  
  33. 33. Hadoop  Distributed  File  System  (HDFS)   •  Master/Slave  architecture   •  NameNode:  a  single  master  node  that  controls  which   data  block  is  stored  where.     •  DataNodes:  slave  nodes  that  store  data  and  do  R/W   opera$ons   •  Clients  (Gateway):  Allow  users  to  login  and  interact   with  HDFS  and  submit  Map-­‐Reduce  jobs   •  Big  data  is  split  to  equal-­‐sized  blocks,  each  block  can  be   stored  in  different  DataNodes   •  Disk  failure  tolerance:  data  is  replicated  mul$ple  $mes  
  34. 34. Load  the  Data  into  Pig   •  A  =  LOAD  ‘Sample-­‐1.dat'  USING  PigStorage()  AS   (ID  :  int,  groupNo:  int,  score:  float);     –  The  path  of  the  data  on  HDFS  ader  LOAD     •  USING  PigStorage()  means  delimit  the  data  by  tab   (can  be  omiaed)   •  If  data  are  delimited  by  other  characters,  e.g.   space,  use  USING  PigStorage(‘  ‘)     •  Data  schema  defined  ader  AS     •  Variable  types:  int,  long,  float,  double,  chararray,   …  
  35. 35. Structure  of  This  Tutorial   •  Part  I:  Introduc$on  to  Map-­‐Reduce  and  the   Hadoop  System   – Overview  of  Distributed  Compu$ng   – Introduc$on  to  Map-­‐Reduce   – Introduc$on  to  the  Hadoop  System   –   Examples  of  Sta$s$cal  Compu$ng  for  Big  Data   •  Bag  of  Liale  Bootstraps   •  Large  Scale  Logis$c  Regression  
  36. 36. Bag  of  Liale  Bootstraps   Kleiner  et  al.  2012  
  37. 37. Bootstrap  (Efron,  1979)   •  A  re-­‐sampling  based  method  to  obtain  sta$s$cal   distribu$on  of  sample  es$mators   •  Why  are  we  interested  ?   –  Re-­‐sampling  is  embarrassingly  parallelizable     •  For  example:   –  Standard  devia$on  of  the  mean  of  N  samples  (μ)   –  For  i  =  1  to  r  do     •  Randomly  sample  with  replacement  N  $mes  from  the  original   sample  -­‐>  bootstrap  data  i   •  Compute  mean  of  the  i-­‐th  bootstrap  data  -­‐>  μi   –  Es$mate  of  Sd(μ)  =  Sd([μ1,…μr])   –  r  is  usually  a  large  number,  e.g.  200  
  38. 38. Bootstrap  for  Big  Data   •  Can  have  r  nodes  running  in  parallel,  each   sampling  one  bootstrap  data   •  However…   – N  can  be  very  large   – Data  may  not  fit  into  memory   – Collec$ng  N  samples  with  replacement  on  each   node  can  be  computa$onally  expensive  
  39. 39. M  out  of  N  Bootstrap     (Bikel  et  al.  1997)   •  Obtain  SdM(μ)  by  sampling  M  samples  with   replacement  for  each  bootstrap,  where  M<N   •  Apply  analy$cal  correc$on  to  SdM(μ)  to  obtain   Sd(μ)  using  prior  knowledge  of  convergence  rate   of  sample  es$mates   •  However…   –  Prior  knowledge  is  required   –  Choice  of  M  is  cri$cal  to  performance   –  Finding  op$mal  value  of  M  needs  more    computa$on  
  40. 40. Bag  of  Liale  Bootstraps  (BLB)   •  Example:  Standard  devia$on  of  the  mean     •  Generate  S  sampled  data  sets,  each  obtained  by  random   sampling  without  replacement  a  subset  of  size  b  (or   par$$on  the  original  data  into  S  par$$ons,  each  with  size   b)   •  For  each  data  p  =  1  to  S  do   –  For  i  =  1  to  r  do     •  N  samples  with  replacement  on  data  of  size  b   •  Compute  mean  of  the  resampled  data  μpi   –  Compute  Sdp(μ)  =  Sd([μp1,…μpr])   •  Es$mate  of  Sd(μ)  =  Avg([Sd1(μ),…,  SdS(μ)])  
  41. 41. Bag  of  Liale  Bootstraps  (BLB)   •  Interest:  ξ(θ),  where  θ  is  an  es$mate  obtained  from   size  N  data   –   ξ  is  some  func$on  of  θ,  such  as  standard  devia$on,  …   •  Generate  S  sampled  data  sets,  each  obtained  from  random   sampling  without  replacement  a  subset  of  size  b  (or  par$$on   the  original  data  into  S  par$$ons,  each  with  size  b)   •  For  each  data  p  =  1  to  S  do   –  For  i  =  1  to  r  do     •  Sample  N  samples  with  replacement  on  data  of  size  b   •  Compute  mean  of  the  resampled  data  θpi   –  Compute  ξp(θ)  =  ξ([θp1,…θpr])   •  Es$mate  of  ξ(μ)  =  Avg([ξ1(θ),…,  ξS(θ)])  
  42. 42. Bag  of  Liale  Bootstraps  (BLB)   •  Interest:  ξ(θ),  where  θ  is  an  es$mate  obtained  from   size  N  data   –   ξ  is  some  func$on  of  θ,  such  as  standard  devia$on,  …   •  Generate  S  sampled  data  sets,  each  obtained  from  random   sampling  without  replacement  a  subset  of  size  b  (or  par$$on   the  original  data  into  S  par$$ons,  each  with  size  b)   •  For  each  data  p  =  1  to  S  do   –  For  i  =  1  to  r  do     •  Sample  N  samples  with  replacement  on  the  data  of  size  b   •  Compute  mean  of  the  resampled  data  θpi   –  Compute  ξp(θ)  =  ξ([θp1,…θpr])   •  Es$mate  of  ξ(μ)  =  Avg([ξ1(θ),…,  ξS(θ)])   Mapper  Reducer   Gateway  
  43. 43. Why  is  BLB  Efficient   •  Before:   – N  samples  with  replacement  from  size  N  data  is   expensive  when  N  is  large   •  Now:   – N  samples  with  replacement  from  size  b  data   – b  can  be  several  magnitude  smaller  than  N  (e.g.  b   =  Nγ,  γ  in  [0.5,  1))   – Equivalent  to:  A  mul$nomial  sampler  with  dim  =  b   – Storage  =  O(b),  Computa$onal  complexity  =  O(b)  
  44. 44. Simula$on  Experiment   •  95%  CI  of  Logis$c  Regression  Coefficients   •  N  =  20000,  10  explanatory  variables   •  Rela$ve  Error  =  |Es$mated  CI  width  –  True  CI   width  |  /  True  CI  width   •  BLB-­‐γ:  BLB  with  b  =  Nγ   •  BOFN-­‐γ:  b  out  of  N  sampling  with  b  =  Nγ   •  BOOT:  Naïve  bootstrap  
  45. 45. Simula$on  Experiment  
  46. 46. Real  Data   •  95%  CI  of  Logis$c  Regression  Coefficients   •  N  =  6M,  3000  explanatory  variables   •  Data  size  =  150GB,  r  =  50,  s  =  5,  γ  =  0.7  
  47. 47. Summary  of  BLB   •  A  new  algorithm  for  bootstrapping  on  big  data   •  Advantages   – Fast  and  efficient   – Easy  to  parallelize   – Easy  to  understand  and  implement   – Friendly  to  Hadoop,  makes  it  rou$ne  to  perform   sta$s$cal  calcula$ons  on  Big  data  
  48. 48. Large  Scale  Logis$c  Regression  
  49. 49. Logis$c  Regression   •  Binary  response:  Y   •  Covariates:  X   •  Yi  ~  Bernoulli(pi)   •  log(pi/(1-­‐pi))  =  Xi Tβ  ;    β  ~  MVN(0  ,  1/λ  I  )   •  Widely  used  (research  and  applica$ons)  
  50. 50. Large  Scale  Logis$c  Regression   •  Binary  response:  Y   –  E.g.,  Click  /  Non-­‐click  on  an  ad  on  a  webpage   •  Covariates:  X   –  User  covariates:     •  Age,  gender,  industry,  educa$on,  job,  job  $tle,  …   –  Item  covariates:   •  Categories,  keywords,  topics,  …   –  Context  covariates:   •  Time,  page  type,  posi$on,  …   –  2-­‐way  interac$on:   •  User  covariates  X  item  covariates   •  Context  covariates  X  item  covariates   •  …  
  51. 51. Computa$onal  Challenge   •  Hundreds  of  millions/billions  of  observa$ons     •  Hundreds  of  thousands/millions  of  covariates   •  Fimng  such  a  logis$c  regression  model  on  a   single  machine  not  feasible   •  Model  fimng  itera$ve  using  methods  like   gradient  descent,  Newton’s  method  etc   – Mul$ple  passes  over  the  data  
  52. 52. Recap  on  Op$miza$on  method   •  Problem:  Find  x  to  min(F(x))   •  Itera$on  n:  xn  =  xn-­‐1  –  bn-­‐1  F’(xn-­‐1)   •   bn-­‐1  is  the  step  size  that  can  change  every   itera$on   •  Iterate  un$l  convergence     •  Conjugate  gradient,  LBFGS,  Newton  trust   region,  …  all  of  this  kind  
  53. 53. Itera$ve  Process  with  Hadoop     Disk   Mappers   Disk   Reducers   Disk  Mappers  Disk  Reducers   Disk   Mappers   Disk   Reducers  
  54. 54. Limita$ons  of  Hadoop  for  fimng  a  big   logis$c  regression   •  Itera$ve  process  is  expensive  and  slow   •  Every  itera$on  =  a  Map-­‐Reduce  job   •  I/O  of  mapper  and  reducers  are  both  through   disk   •  Plus:  Wai$ng  in  queue  $me     •  Q:  Can  we  find  a  fimng  method  that  scales   with  Hadoop  ?  
  55. 55. Large  Scale  Logis$c  Regression   •  Naïve:     –  Par$$on  the  data  and  run  logis$c  regression  for  each  par$$on   –  Take  the  mean  of  the  learned  coefficients   –  Problem:  Not  guaranteed  to  converge  to  the  model  from  single   machine!   •  Alterna$ng  Direc$on  Method  of  Mul$pliers  (ADMM)   –  Boyd  et  al.  2011   –  Set  up  constraints:  each  par$$on’s  coefficient  =  global   consensus   –  Solve  the  op$miza$on  problem  using  Lagrange  Mul$pliers   –  Advantage:  guaranteed  to  converge  to  a  single  machine  logis$c   regression  on  the  en$re  data  with  reasonable  number  of   itera$ons    
  56. 56. Large  Scale  Logis$c  Regression  via  ADMM   BIG  DATA   Par$$on  1   Par$$on  2   Par$$on  3   Par$$on  K   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Consensus   Computa$on   Iteration 1
  57. 57. Large  Scale  Logis$c  Regression  via  ADMM   BIG  DATA   Par$$on  1   Par$$on  2   Par$$on  3   Par$$on  K   Logis$c   Regression   Consensus   Computa$on   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Iteration 1
  58. 58. Large  Scale  Logis$c  Regression  via  ADMM   BIG  DATA   Par$$on  1   Par$$on  2   Par$$on  3   Par$$on  K   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Consensus   Computa$on   Iteration 2
  59. 59. Details  of  ADMM  
  60. 60. Dual  Ascent  Method   •  Consider  a  convex  op$miza$on  problem     •  Lagrangian  for  the  problem:     •  Dual  Ascent:   round and motivation. Dual Ascent der the equality-constrained convex optimization problem minimize f(x) subject to Ax = b, (2. ariable x ∈ Rn , where A ∈ Rm×n and f : Rn → R is convex. e Lagrangian for problem (2.1) is L(x,y) = f(x) + yT (Ax − b) he dual function is g(y) = inf x L(x,y) = −f∗ (−AT y) − bT y, y is the dual variable or Lagrange multiplier, and f∗ is the conv round and motivation. Dual Ascent der the equality-constrained convex optimization problem minimize f(x) subject to Ax = b, (2. variable x ∈ Rn , where A ∈ Rm×n and f : Rn → R is convex. he Lagrangian for problem (2.1) is L(x,y) = f(x) + yT (Ax − b) he dual function is g(y) = inf x L(x,y) = −f∗ (−AT y) − bT y, y is the dual variable or Lagrange multiplier, and f∗ is the conv gate of f; see [20, §3.3] or [140, §12] for background. The du rimal optimal point x from a dual optimal point y as x = argmin x L(x,y ), vided there is only one minimizer of L(x,y ). (This is the case e.g., f is strictly convex.) In the sequel, we will use the notation minx F(x) to denote any minimizer of F, even when F does not e a unique minimizer. In the dual ascent method, we solve the dual problem using gradient ent. Assuming that g is differentiable, the gradient g(y) can be uated as follows. We first find x+ = argminx L(x,y); then we have (y) = Ax+ − b, which is the residual for the equality constraint. The l ascent method consists of iterating the updates xk+1 := argmin x L(x,yk ) (2.2) yk+1 := yk + αk (Axk+1 − b), (2.3) ere αk > 0 is a step size, and the superscript is the iteration counter.
  61. 61. Augmented  Lagrangians   •  Bring  robustness  to  the  dual  ascent  method   •  Yield  convergence  without  assump$ons  like  strict   convexity  or  finiteness  of  f   •      •  The  value  of  ρ  influences  the  convergence  rate   aph-structured optimization problems. Augmented Lagrangians and the Method of Multiplie mented Lagrangian methods were developed in part to br tness to the dual ascent method, and in particular, to yield c nce without assumptions like strict convexity or finiteness of augmented Lagrangian for (2.1) is Lρ(x,y) = f(x) + yT (Ax − b) + (ρ/2) Ax − b 2 2, (22.3 Augmented Lagrangians and the Method where ρ > 0 is called the penalty parameter. (Note standard Lagrangian for the problem.) The augmen can be viewed as the (unaugmented) Lagrangian asso problem 2
  62. 62. Alterna$ng  Direc$on  Method  of   Mul$pliers  (ADMM)   •  Problem:     •  Augmented  Lagrangians     •  ADMM:   MM is an algorithm that is intended to blend the decompos dual ascent with the superior convergence properties of the m multipliers. The algorithm solves problems in the form minimize f(x) + g(z) subject to Ax + Bz = c h variables x ∈ Rn and z ∈ Rm , where A ∈ Rp×n , B ∈ Rp×m Rp . We will assume that f and g are convex; more specific as ns will be discussed in §3.2. The only difference from the g ear equality-constrained problem (2.1) is that the variable, ca re, has been split into two parts, called x and z here, with the e function separable across this splitting. The optimal value blem (3.1) will be denoted by p = inf{f(x) + g(z) | Ax + Bz = c}. with variables x ∈ Rn and z ∈ Rm , where A ∈ Rp×n , B ∈ Rp×m , and c ∈ Rp . We will assume that f and g are convex; more specific assump- tions will be discussed in §3.2. The only difference from the general linear equality-constrained problem (2.1) is that the variable, called x there, has been split into two parts, called x and z here, with the objec- tive function separable across this splitting. The optimal value of the problem (3.1) will be denoted by p = inf{f(x) + g(z) | Ax + Bz = c}. As in the method of multipliers, we form the augmented Lagrangian Lρ(x,z,y) = f(x) + g(z) + yT (Ax + Bz − c) + (ρ/2) Ax + Bz − c 2 2. 13 14 Alternating Direction Method of Multipliers ADMM consists of the iterations xk+1 := argmin x Lρ(x,zk ,yk ) (3.2) zk+1 := argmin z Lρ(xk+1 ,z,yk ) (3.3) yk+1 := yk + ρ(Axk+1 + Bzk+1 − c), (3.4) where ρ > 0. The algorithm is very similar to dual ascent and the
  63. 63. Large  Scale  Logis$c  Regression  via  ADMM   •  Nota$on     –  (Xi  ,  yi):  data  in  the  ith  par$$on   –  βi:  coefficient  vector  for  par$$on  i   –  β:  Consensus  coefficient  vector   –  r(β):  penalty  component  such  as  ||β||2 2     •  Op$miza$on  problem   Brief Article The Author July 7, 2013 min NX i=1 li(yi, XT i i) + r( ) subject to i =
  64. 64. ADMM  updates   LOCAL  REGRESSIONS   Shrinkage  towards  current   best  global  es$mate   UPDATED   CONSENSUS  
  65. 65. An  example  implementa$on   •  ADMM  for  Logis$c  regression  model  fimng  with   L2/L1  penalty   •  Each  itera$on  of  ADMM  is  a  Map-­‐Reduce  job   –  Mapper:  par$$on  the  data  into  K  par$$ons   –  Reducer:  For  each  par$$on,  use  liblinear/glmnet  to  fit   a  L1/L2  logis$c  regression   –  Gateway:  consensus  computa$on  by  results  from  all   reducers,  and  sends  back  the  consensus  to  each   reducer  node  
  66. 66. KDD  CUP  2010  Data   •  Bridge  to  Algebra  2008-­‐2009  data  in     haps://pslcdatashop.web.cmu.edu/KDDCup/ downloads.jsp   •  Binary  response,  20M  covariates   •  Only  keep  covariates  with  >=  10  occurrences   =>  2.2M  covariates   •  Training  data:  8,407,752  samples   •  Test  data  :  510,302  samples  
  67. 67. Avg  Training  Loglikelihood  vs  Number   of  Itera$ons  
  68. 68. Test  AUC  vs  Number  of  Itera$ons    
  69. 69. Beaer  Convergence  Can     Be  Achieved  By   •  Beaer  Ini$aliza$on   – Use  results  from  Naïve  method  to  ini$alize  the   parameters   •  Adap$vely  change  step  size  (ρ)  for  each   itera$on  based  on  the  convergence  status  of   the  consensus  
  70. 70. Recommender Problems for Web Applications
  71. 71. Agenda •  Topic of Interest –  Recommender problems for dynamic, time- sensitive applications •  Content Optimization, Online Advertising, Movie recommendation, shopping,… •  Introduction •  Offline components –  Regression, Collaborative filtering (CF), … •  Online components + initialization –  Time-series, online/incremental methods, explore/ exploit (bandit) •  Evaluation methods + Multi-Objective •  Challenges
  72. 72. Three components we will focus on •  Defining the problem –  Formulate objectives whose optimization achieves some long- term goals for the recommender system •  E.g. How to serve content to optimize audience reach and engagement, optimize some combination of engagement and revenue ? •  Modeling (to estimate some critical inputs) –  Predict rates of some positive user interaction(s) with items based on data obtained from historical user-item interactions •  E.g. Click rates, average time-spent on page, etc •  Could be explicit feedback like ratings •  Experimentation –  Create experiments to collect data proactively to improve models, helps in converging to the best choice(s) cheaply and rapidly. •  Explore and Exploit (continuous experimentation) •  DOE (testing hypotheses by avoiding bias inherent in data)
  73. 73. Modern Recommendation Systems •  Goal –  Serve the right item to a user in a given context to optimize long-term business objectives •  A scientific discipline that involves –  Large scale Machine Learning & Statistics •  Offline Models (capture global & stable characteristics) •  Online Models (incorporates dynamic components) •  Explore/Exploit (active and adaptive experimentation) –  Multi-Objective Optimization •  Click-rates (CTR), Engagement, advertising revenue, diversity, etc –  Inferring user interest •  Constructing User Profiles –  Natural Language Processing to understand content •  Topics, “aboutness”, entities, follow-up of something, breaking news,…
  74. 74. Some examples from content optimization •  Simple version –  I have a content module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module •  More advanced –  I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks? •  Highly advanced –  There are multiple modules running on my webpage. How do I perform a simultaneous optimization?
  75. 75. Recommend applications Recommend search queries Recommend news article Recommend packages: Image Title, summary Links to other pages Pick  4  out  of  a  pool  of  K          K  =  20  ~  40          Dynamic     Routes  traffic  other  pages  
  76. 76. Problems in this example •  Optimize CTR on multiple modules –  Today Module, Trending Now, Personal Assistant, News –  Simple solution: Treat modules as independent, optimize separately. May not be the best when there are strong correlations. •  For any single module –  Optimize some combination of CTR, downstream engagement, and perhaps advertising revenue.
  77. 77. Online Advertising Advertisers Ad Network Ads Page Recommend Best ad(s) User Publisher Response rates (click, conversion, ad-view) Bids Auction Click conversion Select argmax f(bid,response rates) ML /Statistical model Examples: Yahoo, Google, MSN, … Ad exchanges (RightMedia, DoubleClick, …)
  78. 78. LinkedIn  Today:  Content  Module   Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)
  79. 79. LinkedIn  Ads:  Match  ads  to  users   visi$ng  LinkedIn  
  80. 80. Right  Media  Ad  Exchange:  Unified   Marketplace   Match  ads  to  page  views  on  publisher  sites   Has  ad   impression  to   sell  -­‐-­‐   AUCTIONS   Bids  $0.50   Bids  $0.75  via  Network…   …  which  becomes   $0.45  bid   Bids  $0.65—WINS!   AdSense   Ad.com   Bids  $0.60  
  81. 81. Recommender problems in general USER     Item  Inventory   Ar$cles,  web  page,     ads,  …   Use  an  automated  algorithm     to  select  item(s)  to  show     Get  feedback  (click,  $me  spent,..)     Refine  the  models     Repeat  (large  number  of  :mes)   Op:mize  metric(s)  of  interest   (Total  clicks,  Total  revenue,…)   Example applications Search: Web, Vertical Online Advertising Content ….. Context   query,  page,  …  
  82. 82. •  Items: Articles, ads, modules, movies, users, updates, etc. •  Context: query keywords, pages, mobile, social media, etc. •  Metric to optimize (e.g., relevance score, CTR, revenue, engagement) –  Currently, most applications are single-objective –  Could be multi-objective optimization (maximize X subject to Y, Z,..) •  Properties of the item pool –  Size (e.g., all web pages vs. 40 stories) –  Quality of the pool (e.g., anything vs. editorially selected) –  Lifetime (e.g., mostly old items vs. mostly new items) Important Factors
  83. 83. Factors affecting Solution (continued) •  Properties of the context –  Pull: Specified by explicit, user-driven query (e.g., keywords, a form) –  Push: Specified by implicit context (e.g., a page, a user, a session) •  Most applications are somewhere on continuum of pull and push •  Properties of the feedback on the matches made –  Types and semantics of feedback (e.g., click, vote) –  Latency (e.g., available in 5 minutes vs. 1 day) –  Volume (e.g., 100K per day vs. 300M per day) •  Constraints specifying legitimate matches –  e.g., business rules, diversity rules, editorial Voice –  Multiple objectives •  Available Metadata (e.g., link graph, various user/item attributes)
  84. 84. Predicting User-Item Interactions (e.g. CTR) •  Myth: We have so much data on the web, if we can only process it the problem is solved –  Number of things to learn increases with sample size •  Rate of increase is not slow –  Dynamic nature of systems make things worse –  We want to learn things quickly and react fast •  Data is sparse in web recommender problems –  We lack enough data to learn all we want to learn and as quickly as we would like to learn –  Several Power laws interacting with each other •  E.g. User visits power law, items served power law –  Bivariate Zipf: Owen & Dyer, 2011
  85. 85. Can Machine Learning help? •  Fortunately, there are group behaviors that generalize to individuals & they are relatively stable –  E.g. Users in San Francisco tend to read more baseball news •  Key issue: Estimating such groups –  Coarse group : more stable but does not generalize that well. –  Granular group: less stable with few individuals –  Getting a good grouping structure is to hit the “sweet spot” •  Another big advantage on the web –  Intervene and run small experiments on a small population to collect data that helps rapid convergence to the best choices(s) •  We don’t need to learn all user-item interactions, only those that are good.
  86. 86. Predicting user-item interaction rates Offline   (  Captures  stable  characteris$cs   at  coarse  resolu$ons)   (Logis$c,  Boos$ng,….)     Feature    construc$on   Content:  IR,  clustering,  taxonomy,  en$ty,..     User  profiles:  clicks,  views,  social,  community,..   Near  Online   (Finer  resolu$on   Correc$ons)   (item,  user  level)   (Quick  updates)   Explore/Exploit   (Adap$ve  sampling)   (helps  rapid  convergence   to  best  choices)   Initialize
  87. 87. Post-click: An example in Content Optimization Recommender          EDITORIAL               content Clicks on FP links influence downstream supply distribution    AD  SERVER                DISPLAY          ADVERTISING   Revenue   Downstream   engagement      (Time  spent)  
  88. 88. Serving Content on Front Page: Click Shaping •  What do we want to optimize? •  Current: Maximize clicks (maximize downstream supply from FP) •  But consider the following –  Article 1: CTR=5%, utility per click = 5 –  Article 2: CTR=4.9%, utility per click=10 •  By promoting 2, we lose 1 click/100 visits, gain 5 utils •  If we do this for a large number of visits --- lose some clicks but obtain significant gains in utility? –  E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc)
  89. 89. High  level  picture   http request Statistical Models updated in Batch mode: e.g. once every 30 mins Server   Item   Recommenda$on   system:  thousands   of  computa$ons  in   sub-­‐seconds   User Interacts e.g. click, does nothing
  90. 90. High  level  overview:  Item   Recommenda$on  System   User Info Item Index Id, meta-data ML/ Statistical Models Score Items P(Click), P(share), Semantic-relevance score,…. Rank Items: sort by score (CTR,bid*CTR,..) combine scores using Multi-obj optim, Threshold on some scores,…. User-item interaction Data: batch process Updated in batch: Activity, profile Pre-filter SPAM,editorial,,.. Feature extraction NLP, cllustering,..
  91. 91. ML/Sta$s$cal  models  for  scoring   Number of items Scored by ML Traffic volume 1000100 100k 1M 100M Few hours Few days Several days LinkedIn Today Yahoo! Front Page Right Media Ad exchange LinkedIn Ads
  92. 92. Summary  of  deployments       •  Yahoo!  Front  page  Today  Module  (2008-­‐2011):  300%  improvement   in  click-­‐through  rates   –  Similar  algorithms  delivered  via  a  self-­‐serve  pla•orm,  adopted  by   several  Yahoo!  Proper$es  (2011):  Significant  improvement  in   engagement  across  Yahoo!  Network   •  Fully  deployed  on  LinkedIn  Today  Module  (2012):  Significant   improvement  in  click-­‐through  rates  (numbers  not  revealed  due  to   reasons  of  confiden$ality)   •  Yahoo!  RightMedia  exchange  (2012):  Fully  deployed  algorithms  to   es$mate  response  rates  (CTR,  conversion  rates).  Significant   improvement  in  revenue  (numbers  not  revealed  due  to  reasons  of   confiden$ality)   •  LinkedIn  self-­‐serve  ads  (2012-­‐2013):Fully  deployed   •  LinkedIn  News  Feed  (2013-­‐2014):  Fully  deployed   •  Several  others  in  progress….  
  93. 93. Broad  Themes   •  Curse  of  dimensionality   –  Large  number  of  observa$ons  (rows),  large  number  of  poten$al  features   (columns)   –  Use  domain  knowledge  and  machine  learning  to  reduce  the  “effec$ve”   dimension  (constraints  on  parameters  reduce  degrees  of  freedom)   •  I  will  give  examples  as  we  move  along   •   We  oden  assume  our  job  is  to  analyze  “Big  Data”  but  we  oden  have   control  on  what  data  to  collect  through  clever  experimenta$on   –  This  can  fundamentally  change  solu$ons   •  Think  of  computa$on  and  models  together  for  Big  data   •  Op$miza$on:  What  we  are  trying  to  op$mize  is  oden  complex,models  to   work  in  harmony  with  op$miza$on   –  Pareto  op$mality  with  compe$ng  objec$ves  
  94. 94. Sta$s$cal  Problem   •  Rank  items  (from  an  admissible  pool)  for  user  visits  in  some   context  to  maximize  a  u$lity  of  interest   •  Examples  of  u$lity  func$ons   –  Click-­‐rates  (CTR)   –  Share-­‐rates  (CTR*  [Share|Click]  )   –  Revenue  per  page-­‐view  =  CTR*bid  (more  complex  due  to  second  price   auc$on)   •  CTR  is  a  fundamental  measure  that  opens  the  door  to  a  more   principled  approach  to  rank  items   •  Converge  rapidly  to  maximum  u$lity  items     –  Sequen$al  decision  making  process  (explore/exploit)    
  95. 95. item  j  from  a  set  of  candidates   User  i     with   user  features   (e.g.,  industry,    behavioral  features,   Demographic  features, ……)                  (i,  j)  :  response  yij  visits   Algorithm  selects   (click  or  not)   Which  item  should  we  select?    Ÿ  The  item  with  highest  predicted  CTR    Ÿ  An  item  for  which  we  need  data  to            predict  its  CTR   Exploit   Explore   LinkedIn Today, Yahoo! Today Module: Choose Items to maximize CTR This is an “Explore/Exploit” Problem
  96. 96. The Explore/Exploit Problem (to maximize CTR) •  Problem definition: Pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items •  Easy!? Pick the items having the highest click-through rates (CTRs) •  But … –  The system is highly dynamic: •  Items come and go with short lifetimes •  CTR of each item may change over time –  How much traffic should be allocated to explore new items to achieve optimal performance ? •  Too little → Unreliable CTR estimates due to “starvation” •  Too much → Little traffic to exploit the high CTR items
  97. 97. Y!  front  Page  Applica$on   •  Simplify:  Maximize  CTR  on  first  slot  (F1)     •  Item  Pool   –  Editorially  selected  for  high  quality  and  brand  image   –  Few  ar$cles  in  the  pool  but  item  pool  dynamic    
  98. 98. CTR Curves of Items on LinkedIn Today CTR  
  99. 99. Impact  of  repeat  item  views  on  a  given   user   •  Same  user  is  shown  an  item  mul$ple  $mes   (despite  not  clicking)  
  100. 100. Simple  algorithm  to  es$mate  most  popular   item  with  small  but  dynamic  item  pool   •  Simple  Explore/Exploit  scheme –  ε%  explore:  with  a  small  probability  (e.g.  5%),  choose  an  item  at   random  from  the  pool   –  (100−ε)%  exploit:  with  large  probability  (e.g.  95%),  choose   highest  scoring  CTR  item   •  Temporal  Smoothing   –  Item  CTRs  change  over  $me,  provide  more  weight  to  recent  data  in   es$ma$ng  item  CTRs   •  Kalman  filter,  moving  average   •  Discount  item  score  with  repeat  views   –  CTR(item)  for  a  given  user  drops  with  repeat  views  by  some  “discount”   factor  (es$mated  from  data)   •  Segmented  most  popular   –  Perform  separate  most-­‐popular  for  each  user  segment    
  101. 101. Time  series  Model:  Kalman  filter   •  Dynamic  Gamma-­‐Poisson:  click-­‐rate  evolves  over  $me   in  a  mul$plica$ve  fashion   •  Es$mated  Click-­‐rate  distribu$on  at  $me  t+1     –  Prior  mean:   –  Prior  variance:       High  CTR  items  more  adap$ve  
  102. 102. More  economical  explora$on?  Beaer   bandit  solu$ons   •  Consider  two  armed  problem   p2 (unknown payoff probabilities) The  gambler  has  1000  plays,  what  is  the  best  way  to  experiment  ?                                                (to  maximize  total  expected  reward)      This  is  called  the  “mul$-­‐armed  bandit”  problem,  have  been  studied  for  a  long  $me.    Op$mal  solu$on:  Play  the  arm  that  has  maximum  poten:al  of  being  good    Op:mism  in  the  face  of  uncertainty   p1 >
  103. 103. Item  Recommenda$on:  Bandits?   •  Two  Items:  Item  1  CTR=  2/100  ;  Item  2  CTR=  250/10000   –  Greedy:  Show  Item  2  to  all;  not  a  good  idea   –  Item  1  CTR  es$mate  noisy;  item  could  be  poten$ally   beaer   •  Invest  in  Item  1  for  beaer  overall  performance  on  average       –  Exploit  what  is  known  to  be  good,  explore  what  is  poten$ally  good   CTR Probabilitydensity Item 2 Item 1
  104. 104. Next few hours Most Popular Recommendation Personalized Recommendation Offline Models Collaborative filtering (cold-start problem) Online Models Time-series models Incremental CF, online regression Intelligent Initialization Prior estimation Prior estimation, dimension reduction Explore/Exploit Multi-armed bandits Bandits with covariates
  105. 105. Offline Components: Collaborative Filtering in Cold-start Situations
  106. 106. Problem Item j with User i with user features xi (demographics, browse history, search history, …) item features xj (keywords, content categories, ...) (i, j) : response yijvisits Algorithm selects (explicit rating, implicit click/no-click) Predict the unobserved entries based on features and the observed entries
  107. 107. Model Choices •  Feature-based (or content-based) approach –  Use features to predict response •  (regression, Bayes Net, mixture models, …) –  Limitation: need predictive features •  Bias often high, does not capture signals at granular levels •  Collaborative filtering (CF aka Memory based) –  Make recommendation based on past user-item interaction •  User-user, item-item, matrix factorization, … •  See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc. –  Better performance for old users and old items –  Does not naturally handle new users and new items (cold- start)
  108. 108. Collaborative Filtering (Memory based methods) User-User Similarity Item-Item similarities, incorporating both Estimating Similarities Pearson’s correlation Optimization based (Koren et al)
  109. 109. How to Deal with the Cold-Start Problem •  Heuris$c-­‐based  approaches   –  Linear  combina$on  of  regression  and  CF  models     –  Filterbot   •  Add  user  features  as  psuedo  users  and  do  collabora$ve  filtering   -­‐  Hybrid  approaches   -­‐  Use  content  based  to  fill  up  entries,  then  use  CF   •  Matrix  Factoriza$on   –  Good  performance  on  Ne•lix  (Koren,  2009)   •  Model-­‐based  approaches     –  Bilinear  random-­‐effects  model  (probabilis$c  matrix  factoriza$on)   •  Good  on  Ne•lix  data  [Ruslan  et  al  ICML,  2009]   –  Add  feature-­‐based  regression  to  matrix  factoriza$on     •  (Agarwal  and  Chen,  2009)   –  Add  topic  discovery  (from  textual  items)  to  matrix  factoriza$on     •  (Agarwal  and  Chen,  2009;  Chun  and  Blei,  2011)  
  110. 110. Per-item regression models •  When tracking users by cookies, distribution of visit patters could get extremely skewed – Majority of cookies have 1-2 visits •  Per item models (regression) based on user covariates attractive in such cases
  111. 111. Several per-item regressions: Multi-task learning Low dimension (5-10), B estimated retrospective data •  Agarwal,Chen and Elango, KDD, 2010 Affinity to old items
  112. 112. Per-user, per-item models via bilinear random-effects model
  113. 113. Motivation •  Data measuring k-way interactions pervasive –  Consider k = 2 for all our discussions •  E.g. User-Movie, User-content, User-Publisher-Ads,…. –  Power law on both user and item degrees •  Classical Techniques –  Approximate matrix through a singular value decomposition (SVD) •  After adjusting for marginal effects (user pop, movie pop,..) –  Does not work •  Matrix highly incomplete, severe over-fitting –  Key issue •  Regularization of eigenvectors (factors) to avoid overfitting
  114. 114. Early work on complete matrices •  Tukey’s 1-df model (1956) –  Rank 1 approximation of small nearly complete matrix •  Criss-cross regression (Gabriel, 1978) •  Incomplete matrices: Psychometrics (1-factor model only; small data sets; 1960s) •  Modern day recommender problems –  Highly incomplete, large, noisy.
  115. 115. Latent Factor Models “newsy” “sporty” “newsy” s item v z Affinity = u’v Affinity = s’z u s p o r t y
  116. 116. Factorization – Brief Overview •  Latent user factors: (αi , ui=(ui1,…,uin)) •  (Nn + Mm) parameters •  Key technical issue: •  Latent movie factors: (βj , vj=(v j1,….,v jn)) will overfit for moderate values of n,m Regularization Interaction jijiij BvuyE ʹ′+++= βαµ)(
  117. 117. Latent Factor Models: Different Aspects •  Matrix Factorization – Factors in Euclidean space – Factors on the simplex •  Incorporating features and ratings simultaneously •  Online updates
  118. 118. Maximum Margin Matrix Factorization (MMMF) •  Complete matrix by minimizing loss (hinge,squared- error) on observed entries subject to constraints on trace norm –  Srebro, Rennie, Jakkola (NIPS 2004) –  Convex, Semi-definite programming (expensive, not scalable) •  Fast MMMF (Rennie & Srebro, ICML, 2005) –  Constrain the Frobenious norm of left and right eigenvector matrices, not convex but becomes scalable. •  Other variation: Ensemble MMMF (DeCoste, ICML2005) –  Ensembles of partially trained MMMF (some improvements)
  119. 119. Matrix Factorization for Netflix prize data •  Minimize the objective function •  Simon Funk: Stochastic Gradient Descent •  Koren et al (KDD 2007): Alternate Least Squares –  They move to SGD later in the competition ∑ ∑∑∈ ++− obsij j j i ij T iij vuvur )()( 222 λ
  120. 120. ui vj rij au av 2 σ Optimization is through Iterated conditional modes Other variations like constraining the mean through sigmoid, using “who-rated- whom” Combining with Boltzmann Machines also improved performance ),(~ ),(~ ),(~ 2 IaMVN IaMVN Nr vj ui j T iij 0v 0u vu σ Probabilis$c  Matrix  Factoriza$on   (Ruslan  &  Minh,  2008,  NIPS)  
  121. 121. Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008) •  Fully Bayesian treatment using an MCMC approach –  Significant improvement •  Interpretation as a fully Bayesian hierarchical model shows why that is the case –  Failing to incorporate uncertainty leads to bias in estimates –  Multi-modal posterior, MCMC helps in converging to a better one r Var-comp: au MCEM also more resistant to over-fitting
  122. 122. Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010) •  Specify rank probabilistically (automatic rank selection) )/)1(,/(~ )(~ ),(~ 1 2 rrbraBeta Berz vuzNy k kk r k jkikkij − ∑= π π σ ))1(/(Factors)#( )))1(/(,1(~ −+= −+ rbaraE rbaaBerzk
  123. 123. How to incorporate features: Deal with both warm start and cold-start •  Models to predict ratings for new pairs –  Warm-start: (user, movie) present in the training data with large sample size –  Cold-start: At least one of (user, movie) new or has small sample size •  Rough definition, warm-start/cold-start is a continuum. •  Challenges –  Highly incomplete (user, movie) matrix –  Heavy tailed degree distributions for users/movies •  Large fraction of ratings from small fraction of users/ movies –  Handling both warm-start and cold-start effectively in the presence of predictive features
  124. 124. Possible approaches •  Large scale regression based on covariates –  Does not provide good estimates for heavy users/movies –  Large number of predictors to estimate interactions •  Collaborative filtering –  Neighborhood based –  Factorization •  Good for warm-start; cold-start dealt with separately •  Single model that handles cold-start and warm-start –  Heavy users/movies → User/movie specific model –  Light users/movies → fallback on regression model –  Smooth fallback mechanism for good performance
  125. 125. Add Feature-based Regression into Matrix Factorization RLFM: Regression-based Latent Factor Model
  126. 126. Regression-based Factorization Model (RLFM) •  Main idea: Flexible prior, predict factors through regressions •  Seamlessly handles cold-start and warm- start •  Modified state equation to incorporate covariates
  127. 127. RLFM: Model Rating: ),(~ 2 σµijij Ny )(~ ijij Bernoulliy µ )(~ ijijij NPoissony µ Gaussian Model Logistic Model (for binary rating) Poisson Model (for counts) j t iji t ijij vubxt +++= βαµ )( user i gives item j Bias of user i: ),0(~, 2 0 α αα σεεα Nxg iii t i += Popularity of item j: ),0(~, 2 0 β ββ σεεβ Nxd jjj t j += Factors of user i: ),0(~, 2 INGxu u u i u iii σεε+= Factors of item j: ),0(~, 2 INDxv v v i v iji σεε+= Could use other classes of regression models
  128. 128. Graphical representation of the model
  129. 129. Advantages of RLFM •  Better regularization of factors –  Covariates “shrink” towards a better centroid •  Cold-start: Fallback regression model (FeatureOnly)
  130. 130. RLFM: Illustration of Shrinkage Plot the first factor value for each user (fitted using Yahoo! FP data)
  131. 131. Model fitting: EM for our class of models
  132. 132. The parameters for RLFM •  Latent parameters •  Hyper-parameters }){},{},{},({ jiji vuβα=Δ )IaAI,aAD,G,,( vvuu ===Θ b
  133. 133. Computing the mode Minimized
  134. 134. The EM algorithm
  135. 135. Computing the E-step •  Often hard to compute in closed form •  Stochastic EM (Markov Chain EM; MCEM) –  Compute expectation by drawing samples from –  Effective for multi-modal posteriors but more expensive •  Iterated Conditional Modes algorithm (ICM) –  Faster but biased hyper-parameter estimates
  136. 136. Monte Carlo E-step •  Through a vanilla Gibbs sampler (conditionals closed form) •  Other conditionals also Gaussian and closed form •  Conditionals of users (movies) sampled simultaneously •  Small number of samples in early iterations, large numbers in later iterations
  137. 137. M-step (Why MCEM is better than ICM) •  Update G, optimize •  Update Au=au I Ignored by ICM, underestimates factor variability Factors over-shrunk, posterior not explored well
  138. 138. Experiment 1: Better regularization •  MovieLens-100K, avg RMSE using pre-specified splits •  ZeroMean, RLFM and FeatureOnly (no cold-start issues) •  Covariates: –  Users : age, gender, zipcode (1st digit only) –  Movies: genres
  139. 139. Experiment 2: Better handling of Cold-start •  MovieLens-1M; EachMovie •  Training-test split based on timestamp •  Same covariates as in Experiment 1.
  140. 140. Experiment 4: Predicting click-rate on articles •  Goal: Predict click-rate on articles for a user on F1 position •  Article lifetimes short, dynamic updates important •  User covariates: –  Age, Gender, Geo, Browse behavior •  Article covariates –  Content Category, keywords •  2M ratings, 30K users, 4.5 K articles
  141. 141. Results on Y! FP data
  142. 142. Some other related approaches •  Stern, Herbrich and Graepel, WWW, 2009 –  Similar to RLFM, different parametrization and expectation propagation used to fit the models •  Porteus, Asuncion and Welling, AAAI, 2011 –  Non-parametric approach using a Dirichlet process •  Agarwal, Zhang and Mazumdar, Annals of Applied Statistics, 2011 –  Regression + random effects per user regularized through a Graphical Lasso
  143. 143. Add Topic Discovery into Matrix Factorization fLDA: Matrix Factorization through Latent Dirichlet Allocation
  144. 144. fLDA: Introduction •  Model the rating yij that user i gives to item j as the user’s affinity to the topics that the item has –  Unlike regular unsupervised LDA topic modeling, here the LDA topics are learnt in a supervised manner based on past rating data –  fLDA can be thought of as a “multi-task learning” version of the supervised LDA model [Blei’07] for cold-start recommendation ∑+= k jkikij zsy ... User i ’s affinity to topic k Pr(item j has topic k) estimated by averaging the LDA topic of each word in item j Old items: zjk’s are Item latent factors learnt from data with the LDA prior New items: zjk’s are predicted based on the bag of words in the items
  145. 145. Φ11,  …,  Φ1W                …   Φk1,  …,  ΦkW                …   ΦK1,  …,  ΦKW   Topic  1   Topic  k   Topic  K   LDA Topic Modeling (1) •  LDA is effective for unsupervised topic discovery [Blei’03] –  It models the generating process of a corpus of items (articles) –  For each topic k, draw a word distribution Φk = [Φk1, …, ΦkW] ~ Dir(η) –  For each item j, draw a topic distribution θj = [θj1, …, θjK] ~ Dir(λ) –  For each word, say the nth word, in item j, •  Draw a topic zjn for that word from θj = [θj1, …, θjK] •  Draw a word wjn from Φk = [Φk1, …, ΦkW] with topic k = zjn Item j Topic distribution: [θj1, …, θjK] Words: wj1, …, wjn, … Per-word topic: zj1, …, zjn, … Assume zjn = topic k Observed
  146. 146. LDA Topic Modeling (2) •  Model training: –  Estimate the prior parameters and the posterior topic×word distribution Φ based on a training corpus of items –  EM + Gibbs sampling is a popular method •  Inference for new items –  Compute the item topic distribution based on the prior parameters and Φ estimated in the training phase •  Supervised LDA [Blei’07] –  Predict a target value for each item based on supervised LDA topics ∑= k jkkj zsy Target value of item j Pr(item j has topic k) estimated by averaging the topic of each word in item j Regression weight for topic k ∑+= k jkikij zsy ...vs. One regression per user Same set of topics across different regressions
  147. 147. fLDA: Model Rating: ),(~ 2 σµijij Ny )(~ ijij Bernoulliy µ )(~ ijijij NPoissony µ Gaussian Model Logistic Model (for binary rating) Poisson Model (for counts) jkikkji t ijij zsbxt ∑+++= βαµ )( user i gives item j Bias of user i: ),0(~, 2 0 α αα σεεα Nxg iii t i += Popularity of item j: ),0(~, 2 0 β ββ σεεβ Nxd jjj t j += Topic affinity of user i: ),0(~, 2 INHxs s s i s iii σεε+= Pr(item j has topic k): )iteminwords#/()(1 jkzz jnnjk =∑= The LDA topic of the nth word in item j Observed words: ),,(~ jnjn zLDAw ηλ The nth word in item j
  148. 148. Model Fitting •  Given: –  Features X = {xi, xj, xij} –  Observed ratings y = {yij} and words w = {wjn} •  Estimate: –  Parameters: Θ = [b, g0, d0, H, σ2, aα, aβ, As, λ, η] •  Regression weights and prior parameters –  Latent factors: Δ = {αi, βj, si} and z = {zjn} •  User factors, item factors and per-word topic assignment •  Empirical Bayes approach: –  Maximum likelihood estimate of the parameters: –  The posterior distribution of the factors: ∫ ΔΘΔ=Θ=Θ ΘΘ dzdzwywy ]|,,,Pr[maxarg]|,Pr[maxargˆ ]ˆ,|,Pr[ ΘΔ yz
  149. 149. The EM Algorithm •  Iterate through the E and M steps until convergence – Let be the current estimate – E-step: Compute •  The expectation is not in closed form •  We draw Gibbs samples and compute the Monte Carlo mean – M-step: Find •  It consists of solving a number of regression and optimization problems )]|,,,Pr([log)( )ˆ,,|,( ΘΔ=Θ ΘΔ zwyEf n wyz )(maxargˆ )1( Θ=Θ Θ + fn )(ˆ n Θ
  150. 150. Supervised Topic Assignment ( ) ∏ =⋅+ + + ∝ = ¬ ¬ ¬ ji jnij jn jkjn k jn kl jn kzyfZ WZ Z kz rated )|( )Rest|Pr( λ η η Same as unsupervised LDA Likelihood of observed ratings by users who rated item j when zjn is set to topic k Probability of observing yij given the model The topic of the nth word in item j
  151. 151. fLDA: Experimental Results (Movie) •  Task: Predict the rating that a user would give a movie •  Training/test split: –  Sort observations by time –  First 75% → Training data –  Last 25% → Test data •  Item warm-start scenario –  Only 2% new items in test data Model Test RMSE RLFM 0.9363 fLDA 0.9381 Factor-Only 0.9422 FilterBot 0.9517 unsup-LDA 0.9520 MostPopular 0.9726 Feature-Only 1.0906 Constant 1.1190 fLDA is as strong as the best method It does not reduce the performance in warm-start scenarios
  152. 152. fLDA: Experimental Results (Yahoo! Buzz) •  Task: Predict whether a user would buzz-up an article •  Severe item cold-start –  All items are new in test data Data Statistics 1.2M observations 4K users 10K articles fLDA significantly outperforms other models
  153. 153. Experimental Results: Buzzing Topics Top Terms (after stemming) Topic bush, tortur, interrog, terror, administr, CIA, offici, suspect, releas, investig, georg, memo, al CIA interrogation mexico, flu, pirat, swine, drug, ship, somali, border, mexican, hostag, offici, somalia, captain Swine flu NFL, player, team, suleman, game, nadya, star, high, octuplet, nadya_suleman, michael, week NFL games court, gai, marriag, suprem, right, judg, rule, sex, pope, supreme_court, appeal, ban, legal, allow Gay marriage palin, republican, parti, obama, limbaugh, sarah, rush, gop, presid, sarah_palin, sai, gov, alaska Sarah Palin idol, american, night, star, look, michel, win, dress, susan, danc, judg, boyl, michelle_obama American idol economi, recess, job, percent, econom, bank, expect, rate, jobless, year, unemploy, month Recession north, korea, china, north_korea, launch, nuclear, rocket, missil, south, said, russia North Korea issues 3/4 topics are interpretable; 1/2 are similar to unsupervised topics
  154. 154. fLDA Summary •  fLDA is a useful model for cold-start item recommendation •  It also provides interpretable recommendations for users –  User’s preference to interpretable LDA topics •  Future directions: –  Investigate Gibbs sampling chains and the convergence properties of the EM algorithm –  Apply fLDA to other multi-task prediction problems •  fLDA can be used as a tool to generate supervised features (topics) from text data
  155. 155. Summary •  Regularizing factors through covariates effective •  Regression based factor model that regularizes better and deals with both cold-start and warm-start in a single framework in a seamless way looks attractive •  Fitting method scalable; Gibbs sampling for users and movies can be done in parallel. Regressions in M-step can be done with any off-the-shelf scalable linear regression routine •  Distributed computing on Hadoop: Multiple models and average across partitions (more later)
  156. 156. Online  Components:   Online  Models,  Intelligent   Ini$aliza$on,  Explore  /  Exploit  
  157. 157. Why Online Components? •  Cold start –  New items or new users come to the system –  How to obtain data for new items/users (explore/exploit) –  Once data becomes available, how to quickly update the model •  Periodic rebuild (e.g., daily): Expensive •  Continuous online update (e.g., every minute): Cheap •  Concept drift –  Item popularity, user interest, mood, and user-to-item affinity may change over time –  How to track the most recent behavior •  Down-weight old data –  How to model temporal patterns for better prediction •  … may not need to be online if the patterns are stationary
  158. 158. Big Picture Most Popular Recommendation Personalized Recommendation Offline Models Collaborative filtering (cold-start problem) Online Models Real systems are dynamic Time-series models Incremental CF, online regression Intelligent Initialization Do not start cold Prior estimation Prior estimation, dimension reduction Explore/Exploit Actively acquire data Multi-armed bandits Bandits with covariates Segmented  Most     Popular  Recommenda$on   Extension:  
  159. 159. Online  Components  for     Most  Popular  Recommenda$on   Online  models,  intelligent  ini$aliza$on  &   explore/exploit  
  160. 160. Most popular recommendation: Outline •  Most popular recommendation (no personalization, all users see the same thing) –  Time-series models (online models) –  Prior estimation (initialization) –  Multi-armed bandits (explore/exploit) –  Sometimes hard to beat!! •  Segmented most popular recommendation –  Create user segments/clusters based on user features –  Do most popular recommendation for each segment
  161. 161. Most Popular Recommendation •  Problem definition: Pick k items (articles) from a pool of N to maximize the total number of clicks on the picked items •  Easy!? Pick the items having the highest click- through rates (CTRs) •  But … –  The system is highly dynamic: •  Items come and go with short lifetimes •  CTR of each item changes over time –  How much traffic should be allocated to explore new items to achieve optimal performance •  Too little → Unreliable CTR estimates •  Too much → Little traffic to exploit the high CTR items
  162. 162. CTR Curves for Two Days on Yahoo! Front Page Traffic  obtained  from  a  controlled  randomized  experiment  (no  confounding)   Things  to  note:        (a)  Short  life$mes,  (b)  temporal  effects,  (c)  oden  breaking  news  stories   Each  curve  is  the  CTR  of  an  item  in  the  Today  Module  on  www.yahoo.com  over  $me  
  163. 163. For Simplicity, Assume … •  Pick only one item for each user visit –  Multi-slot optimization later •  No user segmentation, no personalization (discussion later) •  The pool of candidate items is predetermined and is relatively small (≤ 1000) –  E.g., selected by human editors or by a first-phase filtering method –  Ideally, there should be a feedback loop –  Large item pool problem later •  Effects like user-fatigue, diversity in recommendations, multi-objective optimization not considered (discussion later)
  164. 164. Online Models •  How to track the changing CTR of an item •  Data: for each item, at time t, we observe –  Number of times the item nt was displayed (i.e., #views) –  Number of clicks ct on the item •  Problem Definition: Given c1, n1, …, ct, nt, predict the CTR (click-through rate) pt+1 at time t+1 •  Potential solutions: –  Observed CTR at t: ct / nt → highly unstable (nt is usually small) –  Cumulative CTR: (∑all i ci) / (∑all i ni) → react to changes very slowly –  Moving window CTR: (∑i∈last K ci) / (∑i∈last K ni) → reasonable •  But, no estimation of Var[pt+1] (useful for explore/exploit)
  165. 165. Online Models: Dynamic Gamma-Poisson •  Model-based approach –  (ct | nt, pt) ~ Poisson(nt pt) –  pt = pt-1 εt, where εt ~ Gamma(mean=1, var=η) –  Model parameters: •  p1 ~ Gamma(mean=µ0, var=σ0 2) is the offline CTR estimate •  η specifies how dynamic/smooth the CTR is over time –  Posterior distribution (pt+1 | c1, n1, …, ct, nt) ~ Gamma(?,?) •  Solve this recursively (online update rule)    Show  the  item  nt  $mes      Receive  ct  clicks      pt  =  CTR  at  $me  t   Nota$on:   p1  µ0,  σ0 2   p2   … n1   c1   n2   c2   η  
  166. 166. Online Models: Derivation size)sample(effective/Let ),(~),,...,,|( 2 2 1111 ttt ttttt varmeanGammancncp σµγ σµ = ==−− )( ),(~),,...,,|( 2 | 2 | 2 | 2 1 |1 2 11111 ttttttt ttt ttttt varmeanGammancncp σµησσ µµ σµ ++= = == + + +++ tttttt ttttttt tttt ttttttt c n varmeanGammancncp || 2 | || | 2 ||11 / /)( size)sample(effectiveLet ),(~),,...,,|( γµσ γγµµ γγ σµ = +⋅= += == High  CTR  items  more  adap$ve   Es$mated  CTR   distribu$on   at  $me  t   Es$mated  CTR     distribu$on   at  $me  t+1  
  167. 167. Tracking behavior of Gamma- Poisson model •  Low click rate articles – More temporal smoothing
  168. 168. Intelligent Initialization: Prior Estimation •  Prior CTR distribution: Gamma(mean=µ0, var=σ0 2) –  N historical items: •  ni = #views of item i in its first time interval •  ci = #clicks on item i in its first time interval –  Model •  ci ~ Poisson(ni pi) and pi ~ Gamma(µ0, σ0 2) ⇒ ci ~ NegBinomial(µ0, σ0 2, ni) –  Maximum likelihood estimate (MLE) of (µ0, σ0 2) •  Better prior: Cluster items and find MLE for each cluster –  Agarwal & Chen, 2011 (SIGMOD) ∑ ⎟ ⎠ ⎞ ⎜ ⎝ ⎛ +⎟ ⎠ ⎞⎜ ⎝ ⎛ +−⎟ ⎠ ⎞⎜ ⎝ ⎛ +Γ+⎟ ⎠ ⎞⎜ ⎝ ⎛Γ− i iii nccNN 2 0 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 0 2 0 2 0 2 00 loglogloglogmaxarg , σ µ σ µ σ µ σ µ σ µ σ µ σµ
  169. 169. Explore/Exploit: Problem Definition $me   Item  1   Item  2   …   Item  K   x1%  page  views   x2%  page  views   …   xK%  page  views   Determine  (x1,  x2,  …,  xK)  based  on  clicks  and  views  observed  before  t  in   order  to  maximize  the  expected  total  number  of  clicks  in  the  future   t  –1    t  –2     t   now   clicks  in  the  future  
  170. 170. Modeling the Uncertainty, NOT just the Mean Simplified  semng:  Two  items   CTR Probabilitydensity Item A Item B We  know  the  CTR  of  Item  A  (say,  shown  1  million  $mes)     We  are  uncertain  about  the  CTR  of  Item  B  (only  100  $mes)   If  we  only  make  a  single  decision,   give  100%  page  views  to  Item  A     If  we  make  mul$ple  decisions  in  the  future   explore  Item  B  since  its  CTR  can  poten$ally   be  higher   ∫ > ⋅−= qp dppfqp )()(Potential CTR of item A is q CTR of item B is p Probability density function of item B’s CTR is f(p)
  171. 171. Multi-Armed Bandits: Introduction (1) Bandit “arms” p1 p2 p3 (unknown payoff probabilities) “Pulling” arm i yields a reward: reward = 1 with probability pi (success) reward = 0 otherwise (failure) For  now,  we  are  aaacking  the  problem  of  choosing  best  ar$cle/arm  for  all  users  
  172. 172. Multi-Armed Bandits: Introduction (2) Bandit “arms” p1 p2 p3 (unknown payoff probabilities) Goal:  Pull  arms  sequen$ally  to  maximize  the  total  reward   Bandit  scheme/policy:  Sequen$al  algorithm  to  play  arms  (items)   Regret  of  a  scheme  =  Expected  loss  rela$ve  to  the  “oracle”  op-mal  scheme   that  always  plays  the  best  arm   –  “best”  means  highest  success  probability   –  But,  the  best  arm  is  not  known  …  unless  you  have  an  oracle   –  Regret  is  the  price  of  explora$on   –  Low  regret  implies  quick  convergence  to  the  best  
  173. 173. Multi-Armed Bandits: Introduction (3) •  Bayesian approach –  Seeks to find the Bayes optimal solution to a Markov decision process (MDP) with assumptions about probability distributions –  Representative work: Gittins’ index, Whittle’s index –  Very computationally intensive •  Minimax approach –  Seeks to find a scheme that incurs bounded regret (with no or mild assumptions about probability distributions) –  Representative work: UCB by Lai, Auer –  Usually, computationally easy –  But, they tend to explore too much in practice (probably because the bounds are based on worse-case analysis) Skip  details
  174. 174. Multi-Armed Bandits: Markov Decision Process (1) •  Select an arm now at time t=0, to maximize expected total number of clicks in t=0,…,T •  State at time t: Θt = (θ1t, …, θKt) –  θit = State of arm i at time t (that captures all we know about arm i at t) •  Reward function Ri(Θt, Θt+1) –  Reward of pulling arm i that brings the state from Θt to Θt+1 •  Transition probability Pr[Θt+1 | Θt, pulling arm i ] •  Policy π: A function that maps a state to an arm (action) –  π(Θt) returns an arm (to pull) •  Value of policy π starting from the current state Θ0 with horizon T [ ]),(),(),( 1110)(0 0 ΘΘΘΘ Θ ππ π −+= TT VREV [ ] [ ]∫ −+⋅= 11110)(001 ),(),()(,|Pr 0 ΘΘΘΘΘΘΘ Θ dVR T ππ π Immediate  reward   Value  of  the  remaining  T-­‐1  $me  slots                                                  if  we  start  from  state  Θ1  
  175. 175. Multi-Armed Bandits: MDP (2) •  Optimal policy: •  Things to notice: –  Value is defined recursively (actually T high-dim integrals) –  Dynamic programming can be used to find the optimal policy –  But, just evaluating the value of a fixed policy can be very expensive •  Bandit Problem: The pull of one arm does not change the state of other arms and the set of arms do not change over time [ ]),(),(),( 1110)(0 0 ΘΘΘΘ Θ ππ π −+= TT VREV [ ] [ ]∫ −+⋅= 11110)(001 ),(),()(,|Pr 0 ΘΘΘΘΘΘΘ Θ dVR T ππ π Immediate  reward   Value  of  the  remaining  T-­‐1  $me  slots                                                  if  we  start  from  state  Θ1   ),(maxarg 0Θπ π TV
  176. 176. Multi-Armed Bandits: MDP (3) •  Which arm should be pulled next? –  Not necessarily what looks best right now, since it might have had a few lucky successes –  Looks like it will be a function of successes and failures of all arms •  Consider a slightly different problem setting –  Infinite time horizon, but –  Future rewards are geometrically discounted Rtotal = R(0) + γ.R(1) + γ2.R(2) + … (0<γ<1) •  Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently Policy  π(Θt)  is  a  func$on  of  (θ1t,  …,  θKt)   Policy  π(Θt)  =  argmaxi  {  g(θit)  }   One  K-­‐dimensional  problem   K  one-­‐dimensional  problems   S$ll  computa$onally  expensive!!   Gimns’  Index  
  177. 177. Multi-Armed Bandits: MDP (4) Bandit Policy 1.  Compute the priority (Gittins’ index) of each arm based on its state 2.  Pull arm with max priority, and observe reward 3.  Update the state of the pulled arm Priority   1   Priority   2   Priority   3  
  178. 178. Multi-Armed Bandits: MDP (5) •  Theorem [Gittins 1979]: The optimal policy decouples and solves a bandit problem for each arm independently –  Many proofs and different interpretations of Gittins’ index exist •  The index of an arm is the fixed charge per pull for a game with two options, whether to pull the arm or not, so that the charge makes the optimal play of the game have zero net reward –  Significantly reduces the dimension of the problem space –  But, Gittins’ index g(θit) is still hard to compute •  For the Gamma-Poisson or Beta-Binomial models θit = (#successes, #pulls) for arm i up to time t •  g maps each possible (#successes, #pulls) pair to a number –  Approximate methods are used in practice –  Lai et al. have derived these for exponential family distributions
  179. 179. Multi-Armed Bandits: Minimax Approach (1) •  Compute the priority of each arm i in a way that the regret is bounded –  Lowest regret in the worst case •  One common policy is UCB1 [Auer 2002] Number of successes of arm i Number of pulls of arm i Total number of pulls of all arms Observed success rate Factor representing uncertainty ii i i n n n c log2 Priority ⋅ +=
  180. 180. Multi-Armed Bandits: Minimax Approach (2) •  As total observations n becomes large: –  Observed payoff tends asymptotically towards the true payoff probability –  The system never completely “converges” to one best arm; only the rate of exploration tends to zero Observed payoff Factor representing uncertainty ii i i n n n c log2 Priority ⋅ +=
  181. 181. Multi-Armed Bandits: Minimax Approach (3) •  Sub-optimal arms are pulled O(log n) times •  Hence, UCB1 has O(log n) regret •  This is the lowest possible regret (but the constants matter J) •  E.g. Regret after n plays is bounded by Observed payoff Factor representing uncertainty ii i i n n n c log2 Priority ⋅ += ibesti K j j i ibesti n µµ π µµ −=Δ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ Δ⋅⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ ++⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ Δ ∑∑ =< where, 3 1 ln 8 1 2 :
  182. 182. •  Classical multi-armed bandits –  A fixed set of arms with fixed rewards –  Observe the reward before the next pull •  Bayesian approach (Markov decision process) –  Gittins’ index [Gittins 1979]: Bayes optimal for classical bandits •  Pull the arm currently having the highest index value –  Whittle’s index [Whittle 1988]: Extension to a changing reward function –  Computationally intensive •  Minimax approach (providing guaranteed regret bounds) –  UCB1 [Auer 2002]: Upper bound of a model agnostic confidence interval •  Index of arm i = •  Heuristics –  ε-Greedy: Random exploration using fraction ε of traffic –  Softmax: Pick arm i with probability –  Posterior draw: Index = drawing from posterior CTR distribution of an arm ∑j j i }/ˆexp{ }/ˆexp{ τµ τµ Classical Multi-Armed Bandits: Summary ii itemofCTRpredictedˆ =µ iii nnnc log2⋅+ retemperatu=τ
  183. 183. Do Classical Bandits Apply to Web Recommenders? Traffic  obtained  from  a  controlled  randomized  experiment  (no  confounding)   Things  to  note:        (a)  Short  life$mes,  (b)  temporal  effects,  (c)  oden  breaking  news  stories   Each  curve  is  the  CTR  of  an  item  in  the  Today  Module  on  www.yahoo.com  over  $me  
  184. 184. Characteristics of Real Recommender Systems•  Dynamic set of items (arms) –  Items come and go with short lifetimes (e.g., a day) –  Asymptotically optimal policies may fail to achieve good performance when item lifetimes are short •  Non-stationary CTR –  CTR of an item can change dramatically over time •  Different user populations at different times •  Same user behaves differently at different times (e.g., morning, lunch time, at work, in the evening, etc.) •  Attention to breaking news stories decays over time •  Batch serving for scalability –  Making a decision and updating the model for each user visit in real time is expensive –  Batch serving is more feasible: Create time slots (e.g., 5 min); for each slot, decide the fraction xi of the visits in the slot to give to item i [Agarwal  et  al.,  ICDM,  2009]  
  185. 185. Explore/Exploit in Recommender Systems $me   Item  1   Item  2   …   Item  K   x1%  page  views   x2%  page  views   …   xK%  page  views   Determine  (x1,  x2,  …,  xK)  based  on  clicks  and  views  observed  before  t  in   order  to  maximize  the  expected  total  number  of  clicks  in  the  future   t  –1    t  –2     t   now   clicks  in  the  future   Let’s  solve  this  from  first  principle  
  186. 186. Bayesian Solution: Two Items, Two Time Slots (1) •  Two time slots: t = 0 and t = 1 –  Item P: We are uncertain about its CTR, p0 at t = 0 and p1 at t = 1 –  Item Q: We know its CTR exactly, q0 at t = 0 and q1 at t = 1 •  To determine x, we need to estimate what would happen in the future Question: What fraction x of N0 views to item P (1-x) to item Q t=0 t=1 Now time N0 views N1 views End   Obtain  c  clicks  ader  serving  x   (not  yet  observed;  random  variable)    Assume  we  observe  c;  we  can  update  p1   CTR density Item Q Item P q1   p1(x,c) CTR density Item Q Item P q0   p0  If  x  and  c  are  given,  op$mal  solu$on:   Give  all  views  to  Item  P  iff    E[  p1  I  x,  c  ]  >  q1   ),(ˆ1 cxp ),(ˆ1 cxp
  187. 187. •  Expected total number of clicks in the two time slots }]),,(ˆ[max{)1(ˆ 1110000 qcxpENqxNpxN c+−+ Gain(x, q0, q1) = Expected number of additional clicks if we explore the uncertain item P with fraction x of views in slot 0, compared to a scheme that only shows the certain item Q in both slots Solution: argmaxx Gain(x, q0, q1) Bayesian Solution: Two Items, Two Time Slots (2) }]0,),(ˆ[max{)ˆ( 1110001100 qcxpENqpxNqNqN c −+−++= E[#clicks] at t = 0 E[#clicks] at t = 1 Item P Item Q Show  the  item  with  higher  E[CTR]:   }),,(ˆmax{ 11 qcxp E[#clicks] if we always show item Q Gain(x, q0, q1) Gain of exploring the uncertain item P using x
  188. 188. •  Approximate by the normal distribution –  Reasonable approximation because of the central limit theorem •  Proposition: Using the approximation, the Bayes optimal solution x can be found in time O(log N0) ),(ˆ1 cxp ⎥ ⎥ ⎦ ⎤ ⎢ ⎢ ⎣ ⎡ −⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − Φ−+⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − ⋅+−= )ˆ( )( ˆ 1 )( ˆ )()ˆ(),,( 11 1 11 1 11 1100010 qp x pq x pq xNqpxNqqxGain σσ φσ )1()()( )],(ˆ[)( 2 0 0 1 2 1 baba ab xNba xN cxpVarx +++++ ==σ )/()],(ˆ[ˆ 11 baacxpEp c +== ),(~ofPrior 1 baBetap Bayesian Solution: Two Items, Two Time Slots (3)
  189. 189. Bayesian Solution: Two Items, Two Time Slots (4) •  Quiz: Is it correct that the more we are uncertain about the CTR of an item, the more we should explore the item? Uncertainty:  Low  Uncertainty:  High   Different   curves  are  for   different  prior   mean  semngs   (Frac$on  of  views  to  give  to  the  item)