ENAR short course

45 %
55 %
Information about ENAR short course

Published on March 15, 2014

Author: dipu1025

Source: slideshare.net

Sta$s$cal  Compu$ng     For  Big  Data   Deepak  Agarwal   LinkedIn  Applied  Relevance  Science   dagarwal@linkedin.com   ENAR  2014,  Bal$more,  USA  

Main  Collaborators:  several  others  at  both  Y!   and  LinkedIn   •  I  won’t  be  here  without  them,  extremely  lucky  to  work  with  such  talented   individuals   Bee-Chung Chen Liang Zhang Bo Long Jonathan Traupman Paul Ogilvie

Structure  of  This  Tutorial   •  Part  I:  Introduc$on  to  Map-­‐Reduce  and  the   Hadoop  System   –  Overview  of  Distributed  Compu$ng   –  Introduc$on  to  Map-­‐Reduce   –  Some  sta$s$cal  computa$ons  using  Map-­‐Reduce   •  Bootstrap,  Logis$c  Regression   •  Part  II:  Recommender  Systems  for  Web   Applica$ons   –  Introduc$on   –  Content  Recommenda$on   –  Online  Adver$sing  

Big  Data  becoming  Ubiquitous   •  Bioinforma$cs   •  Astronomy   •  Internet   •  Telecommunica$ons   •  Climatology   •  …    

Big  Data:  Some  size  es$mates   •  1000  human  genomes:  >  100TB  of  data  (1000   genomes  project)   •  Sloan  Digital  Sky  Survey:  200GB  data  per  night   (>140TB  aggregated)   •  Facebook:  A  billion  monthly  ac$ve  users   •  LinkedIn:      roughly  >  280M  members  worldwide   •  Twiaer:  >  500  million  tweets  a  day   •  Over  6  billion  mobile  phones  in  the  world   genera$ng  data  everyday  

Big  Data:  Paradigm  shid   •  Classical  Sta$s$cs   –  Generalize  using  small  data     •  Paradigm  Shid  with  Big  Data   –  We  now  have  an  almost  infinite  supply  of  data   –  Easy  Sta$s$cs  ?  Just  appeal  to  asympto$c  theory?   •  So  the  issue  is  mostly  computa$onal?   –  Not  quite   •  More  data  comes  with  more  heterogeneity   •  Need  to  change  our  sta$s$cal  thinking  to  adapt   –  Classical  sta$s$cs  s$ll  invaluable  to  think  about  big  data  analy$cs    

Some  Sta$s$cal  Challenges   •  Exploratory  Analysis  (EDA),  Visualiza$on   – Retrospec$ve  (on  Terabytes)   – More  Real  Time    (streaming  computa$ons  every   few  minutes/hours)   •  Sta$s$cal  Modeling   – Scale  (computa$onal  challenge)   – Curse  of  dimensionality     •  Millions  of  predictors,  heterogeneity   – Temporal  and  Spa$al  correla$ons  

Sta$s$cal  Challenges  con$nued   •  Experiments   – To  test  new  methods,  test  hypothesis  from   randomized  experiments   – Adap$ve  experiments     •  Forecas$ng     – Planning,  adver$sing   •  Many  more  I  are  not  fully  well  versed  in        

Defining  Big  Data     •  How  to  know  you  have  the  big  data  problem?   – Is  it  only  the  number  of  terabytes  ?   – What  about  dimensionality,  structured/ unstructured,  computa$ons  required,…   •  No  clear  defini$on,  different  point  of  views   – When  desired  computa$on  cannot  be  completed   in  the  s$pulated  $me  with  current  best  algorithm   using  cores  available  on  a  commodity  PC      

 Distributed  Compu$ng  for  Big  Data       •  Distributed  compu$ng  invaluable  tool  to  scale   computa$ons  for  big  data   •  Some  distributed  compu$ng  models   – Mul$-­‐threading   – Graphics  Processing  Units  (GPU)   – Message  Passing  Interface  (MPI)   – Map-­‐Reduce  

Evalua$ng  a  method  for  a  problem   •  Scalability   –  Process  X  GB  in  Y  hours   •  Ease  of  use  for  a  sta$s$cian     •  Reliability  (fault  tolerance)   –  Especially  in  an  industrial  environment   •  Cost   –  Hardware  and  cost  of  maintaining   •  Good  for  the  computa$ons  required?   –  E.g.,  Itera$ve  versus  one  pass   •  Resource  sharing  

Mul$threading   •  Mul$ple  threads  take  advantage  of  mul$ple   CPUs   •  Shared  memory   •  Threads  can  execute  independently  and   concurrently   •  Can  only  handle  Gigabytes  of  data   •  Reliable  

Graphics  Processing  Units  (GPU)   •  Number  of  cores:   –  CPU:  Order  of  10   –  GPU:  smaller  cores   •  Order  of  1000   •  Can  be  >100x  faster  than  CPU   –  Parallel  computa$onally  intensive  tasks  off-­‐loaded  to  GPU   •  Good  for  certain  computa$onally-­‐intensive  tasks   •  Can  only  handle  Gigabytes  of  data   •  Not  trivial  to  use,  requires  good  understanding  of  low-­‐level  architecture   for  efficient  use   –  But  things  changing,  it  is  gemng  more  user  friendly  

Message  Passing  Interface  (MPI)   •  Language  independent  communica$on   protocol  among  processes  (e.g.  computers)   •  Most  suitable  for  master/slave  model   •  Can  handle  Terabytes  of  data   •  Good  for  itera$ve  processing   •  Fault  tolerance  is  low  

Map-­‐Reduce  (Dean  &  Ghemawat,   2004)   Mappers   Reducers   Data   Output   •  Computa$on  split  to  Map   (scaaer)  and  Reduce  (gather)   stages   •  Easy  to  Use:     –  User  needs  to  implement  two   func$ons:  Mapper  and   Reducer   •  Easily  handles  Terabytes  of   data   •  Very  good  fault  tolerance   (failed  tasks  automa$cally   get  restarted)  

Comparison  of  Distributed  Compu$ng  Methods   Mul$threading   GPU   MPI   Map-­‐Reduce   Scalability  (data   size)   Gigabytes   Gigabytes   Terabytes   Terabytes   Fault  Tolerance   High   High   Low   High   Maintenance  Cost   Low   Medium   Medium   Medium-­‐High   Itera$ve  Process   Complexity   Cheap   Cheap   Cheap   Usually   expensive   Resource  Sharing   Hard   Hard   Easy   Easy   Easy  to  Implement?   Easy   Needs   understanding   of  low-­‐level  GPU   architecture   Easy   Easy  

Example  Problem   •  Tabula$ng  word  counts  in  corpus  of   documents   •  Similar  to  table  func$on  in  R  

Word  Count  Through  Map-­‐Reduce   Hello  World   Bye  World     Hello  Hadoop   Goodbye  Hadoop   Mapper  1   <Hello,  1>   <Hadoop,  1>   <Goodbye,  1>   <Hadoop,1>   <Hello,  1>   <World,  1>   <Bye,  1>   <World,1>   Mapper  2   Reducer  1   Words  from  A-­‐G   Reducer  2   Words  from  H-­‐Z   <Bye,  1>   <Goodbye,  1>   <Hello,  2>   <World,  2>   <Hadoop,  2>  

Key  Ideas  about  Map-­‐Reduce   Big  Data   Par$$on  1   Par$$on  2   …   Par$$on  N   Mapper  1   Mapper  2   …   Mapper  N   <Key,  Value>   <Key,  Value>  <Key,  Value>  <Key,  Value>   Reducer  1   Reducer  2   Reducer  M  …   Output  1   Output  1  Output  1  Output  1  

Key  Ideas  about  Map-­‐Reduce   •  Data  are  split  into  par$$ons  and  stored  in  many   different  machines  on  disk  (distributed  storage)   •  Mappers  process  data  chunks  independently    and   emit  <Key,  Value>  pairs   •  Data  with  the  same  key  are  sent  to  the  same   reducer.  One  reducer  can  receive  mul$ple  keys   •  Every  reducer  sorts  its  data  by  key   •  For  each  key,  the  reducer  processes  the  values   corresponding  to  the  key  according  to  the   customized  reducer  func$on  and  output  

Compute  Mean  for  Each  Group   ID   Group  No.   Score   1   1   0.5   2   3   1.0   3   1   0.8   4   2   0.7   5   2   1.5   6   3   1.2   7   1   0.8   8   2   0.9   9   4   1.3   …   …   …  

Key  Ideas  about  Map-­‐Reduce   •  Data  are  split  into  par$$ons  and  stored  in  many  different  machines  on   disk  (distributed  storage)   •  Mappers  process  data  chunks  independently    and  emit  <Key,  Value>  pairs   –  For  each  row:   •  Key  =  Group  No.   •  Value  =  Score   •  Data  with  the  same  key  are  sent  to  the  same  reducer.  One  reducer  can   receive  mul$ple  keys   –  E.g.  2  reducers   –  Reducer  1  receives  data  with  key  =  1,  2   –  Reducer  2  receives  data  with  key  =  3,  4   •  Every  reducer  sorts  its  data  by  key   –  E.g.  Reducer  1:  <key  =  1,  values=[0.5,  0.8,  0.8]>,  <key=2,  values=<0.7,  1.5,  0.9>   •  For  each  key,  the  reducer  processes  the  values  corresponding  to  the  key   according  to  the  customized  reducer  func$on  and  output   –  E.g.  Reducer  1  output:  <1,  mean(0.5,  0.8,  0.8)>,  <2,  mean(0.7,  1.5,  0.9)>  

Key  Ideas  about  Map-­‐Reduce   •  Data  are  split  into  par$$ons  and  stored  in  many  different  machines  on   disk  (distributed  storage)   •  Mappers  process  data  chunks  independently    and  emit  <Key,  Value>  pairs   –  For  each  row:   •  Key  =  Group  No.   •  Value  =  Score   •  Data  with  the  same  key  are  sent  to  the  same  reducer.  One  reducer  can   receive  mul$ple  keys   –  E.g.  2  reducers   –  Reducer  1  receives  data  with  key  =  1,  2   –  Reducer  2  receives  data  with  key  =  3,  4   •  Every  reducer  sorts  its  data  by  key   –  E.g.  Reducer  1:  <key  =  1,  values=[0.5,  0.8,  0.8]>,  <key=2,  values=<0.7,  1.5,  0.9>   •  For  each  key,  the  reducer  processes  the  values  corresponding  to  the  key   according  to  the  customized  reducer  func$on  and  output   –  E.g.  Reducer  1  output:  <1,  mean(0.5,  0.8,  0.8)>,  <2,  mean(0.7,  1.5,  0.9)>   What  you  need   to  implement  

Mapper:   Input:  Data   for  (row  in  Data)   {    groupNo  =  row$groupNo;    score  =  row$score;    Output(c(groupNo,  score));   }   Reducer:   Input:  Key  (groupNo),  List  Value  (a  list  of  scores  that  belong  to  the  Key)   count  =  0;   sum  =  0;   for  (v  in  Value)   {    sum  +=  v;    count++;   }   Output(c(Key,  sum/count));   Pseudo  Code  (in  R)  

Exercise  1   •  Problem:  Average  height  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   •  What  should  be  the  mapper  output  value?   •  What  are  the  reducer  input?   •  What  are  the  reducer  output?   •  Write  mapper  and  reducer  for  this?   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  

•  Problem:  Average  height  per  Grade  and  Gender?   •  What  should  be  the  mapper  output  key?   –  {Grade,  Gender}   •  What  should  be  the  mapper  output  value?   –  Height   •  What  are  the  reducer  input?   –  Key:  {Grade,  Gender},  Value:  List  of  Heights     •  What  are  the  reducer  output?   –  {Grade,  Gender,  mean(Heights)}   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  

Exercise  2   •  Problem:  Number  of  students  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   •  What  should  be  the  mapper  output  value?   •  What  are  the  reducer  input?   •  What  are  the  reducer  output?   •  Write  mapper  and  reducer  for  this?   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  

•  Problem:  Number  of  students  per  {Grade,  Gender}?   •  What  should  be  the  mapper  output  key?   –  {Grade,  Gender}   •  What  should  be  the  mapper  output  value?   –  1   •  What  are  the  reducer  input?   –  Key:  {Grade,  Gender},  Value:  List  of  1’s   •  What  are  the  reducer  output?   –  {Grade,  Gender,  sum(value  list)}   –  OR:  {Grade,  Gender,  length(value  list)}   Student  ID   Grade   Gender   Height  (cm)   1   3   M   120   2   2   F   115   3   2   M   116   …   …   …  

More  on  Map-­‐Reduce   •  Depends  on  distributed  file  systems   •  Typically  mappers  are  the  data  storage  nodes   •  Map/Reduce  tasks  automa$cally  get  restarted   when  they  fail  (good  fault  tolerance)   •  Map  and  Reduce  I/O  are  all  on  disk   –  Data  transmission  from  mappers  to  reducers  are   through  disk  copy   •  Itera$ve  process  through  Map-­‐Reduce   –  Each  itera$on  becomes  a  map-­‐reduce  job   –  Can  be  expensive  since  map-­‐reduce  overhead  is  high  

The  Apache  Hadoop  System   •  An  open-­‐source  sodware  for  reliable,  scalable,   distributed  compu$ng     •  The  most  popular  distributed  compu$ng   system  in  the  world   •  Key  modules:   – Hadoop  Distributed  File  System  (HDFS)   – Hadoop  YARN  (job  scheduling  and  cluster   resource  management)   – Hadoop  MapReduce  

Major  Tools  on  Hadoop   •  Pig   –  A  high-­‐level  language  for  Map-­‐Reduce  computa$on   •  Hive   –  A  SQL-­‐like  query  language  for  data  querying  via  Map-­‐Reduce   •  Hbase   –  A  distributed  &  scalable  database  on  Hadoop   –  Allows  random,  real  $me  read/write  access  to  big  data   –  Voldemort  is  similar  to  Hbase   •  Mahout   –  A  scalable  machine  learning  library   •  …  

Hadoop  Installa$on   •  Semng  up  Hadoop  on  your  desktop/laptop:   – hap://hadoop.apache.org/docs/stable/ single_node_setup.html   •  Semng  up  Hadoop  on  a  cluster  of  machines   – hap://hadoop.apache.org/docs/stable/ cluster_setup.html  

Hadoop  Distributed  File  System  (HDFS)   •  Master/Slave  architecture   •  NameNode:  a  single  master  node  that  controls  which   data  block  is  stored  where.     •  DataNodes:  slave  nodes  that  store  data  and  do  R/W   opera$ons   •  Clients  (Gateway):  Allow  users  to  login  and  interact   with  HDFS  and  submit  Map-­‐Reduce  jobs   •  Big  data  is  split  to  equal-­‐sized  blocks,  each  block  can  be   stored  in  different  DataNodes   •  Disk  failure  tolerance:  data  is  replicated  mul$ple  $mes  

Load  the  Data  into  Pig   •  A  =  LOAD  ‘Sample-­‐1.dat'  USING  PigStorage()  AS   (ID  :  int,  groupNo:  int,  score:  float);     –  The  path  of  the  data  on  HDFS  ader  LOAD     •  USING  PigStorage()  means  delimit  the  data  by  tab   (can  be  omiaed)   •  If  data  are  delimited  by  other  characters,  e.g.   space,  use  USING  PigStorage(‘  ‘)     •  Data  schema  defined  ader  AS     •  Variable  types:  int,  long,  float,  double,  chararray,   …  

Structure  of  This  Tutorial   •  Part  I:  Introduc$on  to  Map-­‐Reduce  and  the   Hadoop  System   – Overview  of  Distributed  Compu$ng   – Introduc$on  to  Map-­‐Reduce   – Introduc$on  to  the  Hadoop  System   –   Examples  of  Sta$s$cal  Compu$ng  for  Big  Data   •  Bag  of  Liale  Bootstraps   •  Large  Scale  Logis$c  Regression  

Bag  of  Liale  Bootstraps   Kleiner  et  al.  2012  

Bootstrap  (Efron,  1979)   •  A  re-­‐sampling  based  method  to  obtain  sta$s$cal   distribu$on  of  sample  es$mators   •  Why  are  we  interested  ?   –  Re-­‐sampling  is  embarrassingly  parallelizable     •  For  example:   –  Standard  devia$on  of  the  mean  of  N  samples  (μ)   –  For  i  =  1  to  r  do     •  Randomly  sample  with  replacement  N  $mes  from  the  original   sample  -­‐>  bootstrap  data  i   •  Compute  mean  of  the  i-­‐th  bootstrap  data  -­‐>  μi   –  Es$mate  of  Sd(μ)  =  Sd([μ1,…μr])   –  r  is  usually  a  large  number,  e.g.  200  

Bootstrap  for  Big  Data   •  Can  have  r  nodes  running  in  parallel,  each   sampling  one  bootstrap  data   •  However…   – N  can  be  very  large   – Data  may  not  fit  into  memory   – Collec$ng  N  samples  with  replacement  on  each   node  can  be  computa$onally  expensive  

M  out  of  N  Bootstrap     (Bikel  et  al.  1997)   •  Obtain  SdM(μ)  by  sampling  M  samples  with   replacement  for  each  bootstrap,  where  M<N   •  Apply  analy$cal  correc$on  to  SdM(μ)  to  obtain   Sd(μ)  using  prior  knowledge  of  convergence  rate   of  sample  es$mates   •  However…   –  Prior  knowledge  is  required   –  Choice  of  M  is  cri$cal  to  performance   –  Finding  op$mal  value  of  M  needs  more    computa$on  

Bag  of  Liale  Bootstraps  (BLB)   •  Example:  Standard  devia$on  of  the  mean     •  Generate  S  sampled  data  sets,  each  obtained  by  random   sampling  without  replacement  a  subset  of  size  b  (or   par$$on  the  original  data  into  S  par$$ons,  each  with  size   b)   •  For  each  data  p  =  1  to  S  do   –  For  i  =  1  to  r  do     •  N  samples  with  replacement  on  data  of  size  b   •  Compute  mean  of  the  resampled  data  μpi   –  Compute  Sdp(μ)  =  Sd([μp1,…μpr])   •  Es$mate  of  Sd(μ)  =  Avg([Sd1(μ),…,  SdS(μ)])  

Bag  of  Liale  Bootstraps  (BLB)   •  Interest:  ξ(θ),  where  θ  is  an  es$mate  obtained  from   size  N  data   –   ξ  is  some  func$on  of  θ,  such  as  standard  devia$on,  …   •  Generate  S  sampled  data  sets,  each  obtained  from  random   sampling  without  replacement  a  subset  of  size  b  (or  par$$on   the  original  data  into  S  par$$ons,  each  with  size  b)   •  For  each  data  p  =  1  to  S  do   –  For  i  =  1  to  r  do     •  Sample  N  samples  with  replacement  on  data  of  size  b   •  Compute  mean  of  the  resampled  data  θpi   –  Compute  ξp(θ)  =  ξ([θp1,…θpr])   •  Es$mate  of  ξ(μ)  =  Avg([ξ1(θ),…,  ξS(θ)])  

Bag  of  Liale  Bootstraps  (BLB)   •  Interest:  ξ(θ),  where  θ  is  an  es$mate  obtained  from   size  N  data   –   ξ  is  some  func$on  of  θ,  such  as  standard  devia$on,  …   •  Generate  S  sampled  data  sets,  each  obtained  from  random   sampling  without  replacement  a  subset  of  size  b  (or  par$$on   the  original  data  into  S  par$$ons,  each  with  size  b)   •  For  each  data  p  =  1  to  S  do   –  For  i  =  1  to  r  do     •  Sample  N  samples  with  replacement  on  the  data  of  size  b   •  Compute  mean  of  the  resampled  data  θpi   –  Compute  ξp(θ)  =  ξ([θp1,…θpr])   •  Es$mate  of  ξ(μ)  =  Avg([ξ1(θ),…,  ξS(θ)])   Mapper  Reducer   Gateway  

Why  is  BLB  Efficient   •  Before:   – N  samples  with  replacement  from  size  N  data  is   expensive  when  N  is  large   •  Now:   – N  samples  with  replacement  from  size  b  data   – b  can  be  several  magnitude  smaller  than  N  (e.g.  b   =  Nγ,  γ  in  [0.5,  1))   – Equivalent  to:  A  mul$nomial  sampler  with  dim  =  b   – Storage  =  O(b),  Computa$onal  complexity  =  O(b)  

Simula$on  Experiment   •  95%  CI  of  Logis$c  Regression  Coefficients   •  N  =  20000,  10  explanatory  variables   •  Rela$ve  Error  =  |Es$mated  CI  width  –  True  CI   width  |  /  True  CI  width   •  BLB-­‐γ:  BLB  with  b  =  Nγ   •  BOFN-­‐γ:  b  out  of  N  sampling  with  b  =  Nγ   •  BOOT:  Naïve  bootstrap  

Simula$on  Experiment  

Real  Data   •  95%  CI  of  Logis$c  Regression  Coefficients   •  N  =  6M,  3000  explanatory  variables   •  Data  size  =  150GB,  r  =  50,  s  =  5,  γ  =  0.7  

Summary  of  BLB   •  A  new  algorithm  for  bootstrapping  on  big  data   •  Advantages   – Fast  and  efficient   – Easy  to  parallelize   – Easy  to  understand  and  implement   – Friendly  to  Hadoop,  makes  it  rou$ne  to  perform   sta$s$cal  calcula$ons  on  Big  data  

Large  Scale  Logis$c  Regression  

Logis$c  Regression   •  Binary  response:  Y   •  Covariates:  X   •  Yi  ~  Bernoulli(pi)   •  log(pi/(1-­‐pi))  =  Xi Tβ  ;    β  ~  MVN(0  ,  1/λ  I  )   •  Widely  used  (research  and  applica$ons)  

Large  Scale  Logis$c  Regression   •  Binary  response:  Y   –  E.g.,  Click  /  Non-­‐click  on  an  ad  on  a  webpage   •  Covariates:  X   –  User  covariates:     •  Age,  gender,  industry,  educa$on,  job,  job  $tle,  …   –  Item  covariates:   •  Categories,  keywords,  topics,  …   –  Context  covariates:   •  Time,  page  type,  posi$on,  …   –  2-­‐way  interac$on:   •  User  covariates  X  item  covariates   •  Context  covariates  X  item  covariates   •  …  

Computa$onal  Challenge   •  Hundreds  of  millions/billions  of  observa$ons     •  Hundreds  of  thousands/millions  of  covariates   •  Fimng  such  a  logis$c  regression  model  on  a   single  machine  not  feasible   •  Model  fimng  itera$ve  using  methods  like   gradient  descent,  Newton’s  method  etc   – Mul$ple  passes  over  the  data  

Recap  on  Op$miza$on  method   •  Problem:  Find  x  to  min(F(x))   •  Itera$on  n:  xn  =  xn-­‐1  –  bn-­‐1  F’(xn-­‐1)   •   bn-­‐1  is  the  step  size  that  can  change  every   itera$on   •  Iterate  un$l  convergence     •  Conjugate  gradient,  LBFGS,  Newton  trust   region,  …  all  of  this  kind  

Itera$ve  Process  with  Hadoop     Disk   Mappers   Disk   Reducers   Disk  Mappers  Disk  Reducers   Disk   Mappers   Disk   Reducers  

Limita$ons  of  Hadoop  for  fimng  a  big   logis$c  regression   •  Itera$ve  process  is  expensive  and  slow   •  Every  itera$on  =  a  Map-­‐Reduce  job   •  I/O  of  mapper  and  reducers  are  both  through   disk   •  Plus:  Wai$ng  in  queue  $me     •  Q:  Can  we  find  a  fimng  method  that  scales   with  Hadoop  ?  

Large  Scale  Logis$c  Regression   •  Naïve:     –  Par$$on  the  data  and  run  logis$c  regression  for  each  par$$on   –  Take  the  mean  of  the  learned  coefficients   –  Problem:  Not  guaranteed  to  converge  to  the  model  from  single   machine!   •  Alterna$ng  Direc$on  Method  of  Mul$pliers  (ADMM)   –  Boyd  et  al.  2011   –  Set  up  constraints:  each  par$$on’s  coefficient  =  global   consensus   –  Solve  the  op$miza$on  problem  using  Lagrange  Mul$pliers   –  Advantage:  guaranteed  to  converge  to  a  single  machine  logis$c   regression  on  the  en$re  data  with  reasonable  number  of   itera$ons    

Large  Scale  Logis$c  Regression  via  ADMM   BIG  DATA   Par$$on  1   Par$$on  2   Par$$on  3   Par$$on  K   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Consensus   Computa$on   Iteration 1

Large  Scale  Logis$c  Regression  via  ADMM   BIG  DATA   Par$$on  1   Par$$on  2   Par$$on  3   Par$$on  K   Logis$c   Regression   Consensus   Computa$on   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Iteration 1

Large  Scale  Logis$c  Regression  via  ADMM   BIG  DATA   Par$$on  1   Par$$on  2   Par$$on  3   Par$$on  K   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Logis$c   Regression   Consensus   Computa$on   Iteration 2

Details  of  ADMM  

Dual  Ascent  Method   •  Consider  a  convex  op$miza$on  problem     •  Lagrangian  for  the  problem:     •  Dual  Ascent:   round and motivation. Dual Ascent der the equality-constrained convex optimization problem minimize f(x) subject to Ax = b, (2. ariable x ∈ Rn , where A ∈ Rm×n and f : Rn → R is convex. e Lagrangian for problem (2.1) is L(x,y) = f(x) + yT (Ax − b) he dual function is g(y) = inf x L(x,y) = −f∗ (−AT y) − bT y, y is the dual variable or Lagrange multiplier, and f∗ is the conv round and motivation. Dual Ascent der the equality-constrained convex optimization problem minimize f(x) subject to Ax = b, (2. variable x ∈ Rn , where A ∈ Rm×n and f : Rn → R is convex. he Lagrangian for problem (2.1) is L(x,y) = f(x) + yT (Ax − b) he dual function is g(y) = inf x L(x,y) = −f∗ (−AT y) − bT y, y is the dual variable or Lagrange multiplier, and f∗ is the conv gate of f; see [20, §3.3] or [140, §12] for background. The du rimal optimal point x from a dual optimal point y as x = argmin x L(x,y ), vided there is only one minimizer of L(x,y ). (This is the case e.g., f is strictly convex.) In the sequel, we will use the notation minx F(x) to denote any minimizer of F, even when F does not e a unique minimizer. In the dual ascent method, we solve the dual problem using gradient ent. Assuming that g is differentiable, the gradient g(y) can be uated as follows. We first find x+ = argminx L(x,y); then we have (y) = Ax+ − b, which is the residual for the equality constraint. The l ascent method consists of iterating the updates xk+1 := argmin x L(x,yk ) (2.2) yk+1 := yk + αk (Axk+1 − b), (2.3) ere αk > 0 is a step size, and the superscript is the iteration counter.

Augmented  Lagrangians   •  Bring  robustness  to  the  dual  ascent  method   •  Yield  convergence  without  assump$ons  like  strict   convexity  or  finiteness  of  f   •      •  The  value  of  ρ  influences  the  convergence  rate   aph-structured optimization problems. Augmented Lagrangians and the Method of Multiplie mented Lagrangian methods were developed in part to br tness to the dual ascent method, and in particular, to yield c nce without assumptions like strict convexity or finiteness of augmented Lagrangian for (2.1) is Lρ(x,y) = f(x) + yT (Ax − b) + (ρ/2) Ax − b 2 2, (22.3 Augmented Lagrangians and the Method where ρ > 0 is called the penalty parameter. (Note standard Lagrangian for the problem.) The augmen can be viewed as the (unaugmented) Lagrangian asso problem 2

Alterna$ng  Direc$on  Method  of   Mul$pliers  (ADMM)   •  Problem:     •  Augmented  Lagrangians     •  ADMM:   MM is an algorithm that is intended to blend the decompos dual ascent with the superior convergence properties of the m multipliers. The algorithm solves problems in the form minimize f(x) + g(z) subject to Ax + Bz = c h variables x ∈ Rn and z ∈ Rm , where A ∈ Rp×n , B ∈ Rp×m Rp . We will assume that f and g are convex; more specific as ns will be discussed in §3.2. The only difference from the g ear equality-constrained problem (2.1) is that the variable, ca re, has been split into two parts, called x and z here, with the e function separable across this splitting. The optimal value blem (3.1) will be denoted by p = inf{f(x) + g(z) | Ax + Bz = c}. with variables x ∈ Rn and z ∈ Rm , where A ∈ Rp×n , B ∈ Rp×m , and c ∈ Rp . We will assume that f and g are convex; more specific assump- tions will be discussed in §3.2. The only difference from the general linear equality-constrained problem (2.1) is that the variable, called x there, has been split into two parts, called x and z here, with the objec- tive function separable across this splitting. The optimal value of the problem (3.1) will be denoted by p = inf{f(x) + g(z) | Ax + Bz = c}. As in the method of multipliers, we form the augmented Lagrangian Lρ(x,z,y) = f(x) + g(z) + yT (Ax + Bz − c) + (ρ/2) Ax + Bz − c 2 2. 13 14 Alternating Direction Method of Multipliers ADMM consists of the iterations xk+1 := argmin x Lρ(x,zk ,yk ) (3.2) zk+1 := argmin z Lρ(xk+1 ,z,yk ) (3.3) yk+1 := yk + ρ(Axk+1 + Bzk+1 − c), (3.4) where ρ > 0. The algorithm is very similar to dual ascent and the

Large  Scale  Logis$c  Regression  via  ADMM   •  Nota$on     –  (Xi  ,  yi):  data  in  the  ith  par$$on   –  βi:  coefficient  vector  for  par$$on  i   –  β:  Consensus  coefficient  vector   –  r(β):  penalty  component  such  as  ||β||2 2     •  Op$miza$on  problem   Brief Article The Author July 7, 2013 min NX i=1 li(yi, XT i i) + r( ) subject to i =

ADMM  updates   LOCAL  REGRESSIONS   Shrinkage  towards  current   best  global  es$mate   UPDATED   CONSENSUS  

An  example  implementa$on   •  ADMM  for  Logis$c  regression  model  fimng  with   L2/L1  penalty   •  Each  itera$on  of  ADMM  is  a  Map-­‐Reduce  job   –  Mapper:  par$$on  the  data  into  K  par$$ons   –  Reducer:  For  each  par$$on,  use  liblinear/glmnet  to  fit   a  L1/L2  logis$c  regression   –  Gateway:  consensus  computa$on  by  results  from  all   reducers,  and  sends  back  the  consensus  to  each   reducer  node  

KDD  CUP  2010  Data   •  Bridge  to  Algebra  2008-­‐2009  data  in     haps://pslcdatashop.web.cmu.edu/KDDCup/ downloads.jsp   •  Binary  response,  20M  covariates   •  Only  keep  covariates  with  >=  10  occurrences   =>  2.2M  covariates   •  Training  data:  8,407,752  samples   •  Test  data  :  510,302  samples  

Avg  Training  Loglikelihood  vs  Number   of  Itera$ons  

Test  AUC  vs  Number  of  Itera$ons    

Beaer  Convergence  Can     Be  Achieved  By   •  Beaer  Ini$aliza$on   – Use  results  from  Naïve  method  to  ini$alize  the   parameters   •  Adap$vely  change  step  size  (ρ)  for  each   itera$on  based  on  the  convergence  status  of   the  consensus  

Recommender Problems for Web Applications

Agenda •  Topic of Interest –  Recommender problems for dynamic, time- sensitive applications •  Content Optimization, Online Advertising, Movie recommendation, shopping,… •  Introduction •  Offline components –  Regression, Collaborative filtering (CF), … •  Online components + initialization –  Time-series, online/incremental methods, explore/ exploit (bandit) •  Evaluation methods + Multi-Objective •  Challenges

Three components we will focus on •  Defining the problem –  Formulate objectives whose optimization achieves some long- term goals for the recommender system •  E.g. How to serve content to optimize audience reach and engagement, optimize some combination of engagement and revenue ? •  Modeling (to estimate some critical inputs) –  Predict rates of some positive user interaction(s) with items based on data obtained from historical user-item interactions •  E.g. Click rates, average time-spent on page, etc •  Could be explicit feedback like ratings •  Experimentation –  Create experiments to collect data proactively to improve models, helps in converging to the best choice(s) cheaply and rapidly. •  Explore and Exploit (continuous experimentation) •  DOE (testing hypotheses by avoiding bias inherent in data)

Modern Recommendation Systems •  Goal –  Serve the right item to a user in a given context to optimize long-term business objectives •  A scientific discipline that involves –  Large scale Machine Learning & Statistics •  Offline Models (capture global & stable characteristics) •  Online Models (incorporates dynamic components) •  Explore/Exploit (active and adaptive experimentation) –  Multi-Objective Optimization •  Click-rates (CTR), Engagement, advertising revenue, diversity, etc –  Inferring user interest •  Constructing User Profiles –  Natural Language Processing to understand content •  Topics, “aboutness”, entities, follow-up of something, breaking news,…

Some examples from content optimization •  Simple version –  I have a content module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module •  More advanced –  I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks? •  Highly advanced –  There are multiple modules running on my webpage. How do I perform a simultaneous optimization?

Recommend applications Recommend search queries Recommend news article Recommend packages: Image Title, summary Links to other pages Pick  4  out  of  a  pool  of  K          K  =  20  ~  40          Dynamic     Routes  traffic  other  pages  

Problems in this example •  Optimize CTR on multiple modules –  Today Module, Trending Now, Personal Assistant, News –  Simple solution: Treat modules as independent, optimize separately. May not be the best when there are strong correlations. •  For any single module –  Optimize some combination of CTR, downstream engagement, and perhaps advertising revenue.

Online Advertising Advertisers Ad Network Ads Page Recommend Best ad(s) User Publisher Response rates (click, conversion, ad-view) Bids Auction Click conversion Select argmax f(bid,response rates) ML /Statistical model Examples: Yahoo, Google, MSN, … Ad exchanges (RightMedia, DoubleClick, …)

LinkedIn  Today:  Content  Module   Objective: Serve content to maximize engagement metrics like CTR (or weighted CTR)

LinkedIn  Ads:  Match  ads  to  users   visi$ng  LinkedIn  

Right  Media  Ad  Exchange:  Unified   Marketplace   Match  ads  to  page  views  on  publisher  sites   Has  ad   impression  to   sell  -­‐-­‐   AUCTIONS   Bids  $0.50   Bids  $0.75  via  Network…   …  which  becomes   $0.45  bid   Bids  $0.65—WINS!   AdSense   Ad.com   Bids  $0.60  

Recommender problems in general USER     Item  Inventory   Ar$cles,  web  page,     ads,  …   Use  an  automated  algorithm     to  select  item(s)  to  show     Get  feedback  (click,  $me  spent,..)     Refine  the  models     Repeat  (large  number  of  :mes)   Op:mize  metric(s)  of  interest   (Total  clicks,  Total  revenue,…)   Example applications Search: Web, Vertical Online Advertising Content ….. Context   query,  page,  …  

•  Items: Articles, ads, modules, movies, users, updates, etc. •  Context: query keywords, pages, mobile, social media, etc. •  Metric to optimize (e.g., relevance score, CTR, revenue, engagement) –  Currently, most applications are single-objective –  Could be multi-objective optimization (maximize X subject to Y, Z,..) •  Properties of the item pool –  Size (e.g., all web pages vs. 40 stories) –  Quality of the pool (e.g., anything vs. editorially selected) –  Lifetime (e.g., mostly old items vs. mostly new items) Important Factors

Factors affecting Solution (continued) •  Properties of the context –  Pull: Specified by explicit, user-driven query (e.g., keywords, a form) –  Push: Specified by implicit context (e.g., a page, a user, a session) •  Most applications are somewhere on continuum of pull and push •  Properties of the feedback on the matches made –  Types and semantics of feedback (e.g., click, vote) –  Latency (e.g., available in 5 minutes vs. 1 day) –  Volume (e.g., 100K per day vs. 300M per day) •  Constraints specifying legitimate matches –  e.g., business rules, diversity rules, editorial Voice –  Multiple objectives •  Available Metadata (e.g., link graph, various user/item attributes)

Predicting User-Item Interactions (e.g. CTR) •  Myth: We have so much data on the web, if we can only process it the problem is solved –  Number of things to learn increases with sample size •  Rate of increase is not slow –  Dynamic nature of systems make things worse –  We want to learn things quickly and react fast •  Data is sparse in web recommender problems –  We lack enough data to learn all we want to learn and as quickly as we would like to learn –  Several Power laws interacting with each other •  E.g. User visits power law, items served power law –  Bivariate Zipf: Owen & Dyer, 2011

Can Machine Learning help? •  Fortunately, there are group behaviors that generalize to individuals & they are relatively stable –  E.g. Users in San Francisco tend to read more baseball news •  Key issue: Estimating such groups –  Coarse group : more stable but does not generalize that well. –  Granular group: less stable with few individuals –  Getting a good grouping structure is to hit the “sweet spot” •  Another big advantage on the web –  Intervene and run small experiments on a small population to collect data that helps rapid convergence to the best choices(s) •  We don’t need to learn all user-item interactions, only those that are good.

Predicting user-item interaction rates Offline   (  Captures  stable  characteris$cs   at  coarse  resolu$ons)   (Logis$c,  Boos$ng,….)     Feature    construc$on   Content:  IR,  clustering,  taxonomy,  en$ty,..     User  profiles:  clicks,  views,  social,  community,..   Near  Online   (Finer  resolu$on   Correc$ons)   (item,  user  level)   (Quick  updates)   Explore/Exploit   (Adap$ve  sampling)   (helps  rapid  convergence   to  best  choices)   Initialize

Post-click: An example in Content Optimization Recommender          EDITORIAL               content Clicks on FP links influence downstream supply distribution    AD  SERVER                DISPLAY          ADVERTISING   Revenue   Downstream   engagement      (Time  spent)  

Serving Content on Front Page: Click Shaping •  What do we want to optimize? •  Current: Maximize clicks (maximize downstream supply from FP) •  But consider the following –  Article 1: CTR=5%, utility per click = 5 –  Article 2: CTR=4.9%, utility per click=10 •  By promoting 2, we lose 1 click/100 visits, gain 5 utils •  If we do this for a large number of visits --- lose some clicks but obtain significant gains in utility? –  E.g. lose 5% relative CTR, gain 40% in utility (revenue, engagement, etc)

High  level  picture   http request Statistical Models updated in Batch mode: e.g. once every 30 mins Server   Item   Recommenda$on   system:  thousands   of  computa$ons  in   sub-­‐seconds   User Interacts e.g. click, does nothing

High  level  overview:  Item   Recommenda$on  System   User Info Item Index Id, meta-data ML/ Statistical Models Score Items P(Click), P(share), Semantic-relevance score,…. Rank Items: sort by score (CTR,bid*CTR,..) combine scores using Multi-obj optim, Threshold on some scores,…. User-item interaction Data: batch process Updated in batch: Activity, profile Pre-filter SPAM,editorial,,.. Feature extraction NLP, cllustering,..

ML/Sta$s$cal  models  for  scoring   Number of items Scored by ML Traffic volume 1000100 100k 1M 100M Few hours Few days Several days LinkedIn Today Yahoo! Front Page Right Media Ad exchange LinkedIn Ads

Summary  of  deployments       •  Yahoo!  Front  page  Today  Module  (2008-­‐2011):  300%  improvement   in  click-­‐through  rates   –  Similar  algorithms  delivered  via  a  self-­‐serve  pla•orm,  adopted  by   several  Yahoo!  Proper$es  (2011):  Significant  improvement  in   engagement  across  Yahoo!  Network   •  Fully  deployed  on  LinkedIn  Today  Module  (2012):  Significant   improvement  in  click-­‐through  rates  (numbers  not  revealed  due  to   reasons  of  confiden$ality)   •  Yahoo!  RightMedia  exchange  (2012):  Fully  deployed  algorithms  to   es$mate  response  rates  (CTR,  conversion  rates).  Significant   improvement  in  revenue  (numbers  not  revealed  due  to  reasons  of   confiden$ality)   •  LinkedIn  self-­‐serve  ads  (2012-­‐2013):Fully  deployed   •  LinkedIn  News  Feed  (2013-­‐2014):  Fully  deployed   •  Several  others  in  progress….  

Broad  Themes   •  Curse  of  dimensionality   –  Large  number  of  observa$ons  (rows),  large  number  of  poten$al  features   (columns)   –  Use  domain  knowledge  and  machine  learning  to  reduce  the  “effec$ve”   dimension  (constraints  on  parameters  reduce  degrees  of  freedom)   •  I  will  give  examples  as  we  move  along   •   We  oden  assume  our  job  is  to  analyze  “Big  Data”  but  we  oden  have   control  on  what  data  to  collect  through  clever  experimenta$on   –  This  can  fundamentally  change  solu$ons   •  Think  of  computa$on  and  models  together  for  Big  data   •  Op$miza$on:  What  we  are  trying  to  op$mize  is  oden  complex,models  to   work  in  harmony  with  op$miza$on   –  Pareto  op$mality  with  compe$ng  objec$ves  

Sta$s$cal  Problem   •  Rank  items  (from  an  admissible  pool)  for  user  visits  in  some   context  to  maximize  a  u$lity  of  interest   •  Examples  of  u$lity  func$ons   –  Click-­‐rates  (CTR)   –  Share-­‐rates  (CTR*  [Share|Click]  )   –  Revenue  per  page-­‐view  =  CTR*bid  (more  complex  due  to  second  price   auc$on)   •  CTR  is  a  fundamental  measure  that  opens  the  door  to  a  more   principled  approach  to  rank  items   •  Converge  rapidly  to  maximum  u$lity  items     –  Sequen$al  decision  making  process  (explore/exploit)    

item  j  from  a  set  of  candidates   User  i     with   user  features   (e.g.,  industry,    behavioral  features,   Demographic  features, ……)                  (i,  j)  :  response  yij  visits   Algorithm  selects   (click  or  not)   Which  item  should  we  select?    Ÿ  The  item  with  highest  predicted  CTR    Ÿ  An  item  for  which  we  need  data  to            predict  its  CTR   Exploit   Explore   LinkedIn Today, Yahoo! Today Module: Choose Items to maximize CTR This is an “Explore/Exploit” Problem

The Explore/Exploit Problem (to maximize CTR) •  Problem definition: Pick k items from a pool of N for a large number of serves to maximize the number of clicks on the picked items •  Easy!? Pick the items having the highest click-through rates (CTRs) •  But … –  The system is highly dynamic: •  Items come and go with short lifetimes •  CTR of each item may change over time –  How much traffic should be allocated to explore new items to achieve optimal performance ? •  Too little → Unreliable CTR estimates due to “starvation” •  Too much → Little traffic to exploit the high CTR items

Y!  front  Page  Applica$on   •  Simplify:  Maximize  CTR  on  first  slot  (F1)     •  Item  Pool   –  Editorially  selected  for  high  quality  and  brand  image   –  Few  ar$cles  in  the  pool  but  item  pool  dynamic    

CTR Curves of Items on LinkedIn Today CTR  

Impact  of  repeat  item  views  on  a  given   user   •  Same  user  is  shown  an  item  mul$ple  $mes   (despite  not  clicking)  

Simple  algorithm  to  es$mate  most  popular   item  with  small  but  dynamic  item  pool   •  Simple  Explore/Exploit  scheme –  ε%  explore:  with  a  small  probability  (e.g.  5%),  choose  an  item  at   random  from  the  pool   –  (100−ε)%  exploit:  with  large  probability  (e.g.  95%),  choose   highest  scoring  CTR  item   •  Temporal  Smoothing   –  Item  CTRs  change  over  $me,  provide  more  weight  to  recent  data  in   es$ma$ng  item  CTRs   •  Kalman  filter,  moving  average   •  Discount  item  score  with  repeat  views   –  CTR(item)  for  a  given  user  drops  with  repeat  views  by  some  “discount”   factor  (es$mated  from  data)   •  Segmented  most  popular   –  Perform  separate  most-­‐popular  for  each  user  segment    

Time  series  Model:  Kalman  filter   •  Dynamic  Gamma-­‐Poisson:  click-­‐rate  evolves  over  $me   in  a  mul$plica$ve  fashion   •  Es$mated  Click-­‐rate  distribu$on  at  $me  t+1     –  Prior  mean:   –  Prior  variance:       High  CTR  items  more  adap$ve  

More  economical  explora$on?  Beaer   bandit  solu$ons   •  Consider  two  armed  problem   p2 (unknown payoff probabilities) The  gambler  has  1000  plays,  what  is  the  best  way  to  experiment  ?                                                (to  maximize  total  expected  reward)      This  is  called  the  “mul$-­‐armed  bandit”  problem,  have  been  studied  for  a  long  $me.    Op$mal  solu$on:  Play  the  arm  that  has  maximum  poten:al  of  being  good    Op:mism  in  the  face  of  uncertainty   p1 >

Item  Recommenda$on:  Bandits?   •  Two  Items:  Item  1  CTR=  2/100  ;  Item  2  CTR=  250/10000   –  Greedy:  Show  Item  2  to  all;  not  a  good  idea   –  Item  1  CTR  es$mate  noisy;  item  could  be  poten$ally   beaer   •  Invest  in  Item  1  for  beaer  overall  performance  on  average       –  Exploit  what  is  known  to  be  good,  explore  what  is  poten$ally  good   CTR Probabilitydensity Item 2 Item 1

Next few hours Most Popular Recommendation Personalized Recommendation Offline Models Collaborative filtering (cold-start problem) Online Models Time-series models Incremental CF, online regression Intelligent Initialization Prior estimation Prior estimation, dimension reduction Explore/Exploit Multi-armed bandits Bandits with covariates

Offline Components: Collaborative Filtering in Cold-start Situations

Problem Item j with User i with user features xi (demographics, browse history, search history, …) item features xj (keywords, content categories, ...) (i, j) : response yijvisits Algorithm selects (explicit rating, implicit click/no-click) Predict the unobserved entries based on features and the observed entries

Model Choices •  Feature-based (or content-based) approach –  Use features to predict response •  (regression, Bayes Net, mixture models, …) –  Limitation: need predictive features •  Bias often high, does not capture signals at granular levels •  Collaborative filtering (CF aka Memory based) –  Make recommendation based on past user-item interaction •  User-user, item-item, matrix factorization, … •  See [Adomavicius & Tuzhilin, TKDE, 2005], [Konstan, SIGMOD’08 Tutorial], etc. –  Better performance for old users and old items –  Does not naturally handle new users and new items (cold- start)

Collaborative Filtering (Memory based methods) User-User Similarity Item-Item similarities, incorporating both Estimating Similarities Pearson’s correlation Optimization based (Koren et al)

How to Deal with the Cold-Start Problem •  Heuris$c-­‐based  approaches   –  Linear  combina$on  of  regression  and  CF  models     –  Filterbot   •  Add  user  features  as  psuedo  users  and  do  collabora$ve  filtering   -­‐  Hybrid  approaches   -­‐  Use  content  based  to  fill  up  entries,  then  use  CF   •  Matrix  Factoriza$on   –  Good  performance  on  Ne•lix  (Koren,  2009)   •  Model-­‐based  approaches     –  Bilinear  random-­‐effects  model  (probabilis$c  matrix  factoriza$on)   •  Good  on  Ne•lix  data  [Ruslan  et  al  ICML,  2009]   –  Add  feature-­‐based  regression  to  matrix  factoriza$on     •  (Agarwal  and  Chen,  2009)   –  Add  topic  discovery  (from  textual  items)  to  matrix  factoriza$on     •  (Agarwal  and  Chen,  2009;  Chun  and  Blei,  2011)  

Per-item regression models •  When tracking users by cookies, distribution of visit patters could get extremely skewed – Majority of cookies have 1-2 visits •  Per item models (regression) based on user covariates attractive in such cases

Several per-item regressions: Multi-task learning Low dimension (5-10), B estimated retrospective data •  Agarwal,Chen and Elango, KDD, 2010 Affinity to old items

Per-user, per-item models via bilinear random-effects model

Motivation •  Data measuring k-way interactions pervasive –  Consider k = 2 for all our discussions •  E.g. User-Movie, User-content, User-Publisher-Ads,…. –  Power law on both user and item degrees •  Classical Techniques –  Approximate matrix through a singular value decomposition (SVD) •  After adjusting for marginal effects (user pop, movie pop,..) –  Does not work •  Matrix highly incomplete, severe over-fitting –  Key issue •  Regularization of eigenvectors (factors) to avoid overfitting

Early work on complete matrices •  Tukey’s 1-df model (1956) –  Rank 1 approximation of small nearly complete matrix •  Criss-cross regression (Gabriel, 1978) •  Incomplete matrices: Psychometrics (1-factor model only; small data sets; 1960s) •  Modern day recommender problems –  Highly incomplete, large, noisy.

Latent Factor Models “newsy” “sporty” “newsy” s item v z Affinity = u’v Affinity = s’z u s p o r t y

Factorization – Brief Overview •  Latent user factors: (αi , ui=(ui1,…,uin)) •  (Nn + Mm) parameters •  Key technical issue: •  Latent movie factors: (βj , vj=(v j1,….,v jn)) will overfit for moderate values of n,m Regularization Interaction jijiij BvuyE ʹ′+++= βαµ)(

Latent Factor Models: Different Aspects •  Matrix Factorization – Factors in Euclidean space – Factors on the simplex •  Incorporating features and ratings simultaneously •  Online updates

Maximum Margin Matrix Factorization (MMMF) •  Complete matrix by minimizing loss (hinge,squared- error) on observed entries subject to constraints on trace norm –  Srebro, Rennie, Jakkola (NIPS 2004) –  Convex, Semi-definite programming (expensive, not scalable) •  Fast MMMF (Rennie & Srebro, ICML, 2005) –  Constrain the Frobenious norm of left and right eigenvector matrices, not convex but becomes scalable. •  Other variation: Ensemble MMMF (DeCoste, ICML2005) –  Ensembles of partially trained MMMF (some improvements)

Matrix Factorization for Netflix prize data •  Minimize the objective function •  Simon Funk: Stochastic Gradient Descent •  Koren et al (KDD 2007): Alternate Least Squares –  They move to SGD later in the competition ∑ ∑∑∈ ++− obsij j j i ij T iij vuvur )()( 222 λ

ui vj rij au av 2 σ Optimization is through Iterated conditional modes Other variations like constraining the mean through sigmoid, using “who-rated- whom” Combining with Boltzmann Machines also improved performance ),(~ ),(~ ),(~ 2 IaMVN IaMVN Nr vj ui j T iij 0v 0u vu σ Probabilis$c  Matrix  Factoriza$on   (Ruslan  &  Minh,  2008,  NIPS)  

Bayesian Probabilistic Matrix Factorization (Ruslan and Minh, ICML 2008) •  Fully Bayesian treatment using an MCMC approach –  Significant improvement •  Interpretation as a fully Bayesian hierarchical model shows why that is the case –  Failing to incorporate uncertainty leads to bias in estimates –  Multi-modal posterior, MCMC helps in converging to a better one r Var-comp: au MCEM also more resistant to over-fitting

Non-parametric Bayesian matrix completion (Zhou et al, SAM, 2010) •  Specify rank probabilistically (automatic rank selection) )/)1(,/(~ )(~ ),(~ 1 2 rrbraBeta Berz vuzNy k kk r k jkikkij − ∑= π π σ ))1(/(Factors)#( )))1(/(,1(~ −+= −+ rbaraE rbaaBerzk

How to incorporate features: Deal with both warm start and cold-start •  Models to predict ratings for new pairs –  Warm-start: (user, movie) present in the training data with large sample size –  Cold-start: At least one of (user, movie) new or has small sample size •  Rough definition, warm-start/cold-start is a continuum. •  Challenges –  Highly incomplete (user, movie) matrix –  Heavy tailed degree distributions for users/movies •  Large fraction of ratings from small fraction of users/ movies –  Handling both warm-start and cold-start effectively in the presence of predictive features

Possible approaches •  Large scale regression based on covariates –  Does not provide good estimates for heavy users/movies –  Large number of predictors to estimate interactions •  Collaborative filtering –  Neighborhood based –  Factorization •  Good for warm-start; cold-start dealt with separately •  Single model that handles cold-start and warm-start –  Heavy users/movies → User/movie specific model –  Light users/movies → fallback on regression model –  Smooth fallback mechanism for good performance

Add Feature-based Regression into Matrix Factorization RLFM: Regression-based Latent Factor Model

Regression-based Factorization Model (RLFM) •

#views presentations

Add a comment


fifa | 15/02/15
lol, That's a good post! Like you, Thank you! Excuse me, Please look at my username! Cheap Fifa Coins. fifa http://www.fifacoing.com/

Related presentations

Related pages

ENAR 2016 Short Courses

ENAR 2016 | Spring Meeting | March 6-9 | Preliminary Program 95 SC2: Statistical Analysis of Network Data FULL DAY: 8:00 am to 5:00 pm Eric Kolaczyk
Read more

ENAR 2015 Spring Meeting: Welcome

Eastern North American Region. ... The 2015 ENAR Spring Meeting will be held ... The 2015 ENAR Spring Meeting offers a superb program of short courses, ...
Read more

GitHub - muschellij2/ENARSC2015: ENAR short Course 2015

ENARSC2015 - ENAR short Course 2015 ... Use SSH Clone with HTTPS Use Git or checkout with SVN using the web URL.
Read more

GitHub - OHDSI/EnarShortCourse: Materials for short course ...

EnarShortCourse - Materials for short course at ENAR 2015
Read more

David Ruppert - Short Courses - Cornell University

Short Courses of David Ruppert . ... ENAR (with R. Carroll) ... Semiparametric Regression (one-day short course) JSM 2009, Aug 2, 2009 JSM 2009, Aug 2, 2009
Read more

Raymond J. Carroll

Software for the ENAR Short Course on Measurement Error Models Stata Programs . How to download the measurement error programs . for regression calibration
Read more

2001 ENAR SPRING MEETING - Vanderbilt University

2009 ENAR SPRING MEETING SHORT COURSE/TUTORIAL EVALUATION March 15-18, 2009 Grand Hyatt – San Antonio, Texas Please complete the following and return to ...
Read more

Short Course for ENAR 2009 - Sunday ... - Andrew O. Finley

Short Course for ENAR 2009 - Sunday, March 15, 2009. Hierarchical Modeling and Analysis of Spatial-Temporal Data: Emphasis in Forestry, Ecology, and ...
Read more

1 Instructors 2 Description - Michigan State University

Hierarchical Modeling and Analysis of Spatial-Temporal Data: Emphasis in Forestry, Ecology, and Environmental Sciences 1 Instructors Andrew O. Finley ...
Read more