“Survival” Analysis of
     Web Users
             Dell Zhang
DCSIS, Birkbeck, University of London



                                        1
Outline
• What Is It
• Why Is It Useful
• Case Study
  – The Departure Dynamics of Wikipedia Editors




                                                  2
What Is It




             3
Time-To-Event Data
• Survival Analysis is a branch of statistics which
  deals with the modelling of time-to-event data
  – The outcome variable of interest is time until an
    event occurs.
     • death, disease, failure
     • recovery, marriage
  – It is called reliability theory/analysis in
    engineering, and duration analysis/modelling in
    economics or sociology.

                                                        4
Y   X

        How to build
        a probabilistic model of Y ?

        How to build
        a probabilistic model of Y given X ?




                                               6
Censoring
• A key problem in survival analysis
  – It occurs when we have some information about
    an individual's survival time, but we don’t know
    that survival time exactly.




                                                      8
Y   X


        Options:

        1) Wait for those patients to die?

        2) Discard the censored data?

        3) Use the censored data as if they were
           not censored?

        4) ……
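
A toy illustration of why options 2 and 3 are unsatisfactory. The observations below are made up for this sketch and are not data from the talk. Each record pairs the number of days observed with a flag saying whether the event was actually seen; a censored subject is known only to have survived at least that long.

```python
# Hypothetical toy data: (days_observed, event_observed).
# event_observed=False means the subject was still alive when observation stopped.
observations = [(5, True), (8, True), (12, False), (20, True), (30, False), (30, False)]

# Option 2: discard the censored records and average the rest.
uncensored = [days for days, observed in observations if observed]
mean_discarding = sum(uncensored) / len(uncensored)

# Option 3: pretend every record ended in an event.
all_days = [days for days, _ in observations]
mean_pretending = sum(all_days) / len(all_days)

print(mean_discarding)  # 11.0
print(mean_pretending)  # 17.5
```

Every censored subject lived at least as long as recorded, so the true mean lifetime of this sample cannot be below 17.5 days. In this toy data, discarding the censored records gives an even lower figure, because the long-lived subjects are exactly the ones that were dropped.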




                                               10
Goals
• Survival Analysis attempts to answer
  questions such as
  – What is the fraction of a population which will
    survive past a certain time? Of those that survive,
    at what rate will they die?
  – Can multiple causes of death be taken into
    account?
  – How do particular circumstances or characteristics
    increase or decrease the odds of survival?

                                                      11
• Censoring of data
• Comparing groups
   – (1 treatment vs. 2 placebo)
• Confounding or Interaction
  factors
   – Log WBC




                                   12
Why Is It Useful

for Online Marketing etc.




                            13
The Data Are There
• Events meaningful to online marketing
  – Time to Clicking the Ad
  – Informational: Time to Finding the Wanted Info
  – Transactional: Time to Buying the Product
  – Social: Time to Joining/Leaving the Community
  – ……

                                     Time Matters!

                                                     14
Evidence-Based Marketing
• Let’s work as (real) doctors
  – Users = Patients
  – Advertisement (Marketing) = Treatment

                      Survival Analysis brings
                        the time dimension
                      back to the centre stage.



                                                  15
Predict whether (or rather, when) a new question asked on Stack Overflow will be closed
                                                                 19
Case Study

The Departure Dynamics of
    Wikipedia Editors



                            20
About 90,000 regularly active volunteer editors around the world
                                                                  21
Departure Dynamics
• Who are likely to “die”?
• How soon will they “die”?
• Why do they “die”?

     “live”= stay in the editors’ community
           = keep editing
     “die” = leave the editors’ community
           = stop editing (for 5 months)
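
The definition above can be turned into a (lifetime, death observed) pair per editor. This is a minimal sketch, not the author's code: the function name and input format are hypothetical, "5 months" is approximated as 150 days, and an editor with no such gap before the end of observation is treated as censored at that date.

```python
from datetime import date, timedelta

INACTIVITY = timedelta(days=150)  # assumption: "5 months" approximated as 150 days

def editor_lifetime(edit_dates, observation_end):
    """Return (lifetime_in_days, death_observed) for one editor.

    Assumes edit_dates is non-empty and no edit is later than observation_end.
    """
    edits = sorted(edit_dates)
    first = edits[0]
    # Compare each edit with the next one; the last edit is compared with the
    # end of the observation period.
    for current, following in zip(edits, edits[1:] + [observation_end]):
        if following - current > INACTIVITY:
            return (current - first).days, True   # "died" at this edit
    return (observation_end - first).days, False  # still "alive": censored

departed = editor_lifetime(
    [date(2010, 1, 1), date(2010, 2, 1), date(2010, 3, 1)], date(2011, 1, 1))
still_active = editor_lifetime(
    [date(2010, 10, 1), date(2010, 12, 1)], date(2011, 1, 1))
print(departed)      # (59, True)
print(still_active)  # (92, False)
```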

                                              23
Who are likely to “die”?

      (WikiChallenge)




                           24
Timeline 1: 2001-01-01 | 2010-04-01 | 2010-09-01
Timeline 2: 2001-06-01 | 2010-09-01 | 2011-02-01



                                                        26
Behavioural Dynamics Features




Exponential Steps

                    months

                     Web Search (SIGIR-2009),
                     Social Tagging (WWW-2009),
                     Language Modelling (ICTIR-2009)
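
One plausible reading of "exponential steps" is that an editor's activity is summarised by edit counts over look-back windows whose lengths double at each step. The sketch below follows that reading; the window lengths, function name, and input format are assumptions for illustration and not necessarily the feature set used in the talk.

```python
def exponential_window_counts(edit_ages_months, windows_months=(1, 2, 4, 8)):
    """Count the edits that fall inside each look-back window.

    edit_ages_months: how long ago each edit happened, in months (hypothetical format).
    windows_months:   window lengths that double at each step (assumed values).
    """
    return [sum(1 for age in edit_ages_months if age <= window)
            for window in windows_months]

features = exponential_window_counts([0.5, 1.5, 3, 10])
print(features)  # [1, 2, 3, 3]
```

Short windows capture recent activity at fine resolution, while long windows summarise older history coarsely with only a few extra features.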

                                                  28
Gradient Boosted Trees (GBT)




                                         32
                       © 2008-2012 ~maniraptora
Gradient Boosted Trees (GBT)
• The success of GBT in our task is probably
  attributable to
  – its ability to capture the complex nonlinear
    relationship between the target variable and the
    features,
  – its insensitivity to different feature value ranges as
    well as outliers, and
  – its resistance to overfitting via regularisation
    mechanisms such as shrinkage and subsampling
    (Friedman 1999a; 1999b).
• GBT vs RF
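
To make shrinkage and subsampling concrete, here is a minimal least-squares gradient boosting sketch with one-split trees (stumps) on a single feature. It is a teaching example in pure Python, not the model used in the WikiChallenge entry; all function names, parameter values, and the toy data are assumptions.

```python
import math
import random

def fit_stump(xs, residuals):
    """Best single-split regression stump on a 1-D feature (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_gbt(xs, ys, n_rounds=50, learning_rate=0.1, subsample=0.8, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)                 # start from the mean
    predictions = [base] * len(xs)
    stumps = []
    sample_size = max(2, int(subsample * len(xs)))
    for _ in range(n_rounds):
        # Subsampling: each stump sees only a random part of the data.
        idx = rng.sample(range(len(xs)), sample_size)
        # For squared error, the negative gradient is the residual.
        residuals = [ys[i] - predictions[i] for i in idx]
        stump = fit_stump([xs[i] for i in idx], residuals)
        stumps.append(stump)
        # Shrinkage: add only a fraction of each stump's correction.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps

def predict_gbt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

xs = list(range(20))
ys = [math.log1p(x) for x in xs]             # hypothetical nonlinear target
model = fit_gbt(xs, ys)
mse_before = mse([sum(ys) / len(ys)] * len(ys), ys)
mse_after = mse([predict_gbt(model, x) for x in xs], ys)
print(mse_before, mse_after)
```

Because stumps split on thresholds, rescaling a feature monotonically does not change the splits, which is one way to see the insensitivity to feature value ranges mentioned above.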
                                                             33
Final Result
• The 2nd best valid algorithm in the
  WikiChallenge
  – RMSLE = 0.862582: 41.7% improvement over
    WMF’s in-house solution
  – Much simpler model than the top-performing
    system: 21 behavioural dynamics features vs. 206
    features
  – WMF is now implementing this algorithm
    permanently and looks forward to using it in the
    production environment.
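
RMSLE stands for root mean squared logarithmic error. A minimal sketch under its usual definition follows; the function name and the toy inputs are illustrative, and the challenge's exact scoring script is not reproduced here.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error of two equal-length,
    non-negative sequences (e.g. predicted vs. actual edit counts)."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

perfect = rmsle([3, 0, 12], [3, 0, 12])
one_log_unit_off = rmsle([math.e - 1], [0])
print(perfect, one_log_unit_off)
```

Taking logarithms of 1 + count means that an error on a very prolific editor does not swamp errors on the many editors with few or no edits.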

                                                    38
How soon will they “die”?




                            39
110,000 random samples         birth & death




     January 2001


              The evolution of Wikipedia editors' community.
                                                               40
110,000 random samples         active editors




     January 2001


              The evolution of Wikipedia editors' community.
                                                               41
Survival Function

What is the fraction of a population which
will survive past a certain time?
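
In symbols, with $T$ the random lifetime, this question defines the survival function. The notation $F$ and $f$ for the distribution and density functions of $T$ is assumed here rather than taken from the slides.

```latex
S(t) = \Pr(T > t) = 1 - F(t) = \int_t^{\infty} f(u)\,du
```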




                                             42
Occasional Editors                   Customary Editors




    The histogram of Wikipedia editors' lifetime.        43
Kaplan-Meier Estimator




                         44
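
The Kaplan-Meier (product-limit) estimator multiplies, over every observed event time $t_i \le t$, the fraction of at-risk subjects that survive it: $\hat S(t) = \prod_{t_i \le t} (1 - d_i / n_i)$, with $d_i$ events among $n_i$ subjects still at risk. A minimal pure-Python sketch on hypothetical toy data (not the Wikipedia sample) follows; the function name is illustrative.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: time each subject was followed.
    observed:  True if the event was seen, False if the subject was censored.
    """
    data = sorted(zip(durations, observed))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without lowering S(t).
        n_at_risk -= leaving
    return curve

km_curve = kaplan_meier([5, 8, 12, 20, 30, 30],
                        [True, True, False, True, False, False])
for t, s in km_curve:
    print(t, round(s, 4))
```

Censored subjects still count in the denominator $n_i$ for as long as they were observed, which is how the estimator uses their partial information instead of discarding it.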
The empirical survival function.   46
Normal Distribution




                      47
 Probability Plot
Extreme Value Distribution




                             48
    Probability Plot
Rayleigh Distribution




                        49
 Probability Plot
Exponential Distribution




                           50
   Probability Plot
Lognormal Distribution




                         51
  Probability Plot
Weibull Distribution




                       52
 Probability Plot
The survival function.   53
Weibull distribution




                       54
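
For reference, the two-parameter Weibull distribution has survival function $S(t) = \exp(-(t/\lambda)^k)$ and hazard $h(t) = (k/\lambda)(t/\lambda)^{k-1}$. The sketch below evaluates them; the shape and scale values are hypothetical and are not the parameters fitted in the talk.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_median(shape, scale):
    """Time at which S(t) = 0.5."""
    return scale * math.log(2) ** (1 / shape)

shape, scale = 0.8, 100.0  # hypothetical parameters for illustration only
median = weibull_median(shape, scale)
print(median, weibull_survival(median, shape, scale))
print(weibull_hazard(10, shape, scale), weibull_hazard(20, shape, scale))
```

A shape below 1 gives a hazard that falls over time, a shape of exactly 1 reduces to the exponential distribution with constant hazard, and a shape above 1 gives a rising hazard.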
Expected Future Lifetime




              median lifetime: 53 days


                                         55
Hazard Function
Of those that survive, at what rate will they die?




   The instantaneous potential per unit time for the event to occur,
   given that the individual has survived up to time t.

                                                                   56
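
Written out in the standard notation (assumed here: $T$ is the lifetime, $f$ its density, and $S$ the survival function), the hazard function is:

```latex
h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)}
     = -\frac{d}{dt} \ln S(t)
```

The last equality means that the survival function can be recovered from the hazard as $S(t) = \exp\left(-\int_0^t h(u)\,du\right)$.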
Bathtub Curve




http://en.wikipedia.org/wiki/Bathtub_curve   57
The hazard function.   58
The hazard function.   59
Conclusions
• For customary Wikipedia editors,
  – the survival function can be well described by a
    Weibull distribution (with the median lifetime of
    about 53 days);
  – there are two critical phases (0-2 weeks and 8-20
    weeks) when the hazard rate of becoming inactive
    increases;
  – more active editors tend to stay active
    for longer.

                                                     60
Why do they “die”?




                     61
Covariates
Last
Edit




                    62
Cox Proportional Hazards Model




                                 65
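
In its standard form, the Cox proportional hazards model writes the hazard of an individual with covariates $X = (X_1, \dots, X_p)$ as a baseline hazard scaled by an exponential term. The notation below is the usual textbook one and is assumed rather than copied from the slide.

```latex
h(t \mid X) = h_0(t)\,\exp\!\left(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p\right)
```

The baseline hazard $h_0(t)$ depends on time but not on the covariates, and the exponential term depends on the covariates but not on time, so the ratio of hazards between any two individuals is constant over time.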
Semi-Parametric
• The semi-parametric property of the Cox
  model => its popularity
  – The baseline hazard is unspecified
  – Robust: it will closely approximate the correct
    parametric model
  – Using a minimum of assumptions




                                                      66
Cox PH vs. Logistic




                      67
Maximum Likelihood Estimation




                                68
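
For the Cox model, the coefficients are usually estimated by maximising a partial likelihood that leaves the baseline hazard out. A standard statement, assuming no tied event times, with $\delta_i$ the event indicator and $R(t_i)$ the set of individuals still at risk at time $t_i$ (notation assumed here):

```latex
L(\beta) = \prod_{i:\,\delta_i = 1}
           \frac{\exp(\beta^{\top} x_i)}{\sum_{j \in R(t_i)} \exp(\beta^{\top} x_j)}
```

Censored individuals contribute no factor of their own, but they appear in the denominators for every event time at which they were still under observation.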
Cox Proportional Hazards Model


Covariate                  β         se        z          p
X1: namespace==Main       -0.1095   0.0172   -6.3664    0.1935e-9
X2: log(1+cur_size)       -0.0688   0.0036   -19.2474   0.0000e-9




                                                        69
Hazard Ratio




               70
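
In a Cox model the hazard ratio for a one-unit increase in a covariate is $e^{\beta}$. The sketch below applies this to the coefficients and standard errors reported on the Cox model results slide; the approximate 95% Wald interval and the helper name are additions for illustration.

```python
import math

# Coefficients and standard errors as reported on the Cox model results slide.
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}

def hazard_ratio(beta, se, z=1.96):
    """Return (HR, lower, upper) with an approximate 95% Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

ratios = {name: hazard_ratio(beta, se) for name, (beta, se) in coefficients.items()}
for name, (hr, lower, upper) in ratios.items():
    print(f"{name}: HR={hr:.4f} (95% CI {lower:.4f} to {upper:.4f})")
```

Both coefficients are negative, so both hazard ratios are below 1: with the other covariate held fixed, namespace==Main is associated with a hazard of leaving roughly 10% lower, and each unit increase in log(1+cur_size) with a hazard roughly 7% lower.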
Adjusted Survival Curves




                           71
Next Step




            73
Cartoon: Ron Hipschman
Data: David Hand
                                                     74
Lightning Does Strike Twice!
• Roy Sullivan, a former park ranger from Virginia
  – He was struck by lightning 7 times
     •   1942 (lost big-toe nail)
     •   1969 (lost eyebrows)
     •   1970 (left shoulder seared)
     •   1972 (hair set on fire)
     •   1973 (hair set on fire & legs seared)
     •   1976 (ankle injured)
     •   1977 (chest & stomach burned)
  – He committed suicide in September 1983.

                                                     75
A Lot More To Do
• Multiple Occurrences of “Death”
  – Recurrent Event Survival Analysis (e.g., based on
    Counting Process)
• Multiple Types of “Death”
  – Competing Risks Survival Analysis




                                                        76
Software Tools
• R
  – The ‘survival’ package
• Matlab
  – The ‘statistics’ toolbox
• Python
  – The ‘statsmodels’ module?




                                 77
References
• David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning
  Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta
• John Wallace. How Big Data is Changing Retail Marketing Analytics.
  Webinar, Apr 2005. http://goo.gl/OlMmi
• Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors
  Keep Active? In Proceedings of the 8th International Symposium on Wikis
  and Open Collaboration (WikiSym), Linz, Austria, Aug 2012.
  http://goo.gl/On3qr
• Dell Zhang. Wikipedia Edit Number Prediction based on Temporal
  Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct
  2011. http://goo.gl/s2Dex




                                                                         78
?

    79