Measuring the New
Wikipedia Community
PyData SV 2013
Ryan Faulkner (rfaulkner@wikimedia.org)
Wikimedia Foundation
Overview
Introduction
Problem & Motivation
Proposed Solution
User Metrics
A Short Example
Extending the Solution
Using the Tool
Live Demo!!
Introduction
Me: Data Analyst at Wikimedia
Machine Learning @ McGill
Fundraising - A/B testing
Editor Experiments - increasing the number of
Active editors
Editor Engagement Experiments (E3) team @ the
Wikimedia Foundation
Micro-feature experimentation
Problem
What's wrong with Wikipedia?
Problem - Editor Decline
http://strategy.wikimedia.org/wiki/Editor_Trends_Study
Problem - Approach
Can we stimulate the community of users to become more
numerous and productive?
○ Focus on new users
■ Encourage contribution, make it easier
○ Lower the threshold for account creation
■ Bring more people in.
○ Rapid experimentation on features that retain more
users and stimulate increased participation.
■ This will help us determine what works with less
cost
Problem - Evaluation
○ Data Consistency
■ Anomaly Detection
■ Auto-correlation (seasonality)
○ "A/B" testing
■ Hypothesis testing - Student's t, chi-square
■ Linear / Logistic regression
○ Multivariate testing
■ Analysis of variance
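As a rough illustration of the "A/B" hypothesis test named above, here is a minimal sketch of a pooled two-sample Student's t statistic using only the standard library. The function name and the sample edit counts are hypothetical and do not come from the talk.

```python
from math import sqrt
from statistics import mean, variance


def student_t(control, treatment):
    """Pooled two-sample Student's t statistic (assumes equal variances)."""
    n_c, n_t = len(control), len(treatment)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
    standard_error = sqrt(pooled * (1 / n_c + 1 / n_t))
    return (mean(treatment) - mean(control)) / standard_error


# Hypothetical edits-per-user counts for a control and a test group.
control_edits = [1, 0, 2, 1, 3, 0, 1, 2]
treatment_edits = [2, 1, 3, 2, 4, 1, 2, 3]
t_stat = student_t(control_edits, treatment_edits)
```

The statistic would then be compared against a t distribution with n_c + n_t - 2 degrees of freedom to obtain a p-value.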
Problem - What we need
Currently a lot of the work around analysis is done
manually and is a large drain on resources:
○ Faster Data gathering
○ Knowing what we're logging and measuring &
faster ETL
○ Faster Analysis
○ Broadening Service and iterating on results
Problem - What we need
Build better infrastructure around how we interpret and
analyze our data.
○ Determine what to measure.
■ Rigorously define relevant metrics
○ Expose the metrics from our data store
■ Python is great for writing code quickly to handle
tasks with data
■ Library support for data analysis (pandas,
numpy)
Solution
The tools to build.
Solution - Proposed
We need to measure User Behaviour
"User Metrics" & "UMAPI"
User Metrics & UMAPI
Python implementation for gathering data from MediaWiki data stores,
producing well defined metrics, and facilitating subsequent modelling and
analysis. This includes a way to provide an interface for making different types
of requests and returning standard responses.
Solution - Why Bother
What exactly do we gain by building these
classes? Why not just query the database?
1. Reproducibility & Standardization
2. Extensibility
3. Concise definition
4. Faster turnaround
a. Multiprocessing to optimize metrics generation
(e.g. Revert rate on 100K users
via MySQL = 24hrs,
via User Metrics < 10mins)
Solution - Why Python?
Why not C++, Java, or PHP?
1. Speed of development
2. Simplify the code base & easy extensibility
a. more "Scientist Friendly"
3. Good support for data processing
4. Better integration for downstream data analysis
5. The way that metrics work lends itself to "Pythonic"
artifacts: list comprehensions, decorator patterns, duck
typing, and a RESTful API.
User Metrics
How do we form a picture about what happens
on Wikipedia?
User Metrics - User activity
Events (not exhaustive):
■ Registration
■ Making an edit
■ Contributions of Namespaces
■ Reverting edits
■ Blocking
User Metrics - What do we want to
know about users?
○ How much do they contribute?
○ How often do they contribute?
○ Potential vandals. Do they go on to be reverted,
blocked, banned?
User Metrics - Metrics Definitions
https://meta.wikimedia.org/wiki/Research:Metrics
Retention Metrics
Survival(t): Boolean measure of an editor surviving beyond t
Threshold(t,n): Boolean measure of an editor reaching activity threshold n by time t
Live Account(t): Boolean measure of whether the new user clicks the edit button
Volume Metrics
Edit Rate: Float result of user's rate of contribution
Content: Integer bytes added by revision and edit count
Sessions: Average session length (future)
Time to Threshold: Time to reach a threshold (e.g. first edit)
User Metrics - Metrics Definitions
Content Quality
Revert Rate: Float representing the proportion of revisions reverted
Block: Boolean indicating a block event on the user
Content Persistence: Integer indicating how long this user's edits survive (future)
Contribution Type
Namespace of Edits: Integer edit counts in all namespaces
Scale of Change: Float representation of fraction of total page content modified (future)
User Metrics - Bytes Added
User revision history (over a predefined period)
Revision k:
byte increase
(user ID, bytes_added, bytes_removed, edit count)
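A minimal sketch of how the bytes-added tuple above could be derived from a user's revision history. The input format (pairs of revision length and parent revision length) and the function name are assumptions for illustration, not the User Metrics implementation.

```python
def bytes_added(user_id, revisions):
    """Summarise byte changes over a user's revisions.

    `revisions` is a hypothetical list of (rev_len, parent_len) pairs,
    one per revision made by the user in the period of interest.
    """
    added = 0
    removed = 0
    for rev_len, parent_len in revisions:
        delta = rev_len - parent_len
        if delta >= 0:
            added += delta      # revision k increased the page size
        else:
            removed += -delta   # revision k shrank the page
    return (user_id, added, removed, len(revisions))


# Three hypothetical revisions: +20 bytes, -30 bytes, +60 bytes.
example = bytes_added(42, [(120, 100), (90, 120), (150, 90)])
```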
User Metrics - Threshold
User revision history (over a predefined period)
(user ID, threshold_reached={0,1})
registration
Events since
registration up
to time "t"
if len(event_list) >= n:
    threshold_reached = True
else:
    threshold_reached = False
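The threshold check above can be sketched as a self-contained function. The argument names and the use of `datetime` timestamps are assumptions; only the comparison `len(event_list) >= n` is taken from the slide.

```python
from datetime import datetime, timedelta


def threshold(user_id, registration, event_times, t_hours, n):
    """Return (user ID, threshold_reached) with threshold_reached in {0, 1}."""
    cutoff = registration + timedelta(hours=t_hours)
    # Keep only events between registration and time "t" after registration.
    event_list = [ts for ts in event_times if registration <= ts <= cutoff]
    return (user_id, int(len(event_list) >= n))


# Hypothetical user with edits 1, 5 and 30 hours after registering.
registered = datetime(2013, 1, 1)
edits = [registered + timedelta(hours=h) for h in (1, 5, 30)]
reached = threshold(7, registered, edits, t_hours=24, n=2)
not_reached = threshold(7, registered, edits, t_hours=24, n=3)
```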
User Metrics - Revert Rate
User revision history (over a predefined period)
For each revision, look at the page history:
Past revisions: checksum i
Future revisions: checksum k
if checksum_i == checksum_k:
    reverted = True  # reverted!
(user ID, revert_rate, total_revisions)
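A minimal sketch of the checksum comparison described above: a revision counts as reverted when some later revision of the page restores a checksum seen before it. The data layout, function names, and the look-back/look-ahead window sizes are assumptions for illustration only.

```python
def is_reverted(page_checksums, index, look_back=15, look_ahead=15):
    """True if a future revision's checksum matches a past revision's checksum."""
    past = set(page_checksums[max(0, index - look_back):index])
    future = page_checksums[index + 1:index + 1 + look_ahead]
    return any(checksum_k in past for checksum_k in future)


def revert_rate(user_id, user_revisions, page_histories):
    """Return (user ID, revert_rate, total_revisions).

    `user_revisions` is a list of (page_id, index into that page's history);
    `page_histories` maps page_id to an ordered list of revision checksums.
    """
    total = len(user_revisions)
    reverted = sum(is_reverted(page_histories[page], i) for page, i in user_revisions)
    rate = reverted / total if total else 0.0
    return (user_id, rate, total)


# Page 1: the user's edit "b" is undone by a return to "a". Page 2: it survives.
histories = {1: ["a", "b", "a"], 2: ["x", "y", "z"]}
result = revert_rate(99, [(1, 1), (2, 1)], histories)
```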
User Metrics - Implementation
https://github.com/wikimedia/user_metrics
1. MySQL & Redis (future) data store
a. All of the backend dependency is abstracted out of
metrics classes
2. Python implementation - MySQLdb (SQLAlchemy)
3. Strategy Pattern of Parent user metrics class
4. Metrics built mainly from four core MediaWiki tables:
a. revision, user, page, logging
5. Python Decorator methods for handling metric
aggregation
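Points 3 and 5 above could look roughly like the following: a parent class that drives processing while each subclass supplies its own computation, plus a decorator that adapts a plain reducer to per-user metric rows. All class and function names here are hypothetical and are not taken from the user_metrics repository.

```python
from abc import ABC, abstractmethod
from functools import wraps
from statistics import mean


class UserMetric(ABC):
    """Parent class: shared driver, with the per-metric strategy in `compute`."""

    def process(self, cohort):
        # `cohort` is a hypothetical mapping of user ID -> list of event names.
        return [(user_id, self.compute(events)) for user_id, events in cohort.items()]

    @abstractmethod
    def compute(self, events):
        """Return the metric value for one user's events."""


class EditCount(UserMetric):
    def compute(self, events):
        return sum(1 for event in events if event == "edit")


def aggregator(func):
    """Decorator: let a reducer over plain values accept (user ID, value) rows."""
    @wraps(func)
    def wrapper(rows):
        return func([value for _, value in rows])
    return wrapper


@aggregator
def average(values):
    return mean(values)


rows = EditCount().process({1: ["edit", "edit"], 2: ["edit", "revert"]})
cohort_average = average(rows)
```

Keeping aggregation in a decorator means new reducers only have to operate on a list of values, independent of how rows are shaped.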
User Metrics
A Concrete Example
How can we use this
framework?
Example - Post Edit Feedback
What effect does editing feedback (confirmation/gratitude)
have on new editors?
Example - Results
An Extended Solution
Turn the data machine into a service.
Editor Metrics go beyond feature
experimentation ...
It became clear that...
● We needed a service to let clients generate their own
user metrics data sets
● We wanted to add a way for this methodology to
extend beyond E3 and potentially WMF
● A force multiplier was necessary to iterate on editor
data in more interesting ways (Machine Learning &
more sophisticated analyses)
User Metrics API [UMAPI]
Open Source (almost) RESTful API (Flask)
Computes metrics per user (User Metrics)
Combines metrics in different ways depending on
request types
HTTP response in JSON with resulting data
Store data internally for reuse
UMAPI
http://metrics.wikimedia.org/
https://github.com/wikimedia/user_metrics
https://github.com/rfaulkner/E3_analysis
https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
UMAPI - Overview
Service GET requests based on a combination of URL
paths + query params
e.g. /cohort/metric?date_start=..&date_end=...&...
Define user "cohorts" on which to operate
API engine maps to metrics request object (Mediator
Pattern) which is handed off to a request manager which
builds and runs request
JSON response
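To illustrate how a GET request might be broken into a cohort, a metric, and query parameters, here is a small sketch using only `urllib.parse`. The `/cohorts/<cohort>/<metric>` layout is assumed from the demo URLs in this deck; the function name and the returned dictionary shape are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def parse_request(url):
    """Split a metrics request URL into cohort, metric and query params."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    # Assumed path layout: /cohorts/<cohort>/<metric>
    _, cohort, metric = segments
    # parse_qs returns lists of values; keep the first value of each param.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"cohort": cohort, "metric": metric, "params": params}


parsed = parse_request(
    "http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average"
)
```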
UMAPI - Overview
Basic cPickle file cache for responses
Can substitute caching system (e.g. memcached)
Reusing request data where it overlaps
Request Types:
"Raw" - metrics per user
Aggregation over cohorts: mean, sum, median, etc.
Time series requests
UMAPI Architecture
HTTP GET request
JSON response
Apache (mod_wsgi)
Flask / App Server
Request Notifications Listener
Request Control
Response Control
Cache
MediaWiki Slaves
User Metrics API
Messaging Queues
Metrics objects - Separate Processes
Asynchronous Callbacks
UMAPI Architecture - Listeners
Request Notifications Callback
Handles job management and job-status notifications
Request Controller
Queues requests
Spawns jobs from metrics objects
Coordinates parameters
Response Controller
Reconstruct response data
Write to cache
UMAPI - User Cohorts
We will want to consider large groups of users, for instance,
a test or control group in some experiment:
Aggregate groups of users
lists of user IDs
Cohort registration (under construction)
adding new cohorts to the model
Single user endpoint
Boolean expressions over cohorts supported
User Metric Periods
How do we define the periods over which metrics are
measured?
Registration
Look "t" hours since user registration
User Defined
User supplied start and end dates
Conditional Registration
Registration as above with condition that registration falls within input
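The three period types above can be sketched as small helpers that each return a (start, end) window. The function names are hypothetical, and reading "falls within input" as "falls within the supplied start and end dates" is an assumption.

```python
from datetime import datetime, timedelta


def registration_period(registration, t_hours):
    """Window covering "t" hours since user registration."""
    return (registration, registration + timedelta(hours=t_hours))


def user_defined_period(start, end):
    """Window given directly by user-supplied start and end dates."""
    return (start, end)


def conditional_registration_period(registration, t_hours, start, end):
    """Registration window, but only if registration lies in [start, end].

    Assumption: users registered outside the range yield no window (None).
    """
    if start <= registration <= end:
        return registration_period(registration, t_hours)
    return None


registered = datetime(2013, 3, 1, 12)
window = registration_period(registered, 24)
inside = conditional_registration_period(
    registered, 24, datetime(2013, 3, 1), datetime(2013, 3, 31)
)
outside = conditional_registration_period(
    registered, 24, datetime(2013, 4, 1), datetime(2013, 4, 30)
)
```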
UMAPI - RequestMeta Module
Mediator Pattern to handle passing request data among
different portions of the architecture
Abstraction allows for easy filtering and default behaviour
of request parameters
Requests can easily be turned into reproducible and unique
hashes for caching
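One way to turn a request into a reproducible, unique cache key is to serialise its fields in a canonical order and hash the result. This is a sketch under that assumption; the function name, field names, and the choice of SHA-1 over JSON are illustrative and not taken from the RequestMeta module.

```python
import hashlib
import json


def request_hash(cohort, metric, params):
    """Hash a request so that equal requests always map to the same key."""
    # sort_keys gives a canonical serialisation regardless of param order.
    canonical = json.dumps(
        {"cohort": cohort, "metric": metric, "params": params}, sort_keys=True
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


key_a = request_hash("e3_pef1_confirmation", "threshold", {"t": "24", "n": "1"})
key_b = request_hash("e3_pef1_confirmation", "threshold", {"n": "1", "t": "24"})
key_c = request_hash("e3_pef1_confirmation", "threshold", {"t": "48", "n": "1"})
```

Here `key_a` and `key_b` are identical because only the parameter order differs, while `key_c` differs because a parameter value changed.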
How the Service Works
The user experience with user metrics.
UMAPI - Pipeline
[Diagram: a cohort or cohort combination feeds one of three request types (Raw, Aggregator, Time Series); each takes parameters and returns JSON.]
UMAPI - Frontend Flow
Job Queue
As you fire off requests the queue tracks what's running:
Response - Bytes Added
Response - Threshold
Response - Edit Rate
Response - Threshold w/ params
Response - Aggregation
Response - Aggregation
Response - Time series
Response - Combining Cohorts
"usertags_meta" - cohort definitions
Response - Combining Cohorts
Two intersecting cohorts:
Response - Combining Cohorts
AND (&)
Response - Combining Cohorts
OR (~)
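The AND (&) and OR (~) combinations map naturally onto set intersection and union over lists of user IDs. The sketch below assumes cohorts are held as Python sets and that expressions are evaluated strictly left to right; the function name and that precedence rule are assumptions.

```python
import re


def combine_cohorts(expression, cohorts):
    """Evaluate an expression such as "a&b" or "a~b" over named cohorts."""
    # Splitting on a captured group keeps the operators in the token list.
    tokens = re.split(r"([&~])", expression)
    result = set(cohorts[tokens[0]])
    for operator, name in zip(tokens[1::2], tokens[2::2]):
        members = set(cohorts[name])
        if operator == "&":
            result &= members   # AND: users present in both cohorts
        else:
            result |= members   # OR: users present in either cohort
    return result


# Two hypothetical intersecting cohorts of user IDs.
cohorts = {"a": {1, 2, 3}, "b": {2, 3, 4}}
both = combine_cohorts("a&b", cohorts)
either = combine_cohorts("a~b", cohorts)
```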
Response - Single user endpoint
e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000
Looking ahead ...
Connectivity metrics (additional metrics)
○ Graph database? (Neo4j, gremlin w/ postgreSQL)
○ User talk and common article edits
Better in-memory modelling
○ python-memcached
○ better reuse of generated data based on request data
Beyond English Wikipedia
Implemented!
Looking ahead ...
More sophisticated and robust data modelling
○ Modelling richer data: contribution histories, articles
edited, aggregate metrics
○ Classification: Logistic classifiers, Support Vector
Machine, Deep Belief Networks, Dimensionality
Reduction
○ Modelling revision text - Neural Networks, Hidden
Markov Models
DEMO!!
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/threshold?aggregator=average
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate
http://metrics.wikimedia.org/cohorts/e3_pef1_confirmation/edit_rate?aggregator=dist
http://metrics.wikimedia.org/cohorts/ryan_test_2/bytes_added?time_series&start=20120101&end=20130101&aggregator=sum&group=input&interval=720
The End
http://metrics.wikimedia.org/
stat1.wikimedia.org:4000
https://github.com/wikimedia/user_metrics
https://github.com/rfaulkner/E3_analysis
https://pypi.python.org/pypi/wmf_user_metrics/0.1.3-dev
Questions?


Measuring the New Wikipedia Community (PyData SV 2013)

  • 1. Measuring the New Wikipedia Community PyData 2013 Ryan Faulkner (rfaulkner@wikimedia.org) Wikimedia Foundation
  • 2. Overview Introduction Problem & Motivation Proposed Solution User Metrics A Short Example Extending the Solution Using the Tool Live Demo!!
  • 3. Introduction Me: Data Analyst at Wikimedia Machine Learning @ McGill Fundraising - A/B testing Editor Experiments - increasing the number of Active editors Editor Engagement Experiments (E3) team @ the Wikimedia Foundation Micro-feature experimentation
  • 5. Problem - Editor Decline http://strategy.wikimedia.org/wiki/Editor_Trends_Study
  • 6. Problem - Approach Can we stimulate the community of users to become more numerous and productive? ○ Focus on new users ■ Encourage contribution, make it easier ○ Lower the threshold for account creation ■ Bring more people in. ○ Rapid experimentation on features that retain more users and stimulate increased participation. ■ This will help us determine what works with less cost
  • 7. Problem - Evaluation ○ Data Consistency ■ Anomaly Detection ■ Auto-correlation (seasonality) ○ "A/B" testing ■ Hypothesis testing - student's t, chi-square ■ Linear / Logistic regression ○ Multivariate testing ■ Analysis of variance
  • 8. Problem - What we need Currently a lot of the work around analysis is done manually and is a large drain on resources: ○ Faster Data gathering ○ Knowing what we're logging and measuring & faster ETL ○ Faster Analysis ○ Broadening Service and iterating on results
  • 9. Problem - What we need Build better infrastructure around how we interpret and analyze our data. ○ Determine what to measure. ■ Rigorously define relevant metrics ○ Expose the metrics from our data store ■ Python is great for writing code quickly to handle tasks with data ■ Library support for data analysis (pandas, numpy)
  • 11. Solution - Proposed We need to measure User Behaviour "User Metrics" & "UMAPI" User Metrics & UMAPI Python implementation for gathering data from MediaWiki data stores, producing well defined metrics, and facilitating subsequent modelling and analysis. This includes a way to provide an interface for making different types of requests and returning standard responses.
  • 12. Solution - Why Bother What exactly do we gain by building these classes? Why not just query the database? 1. Reproducibility & Standardization 2. Extensibility 3. Concise definition 4. Increase turn around a. Multiprocessing to optimize metrics generation (e.g. Revert rate on 100K users via MySQL = 24hrs, via User Metrics < 10mins)
  • 13. Solution - Why Python? Why not C++, Java, or PHP? 1. Speed of development 2. Simplify the code base & easy extensibility a. more "Scientist Friendly" 3. Good support for data processing 4. Better integration for downstream data analysis 5. The way that metrics work lends them to "Pythonic" artifacts. List comprehension, decorator patterns, duck- typing, RESTful API.
  • 14. User Metrics How do we form a picture about what happens on Wikipedia?
  • 15. User Metrics - User activity Events (not exhaustive): ■ Registration ■ Making an edit ■ Contributions of Namespaces ■ Reverting edits ■ Blocking
  • 16. User Metrics - What do we want to know about users? ○ How much do they contribute? ○ How often do they contribute? ○ Potential vandals. Do they go on to be reverted, blocked, banned?
  • 17. User Metrics - Metrics Definitions https://meta.wikimedia.org/wiki/Research:Metrics Retention Metrics Survival(t) Boolean measure of an editor surviving beyond t Threshold(t,n) Boolean measure of an editor reaching activity threshold n by time t Live Account(t) Boolean measure of whether the new user click the edit button? Volume Metrics Edit Rate Float result of user's rate of contribution. Content Integer bytes added by revision and edit count. Sessions Average session length (future) Time to Threshold Time to reach a threshold (e.g. first edit)
  • 18. User Metrics - Metrics Definitions Content Quality Revert Rate Float representing the proportion of revisions reverted. Block Boolean indicating a block event on the user. Content Persistence Integer indicating how long this user's edits survive (future) Contribution Type Namespace of Edits Integer edit counts in all namespaces. Scale of Change Float representation of fraction of total page content modified (future)
  • 19. User Metrics - Bytes Added user revision history (over a predifined period) Revision k: byte increase (user ID, bytes_added, bytes_removed, edit count)
  • 20. User Metrics - Threshold user revision history (over a predefined period) (user ID, threshold_reached={0,1}) registration Events since registration up to time "t" if len(event_list) >= n: threshold_reached = True else: threshold_reached = False
  • 21. User Metrics - Revert Rate user revision history (over a predefined period) for each revision look at page history Future Revisions Past Revisions checksum k checksum i if checksum i == checksum k: # reverted! (user ID, revert_rate, total_revisions)
  • 22. User Metrics - Implementation https://github.com/wikimedia/user_metrics 1. MySQL & Redis (future) data store a. All of the backend dependency is abstracted out of metrics classes 2. Python implementation - MySQLdb (SQLalchemy) 3. Strategy Pattern of Parent user metrics class 4. Metrics built mainly from four core MediaWiki tables: a. revision, user, page, logging 5. Python Decorator methods for handling metric aggregation
  • 24. A Concrete Example How can we use this framework?
  • 25. Example - Post Edit Feedback What effect does editing feedback (confirmation/gratitude) have on new editors?
  • 27. An Extended Solution Turn the data machine into a service.
  • 28. Editor Metrics go beyond feature experimentation ... It became clear that... ● We needed a service to let clients generate their own user metrics data sets ● We wanted to add a way for this methodology to extend beyond E3 and potentially WMF ● A force multiplier was necessary to iterate on editor data in more interesting ways (Machine Learning & more sophisticated analyses)
  • 29. User Metrics API [UMAPI] Open Source (almost) RESTful API (Flask) Computes metrics per user (User Metrics) Combines metrics in different ways depending on request types HTTP response in JSON with resulting data Store data internally for reuse
  • 31. UMAPI - Overview Service GET requests based on a combination of URL paths + query params, e.g. /cohort/metric?date_start=..&date_end=...&... Define user "cohorts" on which to operate. The API engine maps each request to a metrics request object (Mediator Pattern), which is handed off to a request manager that builds and runs the request. JSON response.
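A client-side sketch of composing such a GET request URL from a path and query parameters. It assumes that "cohort" and "metric" in the pattern are placeholders for a cohort name and a metric name; the host, the cohort name, and the date format below are made up for illustration and are not taken from the service.

```python
from urllib.parse import urlencode


def build_request_url(base, cohort, metric, **params):
    """Compose <base>/<cohort>/<metric>?<query> with sorted query params."""
    path = f"{base.rstrip('/')}/{cohort}/{metric}"
    query = urlencode(sorted(params.items()))
    return f"{path}?{query}" if query else path


url = build_request_url(
    "http://example.org/",      # hypothetical host
    "test_cohort",              # hypothetical cohort name
    "threshold",
    date_start="2013-01-01",    # date format is an assumption
    date_end="2013-02-01",
)
```

Sorting the parameters keeps the URL stable for identical requests, which is convenient when responses are cached by request.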
  • 32. UMAPI - Overview Basic cPickle file cache for responses. Can substitute another caching system (e.g. memcached). Reuses request data where it overlaps. Request Types: "Raw" (metrics per user); aggregation over cohorts (mean, sum, median, etc.); time series requests.
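A pickle-backed file cache for responses might look like the sketch below. It uses Python 3's `pickle` (the successor to `cPickle`); the `PickleCache` class, its `get`/`set` interface, and the one-file-per-key layout are assumptions made for illustration, chosen so that a memcached-style client could replace it later.

```python
import os
import pickle


class PickleCache:
    """Minimal file cache: one pickle file per request key."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, f"{key}.pkl")

    def get(self, key):
        """Return the cached response, or None on a cache miss."""
        try:
            with open(self._path(key), "rb") as handle:
                return pickle.load(handle)
        except FileNotFoundError:
            return None

    def set(self, key, value):
        """Serialize the response to disk under the given key."""
        with open(self._path(key), "wb") as handle:
            pickle.dump(value, handle)
```

A caller would check `cache.get(key)` before running a request and call `cache.set(key, response)` afterwards; keys are assumed to be filesystem-safe strings such as hex digests.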
  • 33. UMAPI Architecture [diagram labels: HTTP GET request; JSON response; Apache (mod_wsgi); Flask / App Server; Request Notifications Listener; Request Control; Response Control; Cache; MediaWiki Slaves; User Metrics API; Messaging Queues; Metrics objects - Separate Processes; Asynchronous Callbacks]
  • 34. UMAPI Architecture - Listeners Request Notifications Callback: handles managing and notifications on job status. Request Controller: queues requests; spawns jobs from metrics objects; coordinates parameters. Response Controller: reconstructs response data; writes to cache.
  • 35. UMAPI - User Cohorts We will want to consider large groups of users, for instance, a test or control group in some experiment: Aggregate groups of users (lists of user IDs). Cohort registration (under construction): adding new cohorts to the model. Single user endpoint. Boolean expressions over cohorts supported.
  • 36. User Metric Periods How do we define the periods over which metrics are measured? Registration Look "t" hours since user registration User Defined User supplied start and end dates Conditional Registration Registration as above with condition that registration falls within input
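The three period definitions can be sketched as functions that return a (start, end) window. This assumes "t" is in hours, as stated for the Registration case, and that the "input" in the Conditional Registration case means the user-supplied start and end dates; the function names are illustrative.

```python
from datetime import datetime, timedelta


def registration_period(registration, t_hours):
    """Registration: look t hours ahead from the user's registration."""
    return registration, registration + timedelta(hours=t_hours)


def user_defined_period(date_start, date_end):
    """User Defined: use the supplied start and end dates as-is."""
    return date_start, date_end


def conditional_registration_period(registration, t_hours, date_start, date_end):
    """Conditional Registration: as Registration, but only for users who
    registered inside [date_start, date_end] (assumed meaning of "input").
    Returns None when the user falls outside that range.
    """
    if not (date_start <= registration <= date_end):
        return None
    return registration_period(registration, t_hours)


reg = datetime(2013, 3, 10, 12, 0)
window = conditional_registration_period(
    reg, 24, datetime(2013, 3, 1), datetime(2013, 3, 31)
)
```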
  • 37. UMAPI - RequestMeta Module Mediator Pattern to handle passing request data among different portions of the architecture Abstraction allows for easy filtering and default behaviour of request parameters Requests can easily be turned into reproducible and unique hashes for caching
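One way to get reproducible, unique request hashes is to fill in defaults, drop unset parameters, serialize the result canonically, and digest it. The sketch below assumes that approach; the function name, the JSON canonical form, and the choice of SHA-1 are illustrative and not necessarily what the RequestMeta module does.

```python
import hashlib
import json


def request_hash(cohort, metric, params, defaults=None):
    """Return a stable hex digest for a request.

    Defaults are applied first, unset (None) parameters are ignored, and
    keys are sorted so parameter order never changes the hash.
    """
    merged = dict(defaults or {})
    merged.update({k: v for k, v in params.items() if v is not None})
    canonical = json.dumps(
        {"cohort": cohort, "metric": metric, "params": merged},
        sort_keys=True,
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


key_a = request_hash("test_cohort", "threshold", {"t": 24, "n": 1})
key_b = request_hash("test_cohort", "threshold", {"n": 1, "t": 24})
```

Here `key_a` and `key_b` are equal because the two requests differ only in parameter order, so both would map to the same cache entry.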
  • 38. How the Service Works The user experience with user metrics.
  • 39. UMAPI - Pipeline [diagram labels: Cohort or combo; Raw; Time Series; Aggregator; Params; JSON]
  • 41. Job Queue As you fire off requests the queue tracks what's running:
  • 45. Response - Threshold w/ params
  • 48. Response - Time series
  • 49. Response - Combining Cohorts "usertags_meta" - cohort definitions
  • 50. Response - Combining Cohorts Two intersecting cohorts:
  • 51. Response - Combining Cohorts AND (&)
  • 52. Response - Combining Cohorts OR (~)
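The AND (&) and OR (~) combinations map naturally onto set intersection and union over user IDs. The sketch below evaluates a flat expression of cohort names; giving & higher precedence than ~ and the absence of parentheses are assumptions for illustration, and the cohort names are hypothetical.

```python
def combine_cohorts(expression, cohorts):
    """Evaluate an expression such as 'a&b~c' over cohorts of user IDs.

    '&' intersects cohorts and '~' unions them; '&' is assumed to bind
    tighter than '~'.
    """
    result = set()
    for term in expression.split("~"):
        names = [name.strip() for name in term.split("&")]
        members = set(cohorts[names[0]])
        for name in names[1:]:
            members &= set(cohorts[name])
        result |= members
    return result


cohorts = {"test": {1, 2, 3}, "control": {2, 3, 4}, "staff": {9}}
both = combine_cohorts("test&control", cohorts)
either = combine_cohorts("test~control", cohorts)
```

For these two intersecting cohorts, `both` contains the shared users 2 and 3, and `either` contains users 1 through 4.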
  • 53. Response - Single user endpoint e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000
  • 54. Looking ahead ... Connectivity metrics (additional metrics) ○ Graph database? (Neo4j, gremlin w/ postgreSQL) ○ User talk and common article edits Better in-memory modelling ○ python-memcached ○ better reuse of generated data based on request data Beyond English Wikipedia Implemented!
  • 55. Looking ahead ... More sophisticated and robust data modelling ○ Modelling richer data: contribution histories, articles edited, aggregate metrics ○ Classification: Logistic classifiers, Support Vector Machine, Deep Belief Networks, Dimensionality Reduction ○ Modelling revision text - Neural Networks, Hidden Markov Models