3. Introduction
Me: Data Analyst at Wikimedia
Machine Learning @ McGill
Fundraising - A/B testing
Editor Experiments - increasing the number of active editors
Editor Engagement Experiments (E3) team @ the
Wikimedia Foundation
Micro-feature experimentation
5. Problem - Editor Decline
http://strategy.wikimedia.org/wiki/Editor_Trends_Study
6. Problem - Approach
Can we stimulate the community of users to become more
numerous and productive?
○ Focus on new users
■ Encourage contribution, make it easier
○ Lower the threshold for account creation
■ Bring more people in.
○ Rapid experimentation on features that retain more
users and stimulate increased participation.
■ This will help us determine what works at lower cost
7. Problem - Evaluation
○ Data Consistency
■ Anomaly Detection
■ Auto-correlation (seasonality)
○ "A/B" testing
■ Hypothesis testing - student's t, chi-square
■ Linear / Logistic regression
○ Multivariate testing
■ Analysis of variance
8. Problem - What we need
Currently a lot of the work around analysis is done
manually and is a large drain on resources:
○ Faster Data gathering
○ Knowing what we're logging and measuring &
faster ETL
○ Faster Analysis
○ Broadening Service and iterating on results
9. Problem - What we need
Build better infrastructure around how we interpret and
analyze our data.
○ Determine what to measure.
■ Rigorously define relevant metrics
○ Expose the metrics from our data store
■ Python is great for writing code quickly to handle
tasks with data
■ Library support for data analysis (pandas,
numpy)
11. Solution - Proposed
We need to measure User Behaviour
"User Metrics" & "UMAPI"
User Metrics & UMAPI
Python implementation for gathering data from MediaWiki data stores,
producing well defined metrics, and facilitating subsequent modelling and
analysis. This includes a way to provide an interface for making different types
of requests and returning standard responses.
12. Solution - Why Bother
What exactly do we gain by building these
classes? Why not just query the database?
1. Reproducibility & Standardization
2. Extensibility
3. Concise definition
4. Increase turnaround
a. Multiprocessing to optimize metrics generation
(e.g. revert rate on 100K users:
via MySQL = 24 hrs,
via User Metrics < 10 mins)
13. Solution - Why Python?
Why not C++, Java, or PHP?
1. Speed of development
2. Simplify the code base & easy extensibility
a. more "Scientist Friendly"
3. Good support for data processing
4. Better integration for downstream data analysis
5. The way metrics work lends itself to "Pythonic"
artifacts: list comprehensions, decorator patterns, duck
typing, RESTful APIs.
15. User Metrics - User activity
Events (not exhaustive):
■ Registration
■ Making an edit
■ Contributions by namespace
■ Reverting edits
■ Blocking
16. User Metrics - What do we want to know about users?
○ How much do they contribute?
○ How often do they contribute?
○ Potential vandals. Do they go on to be reverted,
blocked, banned?
17. User Metrics - Metrics Definitions
https://meta.wikimedia.org/wiki/Research:Metrics
Retention Metrics
Survival(t) Boolean measure of an editor surviving beyond t
Threshold(t,n) Boolean measure of an editor reaching activity threshold n by time t
Live Account(t) Boolean measure of whether the new user clicked the edit button by time t
Volume Metrics
Edit Rate Float result of user's rate of contribution.
Content Integer bytes added by revision and edit count.
Sessions Average session length (future)
Time to Threshold Time to reach a threshold (e.g. first edit)
18. User Metrics - Metrics Definitions
Content Quality
Revert Rate Float representing the proportion of revisions reverted.
Block Boolean indicating a block event on the user.
Content Persistence Integer indicating how long this user's edits survive (future)
Contribution Type
Namespace of Edits Integer edit counts in all namespaces.
Scale of Change Float representation of fraction of total page content modified (future)
19. User Metrics - Bytes Added
Diagram: scan the user's revision history (over a predefined
period); for each revision k, record the byte increase.
Output tuple: (user ID, bytes_added, bytes_removed, edit count)
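The Bytes Added metric described above might be sketched as follows; the input format (pairs of revision length and parent-revision length) is an assumption for illustration, not the project's actual interface.

```python
# Hypothetical sketch of the "Bytes Added" metric: for each of a user's
# revisions in the period, compare the revision's length to its parent's
# length and accumulate positive and negative deltas separately.

def bytes_added(revisions):
    """revisions: iterable of (rev_len, parent_len) pairs for one user."""
    added = removed = edit_count = 0
    for rev_len, parent_len in revisions:
        delta = rev_len - parent_len
        if delta >= 0:
            added += delta
        else:
            removed += -delta
        edit_count += 1
    return added, removed, edit_count

# e.g. three revisions: +120 bytes, -30 bytes, +5 bytes
print(bytes_added([(220, 100), (190, 220), (195, 190)]))  # (125, 30, 3)
```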
20. User Metrics - Threshold
Diagram: starting from registration, collect the user's events up
to time "t", scanning the revision history (over a predefined
period).
Output tuple: (user ID, threshold_reached={0,1})

if len(event_list) >= n:
    threshold_reached = True
else:
    threshold_reached = False
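The pseudocode above expands into a minimal runnable sketch of Threshold(t, n); representing the event list as edit timestamps in hours since registration is an assumption for illustration.

```python
# Minimal sketch of the Threshold(t, n) metric, assuming events are
# edit timestamps measured in hours since the user's registration.

def threshold(edit_times, t, n):
    """True if the user made at least n edits within t hours of registration."""
    events_in_window = [ts for ts in edit_times if ts <= t]
    return len(events_in_window) >= n

# A user with edits at 1h, 5h and 30h reaches n=2 by t=24h, but not n=3.
print(threshold([1, 5, 30], t=24, n=2))  # True
print(threshold([1, 5, 30], t=24, n=3))  # False
```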
21. User Metrics - Revert Rate
Diagram: scan the user's revision history (over a predefined
period); for each revision, look at the page history's past and
future revisions and compare checksums:

if checksum_i == checksum_k:
    # reverted!

Output tuple: (user ID, revert_rate, total_revisions)
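The revert check above can be sketched in a few lines: a revision counts as reverted if some future revision's checksum matches a past revision's checksum (the page returned to an earlier state). The input format here is illustrative, not the project's actual data model.

```python
# Hedged sketch of revert-rate computation via checksum comparison.

def is_reverted(past_checksums, future_checksums):
    """True if any future revision restores a past page state."""
    return bool(set(past_checksums) & set(future_checksums))

def revert_rate(revisions):
    """revisions: list of (past_checksums, future_checksums) per revision."""
    total = len(revisions)
    reverted = sum(is_reverted(p, f) for p, f in revisions)
    return (reverted / total if total else 0.0), total

rate, total = revert_rate([
    (["a1", "b2"], ["b2", "c3"]),  # reverted: state "b2" reappears
    (["d4"], ["e5"]),              # not reverted
])
print(rate, total)  # 0.5 2
```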
22. User Metrics - Implementation
https://github.com/wikimedia/user_metrics
1. MySQL & Redis (future) data store
a. All of the backend dependency is abstracted out of
metrics classes
2. Python implementation - MySQLdb (SQLalchemy)
3. Strategy Pattern of Parent user metrics class
4. Metrics built mainly from four core MediaWiki tables:
a. revision, user, page, logging
5. Python Decorator methods for handling metric
aggregation
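Point 5 above can be illustrated with a small decorator-based registry; this is a sketch in the spirit of the project's approach, not its actual code, and all names here are invented.

```python
# Illustrative sketch: a decorator registers aggregator functions that
# can then be applied to per-user metric values over a cohort.

AGGREGATORS = {}

def aggregator(name):
    """Decorator that registers a function as a named aggregator."""
    def wrap(func):
        AGGREGATORS[name] = func
        return func
    return wrap

@aggregator("mean")
def mean_agg(values):
    return sum(values) / len(values) if values else 0.0

@aggregator("sum")
def sum_agg(values):
    return sum(values)

# Aggregate per-user edit counts over a cohort:
edit_counts = [3, 7, 10]
print(AGGREGATORS["sum"](edit_counts))   # 20
print(AGGREGATORS["mean"](edit_counts))  # ~6.67
```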
28. Editor Metrics go beyond feature experimentation ...
It became clear that...
● We needed a service to let clients generate their own
user metrics data sets
● We wanted to add a way for this methodology to
extend beyond E3 and potentially WMF
● A force multiplier was necessary to iterate on editor
data in more interesting ways (Machine Learning &
more sophisticated analyses)
29. User Metrics API [UMAPI]
Open Source (almost) RESTful API (Flask)
Computes metrics per user (User Metrics)
Combines metrics in different ways depending on
request types
HTTP response in JSON with resulting data
Store data internally for reuse
31. UMAPI - Overview
Serves GET requests based on a combination of URL
paths + query params
e.g. /cohort/metric?date_start=..&date_end=...&...
Define user "cohorts" on which to operate
The API engine maps the request to a metrics request object
(Mediator Pattern), which is handed off to a request manager
that builds and runs the request
JSON response
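The URL-to-request mapping above can be sketched with the standard library; the cohort name, metric name, and parameter names here are illustrative, not the service's actual identifiers.

```python
# Sketch of mapping a UMAPI-style GET URL of the form
# /cohort/metric?date_start=..&date_end=.. to a request dict.

from urllib.parse import urlparse, parse_qs

def parse_request(url):
    """Split the path into cohort and metric, and flatten query params."""
    parts = urlparse(url)
    cohort, metric = parts.path.strip("/").split("/")
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return {"cohort": cohort, "metric": metric, **params}

req = parse_request("/e3_test/threshold?date_start=2013-01-01&date_end=2013-02-01")
print(req)
```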
32. UMAPI - Overview
Basic cPickle file cache for responses
Can substitute caching system (e.g. memcached)
Reusing request data where it overlaps
Request Types:
"Raw" - metrics per user
Aggregation over cohorts: mean, sum, median, etc.
Time series requests
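The basic pickle file cache mentioned above might look roughly like this (file layout and function names are assumptions for illustration); swapping in memcached would mean replacing these two functions behind the same interface.

```python
# Sketch of a basic pickle file cache for responses, keyed by a
# request hash so overlapping requests can reuse stored data.

import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def _cache_path(request_hash):
    return os.path.join(CACHE_DIR, request_hash + ".pkl")

def cache_get(request_hash):
    """Return the cached response, or None on a miss."""
    path = _cache_path(request_hash)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None

def cache_put(request_hash, response):
    """Serialize the response to a per-request file."""
    with open(_cache_path(request_hash), "wb") as f:
        pickle.dump(response, f)

cache_put("abc123", {"metric": "threshold", "rows": [[1, True]]})
print(cache_get("abc123"))
```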
33. UMAPI Architecture
Diagram: an HTTP GET request enters through Apache (mod_wsgi)
to the Flask app server; request notifications go to a listener;
request control and response control coordinate with metrics
objects running as separate processes via messaging queues and
asynchronous callbacks; metrics are computed against the
MediaWiki slave databases; response control writes to the cache
and the result is returned as a JSON response.
34. UMAPI Architecture - Listeners
Request Notifications Callback
Manages jobs and issues notifications on job status
Request Controller
Queues requests
Spawns jobs from metrics objects
Coordinates parameters
Response Controller
Reconstructs response data
Writes to cache
35. UMAPI - User Cohorts
We will want to consider large groups of users, for instance,
a test or control group in some experiment:
Aggregate groups of users
lists of user IDs
Cohort registration (under construction)
adding new cohorts to the model
Single user endpoint
Boolean expressions over cohorts supported
36. User Metric Periods
How do we define the periods over which metrics are
measured?
Registration
Look at the "t" hours since user registration
User Defined
User supplied start and end dates
Conditional Registration
Registration as above, with the condition that the registration falls within the input date range
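The three period types above might be computed as follows; the function and parameter names are illustrative, not taken from the codebase.

```python
# Sketch of resolving a (start, end) measurement window for one user
# under the three period types: registration, user-defined, and
# conditional registration.

from datetime import datetime, timedelta

def metric_period(mode, registration=None, t_hours=24, start=None, end=None):
    """Return the measurement window, or None if the user is excluded."""
    if mode == "registration":
        # Measure for t hours after the user registered.
        return registration, registration + timedelta(hours=t_hours)
    if mode == "user_defined":
        # User-supplied start and end dates.
        return start, end
    if mode == "conditional_registration":
        # As "registration", but only if registration falls in [start, end].
        if start <= registration <= end:
            return registration, registration + timedelta(hours=t_hours)
        return None
    raise ValueError("unknown period mode: %s" % mode)

reg = datetime(2013, 1, 15)
print(metric_period("registration", registration=reg, t_hours=24))
print(metric_period("conditional_registration", registration=reg,
                    start=datetime(2013, 2, 1),
                    end=datetime(2013, 3, 1)))  # None: registered too early
```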
37. UMAPI - RequestMeta Module
Mediator Pattern to handle passing request data among
different portions of the architecture
Abstraction allows for easy filtering and default behaviour
of request parameters
Requests can easily be turned into reproducible and unique
hashes for caching
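The reproducible hashing described above could look roughly like this; canonicalizing parameters via sorted JSON keys is an assumption for illustration, not necessarily what RequestMeta does.

```python
# Sketch of deriving a reproducible, unique cache key from request
# parameters: identical requests hash identically regardless of key order.

import hashlib
import json

def request_hash(params):
    """Canonicalize params (sorted keys) and hash the result."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

a = request_hash({"cohort": "e3", "metric": "threshold", "t": 24})
b = request_hash({"t": 24, "metric": "threshold", "cohort": "e3"})
print(a == b)  # True: key order doesn't matter
```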
38. How the Service Works
The user experience with user metrics.
53. Response - Single user endpoint
e.g. http://metrics-api.wikimedia.org/user/Renklauf/threshold?t=10000
54. Looking ahead ...
Connectivity metrics (additional metrics)
○ Graph database? (Neo4j, gremlin w/ postgreSQL)
○ User talk and common article edits
Better in-memory modelling
○ python-memcached
○ better reuse of generated data based on request data
Beyond English Wikipedia
Implemented!
55. Looking ahead ...
More sophisticated and robust data modelling
○ Modelling richer data: contribution histories, articles
edited, aggregate metrics
○ Classification: Logistic classifiers, Support Vector
Machine, Deep Belief Networks, Dimensionality
Reduction
○ Modelling revision text - Neural Networks, Hidden
Markov Models