1. Implementing and Visualizing Click-Stream Data with MongoDB
Jan 22, 2013 - New York MongoDB User Group
Cameron Sim - LearnVest.com
Monday, April 15, 13
2. Agenda
About LearnVest
High-Level Application Architecture
Data Capture
Event Packaging
MongoDB Data Warehousing
Loading & Visualization
Finishing up
3. LearnVest Inc.
www.learnvest.com
Mission Statement
Aiming to make Financial Planning as accessible as having a gym membership
Company
• Founded in 2008 by Alexa Von Tobel, CEO
• 50+ People and Growing rapidly
• Based in NYC
Key Products
• Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage)
• Original and Syndicated Newsletter Content
• Financial Planning (tiered product offering)
Platforms
• Web & iPhone
Stack
• Operational: WordPress, Backbone.js, Node.js, Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x
• Analytics: MongoDB 2.2.0 (3-node replica-set), Java 6, Spring 3, pyMongo, Django 1.4
13. Philosophy For Data Collection
Capture Everything
• User-Driven events over web and mobile
• System-level exceptions
• Everything else
Temporary Data
• Be ‘ok’ with approximate data
• Operational Databases are the system of record
Aggregate events as they come in
• Remove the overhead of basic metrics (counts, sums) on core events
• Group by user unique id and increment counts per event, over time-dimensions
(day, week-ending, month, year)
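The "aggregate events as they come in" idea can be sketched in a few lines of Python. This is an in-memory illustration only, not the LearnVest implementation: the function names, the label formats, and the choice of Sunday as the week-ending day are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta


def time_buckets(ts):
    """Return one bucket label per time dimension for an event timestamp."""
    # Assumption: a week ends on Sunday (weekday() == 6).
    week_ending = ts.date() + timedelta(days=6 - ts.weekday())
    return {
        "day": ts.strftime("%Y-%m-%d"),
        "week": week_ending.strftime("%Y-%m-%d"),
        "month": ts.strftime("%Y-%m"),
        "year": ts.strftime("%Y"),
    }


def record_event(counts, user_id, event_type, ts):
    """Increment the per-user, per-event counter in every time dimension."""
    for dimension, label in time_buckets(ts).items():
        counts[(user_id, event_type, dimension, label)] += 1


counts = Counter()
record_event(counts, "u1", "user.login", datetime(2013, 1, 22, 9, 30))
record_event(counts, "u1", "user.login", datetime(2013, 1, 22, 18, 5))
record_event(counts, "u1", "user.login", datetime(2013, 1, 23, 8, 0))
```

Because every event updates its counters on arrival, basic metrics such as daily logins per user can be read directly instead of being recomputed from the raw events.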
14. Data Capture
iOS
- (void) sendAnalyticEventType:(NSString*)eventType
object:(NSString*)object
name:(NSString*)name
page:(NSString*)page
source:(NSString*)source
{
// params is the top-level payload; eventData carries the optional event details
NSMutableDictionary *params = [NSMutableDictionary dictionary];
NSMutableDictionary *eventData = [NSMutableDictionary dictionary];
if (eventType!=nil) [params setObject:eventType forKey:@"eventType"];
if (object!=nil) [eventData setObject:object forKey:@"object"];
if (name!=nil) [eventData setObject:name forKey:@"name"];
if (page!=nil) [eventData setObject:page forKey:@"page"];
if (source!=nil) [eventData setObject:source forKey:@"source"];
// eventData is never nil here, so it is always attached (possibly empty)
[params setObject:eventData forKey:@"eventData"];
[[LVNetworkEngine sharedManager] analytics_send:params];
}
15. Data Capture
WEB (JavaScript)
function internalTrackPageView() {
var cookie = {
userContext: jQuery.cookie('UserContextCookie')
};
var trackEvent = {
eventType: "pageView",
eventData: {
page: window.location.pathname + window.location.search
}
};
// AJAX
jQuery.ajax({
url: "/api/track",
type: "POST",
dataType: "json",
data: JSON.stringify(trackEvent),
// Set Request Headers
beforeSend: function (xhr, settings) {
xhr.setRequestHeader('Accept', 'application/json');
xhr.setRequestHeader('User-Context', cookie.userContext);
if(settings.type === 'PUT' || settings.type === 'POST') {
xhr.setRequestHeader('Content-Type', 'application/json');
}
}
});
}
16. Bus Event Packaging
1. Spring 3 RESTful service layer; controller methods define the eventCode via the @Tracking
annotation
2. Custom Interceptor class extends HandlerInterceptorAdapter and implements
postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher
3. EventPublisher publishes to a common event bus queue with multiple subscribers, one of
which packages the eventPayload Map<String, Object> object and forwards it to the Analytics
REST Service
17. Bus Event Packaging
1) Spring RestController Methods
Interface
@RequestMapping(value = "/user/login", method = RequestMethod.POST,
headers="Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
HttpServletRequest request);
Concrete/Impl Class
@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
HttpServletRequest request){
//Implementation
return event;
}
18. Bus Event Packaging
2) Custom Interceptor class extends HandlerInterceptorAdapter
protected void handleTracking(String trackingCode, Map<String, Object> modelMap,
HttpServletRequest request) {
Map<String, Object> responseModel = new HashMap<String, Object>();
// remove non-serializables & copy over data from modelMap
try {
this.eventPublisher.publish(trackingCode, responseModel, request);
} catch (Exception e) {
log.error("Error tracking event '" + trackingCode + "' : "
+ ExceptionUtils.getStackTrace(e));
}
}
19. Bus Event Packaging
3) EventPublisher normalizes the payload and sends it to the Analytics Service
public void publish (String eventCode, Map<String,Object> eventData,
HttpServletRequest request) {
Map<String,Object> payload = new HashMap<String,Object>();
String eventId=UUID.randomUUID().toString();
Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);
//Normalize message
payload.put("eventType", eventData.get("eventType"));
// each key reads its own entry (the slide repeated "eventType" for all three)
payload.put("eventData", eventData.get("eventData"));
payload.put("version", eventData.get("version"));
payload.put("eventId", eventId);
payload.put("eventTime", new Date());
payload.put("request", requestMap);
.
.
.
//Send to the Analytics Service for MongoDB persistence
}
public void sendPost(EventPayload payload){
HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}
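The three packaging steps (annotated controller method, interceptor, publisher with subscribers) can be sketched in-process with Python. This is a simplified illustration, not the Spring implementation: it runs synchronously rather than via @Async and a message queue, and the names `tracking`, `dispatch` and `EventPublisher.subscribe` are hypothetical stand-ins.

```python
import uuid
from datetime import datetime, timezone


def tracking(event_code):
    """Tag a handler with an event code, playing the role of @Tracking."""
    def decorate(handler):
        handler.event_code = event_code
        return handler
    return decorate


class EventPublisher:
    """Fan a normalized payload out to every subscriber."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event_code, event_data):
        payload = {
            "eventId": str(uuid.uuid4()),
            "eventType": event_code,
            "eventData": event_data,
            "eventTime": datetime.now(timezone.utc),
        }
        for callback in self.subscribers:
            callback(payload)


def dispatch(handler, event, publisher):
    """Run the handler, then publish if it is tagged (the postHandle step)."""
    response = handler(event)
    event_code = getattr(handler, "event_code", None)
    if event_code is not None:
        publisher.publish(event_code, response)
    return response


@tracking("user.login")
def user_login(event):
    return event


received = []  # stands in for the subscriber that forwards to the analytics service
publisher = EventPublisher()
publisher.subscribe(received.append)
dispatch(user_login, {"page": "/login"}, publisher)
```

Keeping the event code on the handler means the handler body contains no tracking logic; untagged handlers pass through `dispatch` without publishing anything.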
23. MongoDB Data Warehousing
MongoDB Information
• v2.2.0
• 3-node replica-set
• 1 Large (primary), 2x Medium (secondary) AWS Amazon-Linux machines
• Each with single 500GB EBS volumes mounted to /opt/data
MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager
Volumes
~1M events daily on web, ~600K on mobile
2-3 GB per day at start, slowed to ~1GB per day
Currently at 78GB (collecting since August 2012)
Future Scaling Strategy
• Set up 2nd Replica-Set
• Shard replica-sets to n at 60% / 250GB per EBS volume
• Shard key probably based on sequential mix of email_address & additional string
24. MongoDB Data Warehousing
Approach
1. Persist all events, bucketed by source:-
WEB
MOBILE
2. Persist all events, bucketed by source, event code and time:-
WEB/MOBILE
user.login
time (day, week-ending, month, year)
3. Insert into collection e_web / e_mobile
4. Upsert into:-
e_web_user_login_day
e_web_user_login_week
e_web_user_login_month
e_web_user_login_year
5. Predictable model for scaling and measuring business growth
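The naming scheme and the upsert step above can be sketched as a small Python helper that builds one counter update per time dimension. It is a sketch under assumptions: the field names `user`, `date` and `count` and the label formats are hypothetical, and the helper only builds the operations rather than talking to a database.

```python
def rollup_upserts(source, event_code, user_id, labels):
    """Build one (collection, query, update) triple per time dimension.

    `labels` maps each dimension (day, week, month, year) to the bucket
    label the event falls into.
    """
    # e.g. source "WEB" and event code "user.login" give "e_web_user_login"
    prefix = "e_%s_%s" % (source.lower(), event_code.replace(".", "_"))
    return [
        (
            "%s_%s" % (prefix, dimension),
            {"user": user_id, "date": label},  # hypothetical field names
            {"$inc": {"count": 1}},            # create at 1 or increment
        )
        for dimension, label in labels.items()
    ]


ops = rollup_upserts(
    "WEB",
    "user.login",
    "u1",
    {"day": "2013-01-22", "week": "2013-01-27", "month": "2013-01", "year": "2013"},
)
```

With pymongo, each triple could then be applied as `db[collection].update_one(query, update, upsert=True)`, so the first event in a bucket creates the document and later events increment it.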
28. MongoDB Data Warehousing
Indexing Strategy
• Indexes on core collections (e_web and e_mobile) come in under 3GB on 7.5GB Large
Instance and 3.75GB on Medium instances
• Split datetime in two fields and compound index on date with other fields like eventType
and user unique id (user-context)
• Heavy insertion rates, much lower read rates, so the fewer indexes the better
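Splitting the datetime into two fields might look like the following Python sketch. The field names `created_date` and `date` appear elsewhere in these slides; the day-string format, the `userContext` field name and the index field order are assumptions.

```python
from datetime import datetime


def with_split_date(event, ts):
    """Return a copy of the event carrying a full timestamp and a day string."""
    doc = dict(event)
    doc["created_date"] = ts               # full datetime, for range queries
    doc["date"] = ts.strftime("%Y-%m-%d")  # day only, for equality matches
    return doc


# Compound index specification in the (field, direction) form that pymongo's
# create_index accepts; 1 means ascending.
# "userContext" is a hypothetical field name for the user unique id.
EVENT_INDEX = [("date", 1), ("eventType", 1), ("userContext", 1)]

doc = with_split_date(
    {"eventType": "pageView", "userContext": "u1"},
    datetime(2013, 1, 22, 9, 30),
)
```

A query for one day's events of one type can then use equality on `date` and `eventType`, which a single compound index serves without a separate index on the full timestamp.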
30. Loading & Visualization
Objective
• Show historic and intraday stats on core use cases (logins, conversions)
• Show user funnel rates on conversion pages
• Show general usability - how do users really use the Web and iOS platforms?
Non-Functionals
• Intraday doesn’t need to be “real-time”, polling is good enough for now
• Overnight batch job for historic must scale horizontally
General Implementation Strategy
• Do all heavy lifting & object manipulation in the service; the UI should just display a graph or table
• Modularize the service to be able to regenerate any graphs/tables without a full load
31. Loading & Visualization
Java Batch Service
Java Mongo library to query key collections and return user counts and sum of events
DBCursor webUserLogins = c.find(
new BasicDBObject("date", sdf.format(new Date())));
private HashMap<String, Object> getSumAndCount(DBCursor cursor){
HashMap<String, Object> m = new HashMap<String, Object>();
int sum=0;
int count=0;
DBObject obj;
while(cursor.hasNext()){
obj=(DBObject)cursor.next();
count++;
sum=sum+(Integer)obj.get("count");
}
m.put("sum", sum);
m.put("count", count);
// average as a plain float; sdf is a date formatter and must not format it
m.put("average", count == 0 ? 0f : (float) sum / count);
return m;
}
32. Loading & Visualization
Java Batch Service
Use Aggregation Framework where required on core collections (e_web) and external data
//create aggregation objects
DBObject project = new BasicDBObject("$project",
new BasicDBObject("day_value", fields) );
DBObject day_value = new BasicDBObject( "day_value", "$day_value");
DBObject groupFields = new BasicDBObject( "_id", day_value);
//count the documents in each group into the field "number"
groupFields.put("number", new BasicDBObject( "$sum", 1));
//create the group
DBObject group = new BasicDBObject("$group", groupFields);
//execute
AggregationOutput output = mycollection.aggregate( project, group );
for(DBObject obj : output.results()){
.
.
}
33. Loading & Visualization
Java Batch Service
MongoDB Command Line example on aggregation over a time period, e.g. month
> db.e_web.aggregate(
[
{ $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00")}}},
{ $project : {
day_value : {"day" : { $dayOfMonth : "$created_date" },
"month":{ $month : "$created_date" }}
}},
{ $group : {
_id : {day_value:"$day_value"} ,
number : { $sum : 1 }
} },
{ $sort : { "_id.day_value" : -1 } }
]
)
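To make the result shape of this pipeline concrete, here is a plain-Python sketch of the same match, project, group and sort steps over an in-memory list. The sample events are invented for illustration, and the sort simply orders by the (day, month) pair in descending order.

```python
from collections import Counter
from datetime import datetime


def events_per_day(events, since):
    """Count events per (day, month) created after `since`."""
    counts = Counter(
        (e["created_date"].day, e["created_date"].month)  # the $project step
        for e in events
        if e["created_date"] > since                       # the $match step
    )
    # the $sort step: descending by the (day, month) pair
    ordered = sorted(counts.items(), key=lambda item: item[0], reverse=True)
    return [
        {"_id": {"day_value": {"day": day, "month": month}}, "number": number}
        for (day, month), number in ordered
    ]


events = [
    {"created_date": datetime(2012, 10, 24, 12, 0)},  # before the cutoff
    {"created_date": datetime(2012, 10, 26, 9, 0)},
    {"created_date": datetime(2012, 10, 26, 17, 0)},
    {"created_date": datetime(2012, 11, 1, 8, 0)},
]
result = events_per_day(events, datetime(2012, 10, 25))
```

Each result row has the same `_id.day_value` and `number` fields that the `$group` stage emits, one row per calendar day in the matched range.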
35. Loading & Visualization
Django and HighCharts
Extract data (pyMongo)
import pymongo

def getHomeChart(dt_from, dt_to):
    """Called by home method to get latest 30 day numbers"""
    try:
        # MongoClient replaces the legacy pymongo.Connection class
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn['lvanalytics']
        cursor = db.accountmetrics.find(
            {"date" : {"$gte" : dt_from, "$lte" : dt_to}}).sort("date")
        return buildMetricsDict(cursor)
    except Exception:
        # log the full traceback; the caller receives None on failure
        logger.exception("getHomeChart failed")
Return the graph object (as a list or a dict of lists) to the view that called the
method
pagedata={}
# dt_from and dt_to: the date range chosen by the calling view
pagedata['accountsGraph']=mongodb_home.getHomeChart(dt_from, dt_to)
return render_to_response('home.html',{'pagedata': pagedata},
context_instance=RequestContext(request))
36. Loading & Visualization
Django and HighCharts
Populate the series.. (JavaScript with Django templating)
seriesOptions[0] = {
id: 'naturalAccounts',
name: "Natural Accounts",
data: [
{% for a in pagedata.metrics.accounts_natural %}
{% if not forloop.first %}, {% endif %}
[Date.UTC({{a.0}}),{{a.1}}]
{% endfor %}
],
tooltip: {
valueDecimals: 2
}
};
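The template above passes `a.0` straight into `Date.UTC(...)`, so the view has to supply a ready-made argument string. A minimal Python sketch of that preparation step follows; the helper name and the (date, value) row shape are assumptions, since the slides do not show how the list is built.

```python
from datetime import date


def highcharts_points(rows):
    """Turn (date, value) rows into (Date.UTC argument string, value) pairs."""
    points = []
    for day, value in rows:
        # JavaScript's Date.UTC expects a zero-based month, so subtract 1.
        utc_args = "%d, %d, %d" % (day.year, day.month - 1, day.day)
        points.append((utc_args, value))
    return points


points = highcharts_points([
    (date(2013, 1, 21), 120.0),
    (date(2013, 1, 22), 135.5),
])
```

Each pair renders in the template as `[Date.UTC(2013, 0, 21),120.0]`; forgetting the month offset would shift every point forward by one month.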
37. Loading & Visualization
Django and HighCharts
And Create the Charts and Tables...
39. Lessons Learned
• Date Time managed as two fields, Datetime and Date
• Aggregating and upserting documents as events are received works for us
• Real-time Map-Reduce in pyMongo - too slow, don’t do this.
• Django-nonrel - Unstable, use Django and configure MongoDB as a datastore only
• Memcached on Django is good enough (at the moment) - use django-celery with rabbitmq to pre-cache all data after data loading
• HighCharts is buggy - considering D3 & other libraries
• Don’t need to retrieve data directly from MongoDB to Django, perhaps provide all data via a service layer (at the expense of the ever-growing feature set in pyMongo)
40. Next Steps
• A/B testing framework, experiments and variances
• Unauthenticated / Authenticated user tracking
• Provide data async over service layer
• Segmentation with graphical libraries like D3 & Cross-Filter (http://square.github.com/crossfilter/)
• Saving Query Criteria, expanding out BI tools for internal users
• MongoDB Connector, Hadoop and Hive (maybe Tableau and other tools)
• Storm / Kafka for real-time analytics processing
• Shard the Replica-Set, looking into Gizzard as the middleware
41. Thanks & Questions
Hrishi Dixit, Chief Technology Officer, hrishi@learnvest.com
Kevin Connelly, Director of Engineering, kevin@learnvest.com
Will Larche, Lead iOS Developer, will@learnvest.com
Jeremy Brennan, Director of UI/UX Technology, jeremy@learnvest.com
Cameron Sim, Director of Analytics Tech, cameron@learnvest.com
<your name here>, New Awesome Developer, you@learnvest.com
HIRED!