How we built analytics from scratch
(in six easy steps)
Jodi Moran, Co-founder & CTO
Plumbee: social casino games
• Founded October 2011
• Facebook canvas launch March 2012
• iOS beta launch December 2012
• 1.2m MAU, 250K DAU, ~40 staff
Goals
Never say “we don’t have that data”
Breadth of data use
Depth of data use
Agile data use
Scalable foundation for the future
In the beginning…
Step #1
• Blank slate
• No time
• No bandwidth
• No experience
→ Third-party analytics
Third-party analytics
• Low opportunity cost
• Full-stack solution
• Lots of choices
• Get useful data to everyone fast
Step #2
• 3rd-party systems lack flexibility
• Want to own the data
• Don't know what we want to know
• Analytics is strategic
→ Collect everything

Collect everything
What is everything?
• State-changing calls from client to server
• Changes of state
• State-changing calls from client to third parties (Facebook)

Yes, this is a lot of data: 450m events (45 GB compressed) per day. Using Amazon Web Services makes this possible.
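As a sketch of what "collect everything" can look like in practice, here is a minimal generic event envelope of the kind such a pipeline might log for every state-changing call. The field names (`event_id`, `event_type`, `user_id`, `ts`, `payload`) are illustrative assumptions, not Plumbee's actual schema.

```python
import json
import time
import uuid

def make_event(event_type, user_id, payload):
    """Wrap any state change in a generic, self-describing envelope.

    Because every event shares the same envelope, new features need no
    extra instrumentation: whatever they send or change is captured.
    """
    return {
        "event_id": str(uuid.uuid4()),   # unique id, useful for deduplication
        "event_type": event_type,        # e.g. "spin", "purchase", "gift"
        "user_id": user_id,
        "ts": int(time.time() * 1000),   # epoch milliseconds
        "payload": payload,              # the raw call arguments / state delta
    }

# One JSON object per line is easy to append to log files and to
# parse later with Hive or streaming jobs.
line = json.dumps(make_event("spin", 42, {"machine": "mirrorball", "bet": 100}))
```

Everything else (schemas, aggregation, reporting) can then be deferred until someone actually asks a question of the data.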
Why we like it
No need:
– To test instrumentation
– To add instrumentation for new features
– To touch transactional databases
– To worry we won't have the data
Easy and fast to implement
… but we still miss things.
Step #3
• Lots and lots of data
• Need access
• Data is unstructured
• No time to build structure
→ Elastic MapReduce & Hive
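Hive's appeal here is that raw JSON event lines can be queried without designing a schema up front. A hedged sketch of the kind of HiveQL this enables, using Hive's built-in `get_json_object` UDF; the table name `raw_events` (a one-STRING-column external table over the S3 logs) and the JSON field names are hypothetical:

```python
# Daily active users straight from raw JSON lines -- no predefined structure,
# just JSON-path extraction at query time. The ts field is assumed to be
# epoch milliseconds, hence the DIV 1000 before from_unixtime.
DAU_QUERY = """
SELECT
  to_date(from_unixtime(CAST(get_json_object(line, '$.ts') AS BIGINT) DIV 1000)) AS day,
  COUNT(DISTINCT get_json_object(line, '$.user_id')) AS dau
FROM raw_events
GROUP BY to_date(from_unixtime(CAST(get_json_object(line, '$.ts') AS BIGINT) DIV 1000))
"""
```

On Elastic MapReduce a query like this can be run on a transient cluster that reads directly from S3, so no always-on infrastructure is needed.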
Step #4
• Only access via SQL
• Want data use to be an everyday activity
• Still no engineering time
→ Google Spreadsheets
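One low-effort way to make query results "everyday" is to publish them as CSV that anyone can import into a Google Spreadsheet. A minimal sketch; the header and row values are made-up examples, not real Plumbee numbers:

```python
import csv
import io

def rows_to_csv(header, rows):
    """Serialize query results as CSV text, ready to import into a
    Google Spreadsheet (or any other everyday tool)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

csv_text = rows_to_csv(["day", "dau"],
                       [("2013-01-01", 240913), ("2013-01-02", 251007)])
```

The point is that once results leave Hive as plain CSV, non-engineers can chart, pivot, and share them without touching SQL.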
Step #5
• All data processing is manual
• This is getting expensive
• And it takes a long time to run
→ Automation & optimization
(Basic) optimization
• Spot instances
• Output compression with Snappy
• Python streaming jobs
• There's a lot more we could do…
Step #6
• Expensive Hive clusters
• Queries take a long time to run
• Hive functionality is limited
→ Relational data mart
Why Hive AND a traditional database?
• 15 GB of aggregates in the data mart vs. 20 TB total in Hive
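The split is easy to sketch: Hive keeps the raw terabytes, while only small pre-computed aggregates land in the relational side for fast, expressive SQL. A minimal illustration using SQLite as a stand-in for the data mart; the table, columns, and figures are assumptions for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real data mart
conn.execute("""
    CREATE TABLE daily_kpis (
        day   TEXT PRIMARY KEY,
        dau   INTEGER,
        spins INTEGER
    )
""")

# In production these rows would be the output of the nightly Hive jobs;
# the aggregates are tiny (GBs) compared with the raw event store (TBs),
# so an ordinary relational database handles them with ease.
aggregates = [("2013-01-01", 240913, 9200000),
              ("2013-01-02", 251007, 9600000)]
conn.executemany("INSERT INTO daily_kpis VALUES (?, ?, ?)", aggregates)
conn.commit()

(total_spins,) = conn.execute("SELECT SUM(spins) FROM daily_kpis").fetchone()
```

Queries against tables like this return in milliseconds and support joins, window-style reporting, and BI tools that Hive at the time could not serve interactively.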
Plumbee analytics today
Goals
Never say “we don’t have that data”
Breadth of data use
Depth of data use
Agile data use
Scalable foundation for the future
But we have tons to do.

Engineering
• Replace our custom event aggregators with Flume
• Replace pull-based Hive & Python streaming jobs with Cascading + JVM-based languages
• Change event storage from JSON to Avro
• Better dashboards and tools
• Consider in-memory processing, e.g. Spark/Shark
• Toward "big data"

Analysis
• More "actionable", less "interesting"
• Continuous optimization: split/multivariate testing, multi-armed bandit
• Better predictive models
• Clustering, segmentation, personalization
• Toward "data science"
Questions? Get in touch!
Jodi Moran
jodi@plumbee.com · @jodi_p_moran
www.facebook.com/jodipmoran · www.linkedin.com/in/jmoran
jobs.plumbee.com · www.plumbee.com · apps.facebook.com/mirrorballslots
Games Industry Analytics Forum 2 - Plumbee

Speaker notes

  1. Hi, my name is Jodi Moran. I'm the co-founder and CTO of Plumbee, and I'm going to talk about something quite different from most talks you'll hear here: how we built our analytics from scratch. Just a couple of disclaimers before I start. This is something that worked for us; I don't promise it will work for everyone. It's probably way more than six steps, I'm going to gloss over a lot of details, and your definition of easy may vary from mine.
  2. To put things in context, first a little about Plumbee. We make social casino games, which means we take gambling mechanics and put them in the free-to-play (f2p) space on Facebook and mobile. Our first game is called Mirrorball Slots: video slot machines that you play with virtual coins. You can earn virtual coins by returning daily and by exchanging gifts with your friends, but this only allows you limited playing time. If you run out and want to play for longer, you can buy virtual coins from us. There is no payout; the slot machines pay out only virtual coins. Why do people play? Purely for the entertainment of the slot machine.
  3. So what kind of business have we made out of this idea? We started in October 2011 with myself, two co-founders, and three founding employees. We launched Mirrorball Slots on Facebook in March 2012, by which time we'd grown to 15 staff. In December 2012 we launched a beta version of Mirrorball Slots on iOS; at that time we had 29 staff, 4 of whom were dedicated to analytics. Today we have 1.2 million monthly active users and 250 thousand daily active users on Facebook, and 39 staff, 5 of whom work in analytics.
  4. So what do we use analytics for as a social casino games startup? Like any startup, we didn't know at first whether we had the right product for the market, and we needed to iterate toward the right product, or "product-market fit". We apply the "Lean Startup" techniques of Eric Ries: build things incrementally, measure the results, and learn from that what to build next. This also includes verifying that the core assumptions of our business are valid, i.e. that we have a business at all. We use bidding platforms like Facebook Ads to acquire users, so we want to make sure we are paying the right price for those users. We use analytics to identify which features are the most engaging, so that we can retain our users for longer; to ensure that we give away enough coins that people keep playing, while setting our prices so that we can convert players to spending with us; and to determine which players to offer promotions, when to offer them, and what kind of offers to make.
  5. So what did we have in mind when we built our analytics function? First of all, I have a deep and abiding fear of having to tell someone that we can't find something out because we didn't collect the data. Events are transitory and can't be recreated, so you have to collect them while you have the chance. We want breadth of data use in our organization. Data shouldn't be the territory of a few; everyone in the organization should be immersed in data constantly, and using data to make decisions of all kinds should be an everyday practice, so that we become a data-driven organization. We also want depth: data should be usable in whatever way we wish, from a very simple report to a complex predictive model. As soon as we have the organizational sophistication to use data in a sophisticated way, the system should support it; the analysis we can do should be limited only by our imagination. We want our data to be agile. By this I mean it should be available in a timely fashion, while it is still useful, as appropriate for the speed at which we can or need to react to it. I also mean that we can easily adjust to changes in the incoming data: a system that expects change and deals with it easily, not something fragile or brittle that is difficult to change. As a startup, we don't know what we want yet, so change is the norm. Finally, we want a scalable foundation for the future. We need something we can use immediately, but as a startup our ambition is to scale, so we try our best to build things that can support what we need now as well as in the future.
  6. So in the beginning, this is what our analytics looked like: nothing. So what should we build, and how should we start?
  7. We were starting from scratch. We had no legacy to support, so we could choose whatever tools we thought were best. We had no time to build the most sophisticated system; we needed something that would be usable immediately. We had no bandwidth: all of our time went toward building the best possible MVP that would allow us to launch as quickly as possible. And we didn't have anyone with lots of experience building analytics, so we first needed to learn what we would want and need. Our solution was to use a third-party analytics system.
  8. A third-party analytics solution has the lowest possible opportunity cost. The engineering time required is only the time to transmit data to the third party, who provides a full-stack solution including data collection, aggregation, and user-friendly dashboards and reporting. There are lots of choices these days, many tailored to specific industries and verticals. Pricing for many of these solutions also scales with data or usage, so there is no large upfront cost that a startup couldn't afford. This is the quickest way to get useful data into the hands of your whole organization, and to start getting people to expect to see and use data for decision making.
  9. So we went from nothing at all to using a third-party system for analytics. But that's not the end of the story…
  10. Third-party systems lack flexibility: by definition they make something that works for many customers, so they cater to the lowest common denominator. At some point you will find that they cannot tell you what you want to know, and that point comes sooner than you think. Sending all your data only to a third party is also very risky, even if there are provisions to export it. We wanted to own the data ourselves and ensure that it is safely stored for the future; you'll recall I said I have a deep fear of losing data that cannot be recreated. In the beginning we didn't know what we wanted to know, nor did we have time to think about what questions we might want answered. Finally, for us analytics is a strategic advantage, and one of our goals was to build a scalable foundation for the future. Using a third-party system is clearly not that. So what did we do? Collect everything.
  11. What do I mean by everything? We collect the contents of every call from our game client to our game server that changes the persisted state for the user, and an exact copy of what is written to our transactional store. These two pieces of information allow us to (theoretically) recreate the state of our transactional store at any historical moment in time. Example: a request to spin the reels, together with the changes to the user's balance, experience, and consumable inventory that result from that spin. In addition, we also collect state-changing calls made directly from our client to third-party systems, i.e. to Facebook; an example is the sending of an invite from one user to another. This is a lot of data: we currently collect over 300 million analytics events per day (6 GB compressed). In the past it would not have been feasible for a startup to build the infrastructure to collect this amount of data, but now it couldn't be easier, and there's one simple reason for that: cloud services like Amazon Web Services.
  12. So how do we actually do it? As part of our application server code, we ensure all incoming requests and all database writes are funnelled through an instrumentation library. This library transforms the request or the database write to JSON, so that no schemas are required to interpret them. The JSON-formatted event is then sent to Amazon's Simple Queue Service (SQS), a reliable, highly scalable, hosted queue. Like all Amazon services, SQS queues are created and managed via simple API calls and do not need to be monitored or operated by us, saving many of the headaches you would expect from sending huge volumes of data through a persistent queuing system. We simply make a call to place a message in the queue and we know that we will be able to retrieve it at our convenience. The next step is to dequeue the JSON messages and aggregate them into flat files. Here we take advantage of another Amazon service, Elastic Block Store (EBS), which provides block-level network-attached storage that persists independently of the lifecycle of the virtual machines to which it is attached. In our case that means that once we have written events to a file on the EBS volume, we are highly unlikely to lose them. After enough events are aggregated into a file, we compress it and upload it to Amazon's Simple Storage Service (S3), a highly scalable and extremely reliable store for large amounts of data. This architecture has no bottlenecks, since SQS, the event aggregators, and S3 can each scale independently and indefinitely. It also makes it highly unlikely that we lose data, since each store (SQS, EBS, and S3) is extremely reliable and data is always written to the next store before being deleted from the previous one. SQS is also so highly available that it is very unlikely we lose data due to a prolonged queue outage (individual request errors are always retried).
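The instrumentation idea above can be sketched roughly as follows. This is a minimal illustration, not Plumbee's actual code: the envelope fields and the `to_event` name are assumptions, and the enqueue step is shown only as a comment.

```python
import json
import time
import uuid

def to_event(event_type, payload):
    """Wrap a state-changing request or database write in a JSON event
    envelope. Field names here are illustrative, not a real schema."""
    return json.dumps({
        "event_id": str(uuid.uuid4()),      # unique id for dedup downstream
        "event_type": event_type,           # e.g. "spin", "purchase", "invite"
        "timestamp_ms": int(time.time() * 1000),
        "payload": payload,                 # the raw request or DB write
    })

# In production the serialized event would be enqueued to SQS via the AWS
# SDK; an aggregator process would later drain the queue, append events to
# a file on an EBS volume, compress it, and upload it to S3.
event = to_event("spin", {"user_id": 42, "bet": 100, "win": 250})
```

Because every request and write funnels through one such function, there is no per-feature instrumentation to write or test.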
  13. We like this approach for a few reasons. First, there's no need to test that instrumentation has been added correctly and in the right place, since all requests and writes are forced through the instrumentation library and automatically transformed to analytics events. There's no need to add new instrumentation for new features, again since all instrumentation is automatic. There's no need to touch our transactional databases to obtain data, since we collect what amounts to a full copy of the transactional database. Finally, there's no need to worry that we are missing some vital data, since we are recording pretty much everything; this helps me sleep better at night. It is also easy and fast to implement when you are starting from scratch. Believe it or not, this is still not enough data: we originally didn't record any UI clicks (now we have some), and we also miss some client-only user actions (enabling or disabling sound, for example).
  14. So this is what our analytics system actually looked like when we launched in March 2012. The third party was providing what we actually used, but at the same time we were collecting everything "for the future".
  15. This was our next problem: we had lots and lots of data being generated, way more than we were sending to the third-party system, and we wanted access to all of it, since we knew the third-party system couldn't tell us enough. All this data was "unstructured" in the relational sense: it was JSON. But we still had no time and no bandwidth to convert it into a nice relational structure for SQL-oriented users. The solution? Elastic MapReduce (EMR) and Hive. Elastic MapReduce is Hadoop as a service, provided by Amazon Web Services.
  16. So we went from this to adding Hive clusters to process the data. This required no extra "engineering" at all: we launched an EMR cluster that included Hive via the AWS web console, and loaded a JSON SerDe into Hive for convenient access to the JSON structure with SQL (the ability to create tables over JSON). One great advantage of Hadoop + Hive is that the data needs no structure, yet you can still query it with SQL; not many tools offer this. We used Hive for two purposes. The first was the transformation of the raw log data into structured aggregate tables: we designed a set of aggregate tables, created Hive queries to fill them from the raw JSON data, and every day manually executed those queries to process the latest day's data and store the results back in S3. The second was the analysts' access to both the raw event data and the structured aggregates.
  17. I said this required no extra engineering effort. The secret to making this work was hiring the right analysts. Our analysts have technical skills: they are comfortable with the command line and know scripting languages. They are comfortable working with messy, unstructured data, and they have some understanding of data architecture, so they can envision the aggregates they want for analysis. The only way we could have done this without dedicated engineering was to have this unique combination of skills and the determination to get to the data no matter what.
  18. But this was the only way to access the data: via SQL on a Hive cluster, which wasn't very user-friendly for our CEO. We were sending around Excel sheets with data in them, but again, this really didn't provide enough visibility. One of our goals was to make data broadly accessible, available all the time, and "in people's faces" to help drive decisions. But we still didn't have a lot of engineering bandwidth available to build reporting dashboards, or the time and money to implement a high-end solution. The solution: Google Spreadsheets. It's free.
  19. This is what you can do with Google Spreadsheets. The bottom chart here is weekly active users; you can see some annotations on this graph that show significant events. The top one shows the distribution of spenders across different spender segments. To share more complex analysis, we additionally hold company-wide demos on a weekly basis, during which the analytics team presents all of the insights they have gained in the last week to the entire company. All of the slide decks shown in the demos are left in a publicly accessible location for later reference. One question that might be asked: don't you need something that allows slice and dice? Having implemented such solutions at previous workplaces, my experience is that you have two camps of users. One camp doesn't have the time or patience to dig into data, and won't use slice-and-dice (correctly) even if you give it to them. The other camp wants to get deeply into the data and understand it well, and for them the right tool is SQL. At Plumbee even our marketing staff know and use SQL.
  20. Hive can export files as CSVs in S3, and files in S3 are accessible via HTTPS. An S3 feature allows you to authorize a request by sending a signature as one of the HTTP query parameters; HTTPS plus that signature allows secure access to sensitive data. We use such a signed URL together with the Google Spreadsheets IMPORTDATA function to load the CSV report into a Google Spreadsheet, and Google Spreadsheet charts to visualize the data. This required no engineering whatsoever, and it was free! It also allowed us to remove the third-party system altogether.
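The signed-URL trick can be sketched with S3's query-string authentication scheme (signature version 2, the scheme in use at the time; current SDKs sign with SigV4 and provide presigned-URL helpers instead). The bucket, key, and credentials below are placeholders:

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote, urlencode

def signed_s3_url(bucket, key, access_key, secret_key, expires_in=3600):
    """Build a time-limited, query-string-authenticated S3 GET URL
    using the legacy AWS signature v2 scheme."""
    expires = int(time.time()) + expires_in
    # SigV2 string to sign: verb, content-md5, content-type, expiry, resource
    string_to_sign = f"GET\n\n\n{expires}\n/{bucket}/{key}"
    signature = base64.b64encode(
        hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1).digest()
    ).decode()
    params = urlencode({
        "AWSAccessKeyId": access_key,
        "Expires": str(expires),
        "Signature": signature,
    })
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}?{params}"

url = signed_s3_url("reports-bucket", "daily/kpis.csv", "AKIAEXAMPLE", "secret")
```

A spreadsheet cell would then pull the CSV with `=IMPORTDATA("https://…")`, giving a self-refreshing report with no servers to run.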
  21. If you'll recall, all data processing done so far was completely manual. Essentially, up to this point we had put no effort whatsoever into the engineering side of our analytics solution. This was expensive both in terms of the analyst's time and the number of machines we were using to process the data, and the daily loads were taking an awfully long time to run. The solution: start doing some data engineering, in the form of automation and optimization. Believe me, our analyst was happy about this.
  22. We introduced a persistent metadata store in the form of HCatalog backed by MySQL running on Amazon's Relational Database Service (RDS). Using MySQL on RDS means we don't need to manage this database (backups and machine failures, for example, are handled automatically), and using a persistent store meant there was no need to re-create the metadata each time a new Hive cluster was launched. Additionally, HCatalog allows Pig, Hive, and MapReduce to use the same metadata. We had explored using Pig previously, but without schema metadata it was too much hassle for the analysts, so adding HCatalog also enabled their use of Pig. We also introduced Amazon's Data Pipeline service to automate the execution of the Hive queries that create our aggregates. Data Pipeline is an orchestration service that manages the dependencies between resources and tasks, retries after failures and timeouts, and integrates with Amazon's Simple Notification Service (SNS) to notify about results. Every day at the specified time, our pipeline creates a suitable Elastic MapReduce cluster, performs the Hive queries to load the aggregates in the correct order, and then notifies us via email that it has either succeeded or failed.
  23. Spot Instances allow you to bid on Amazon's spare virtual machine capacity; your machines only run when your bid is high enough compared to other users' bids. Depending on the machine type and timing, it is possible to run machines at a tiny fraction of their normal price. The downside is that your machines may disappear at any time, so you need to account for this. We use spot instances for analysts' clusters, but not yet for automated processing. Initially, all of the aggregates produced by our manual Hive processing were stored uncompressed in S3, so we introduced compression of the output from Hive, both to speed up processing and to save storage cost. We chose Snappy because it is very fast and, being open-source licensed, easier to use than e.g. LZO. We also spent some time converting some of our Hive data processing jobs to streaming jobs, generally in Python. Streaming jobs are more efficient for processing that depends on the ordering of data, e.g. what was the balance on a user's 3rd spin: to do this in Hive requires multiple self-joins, and therefore multiple passes over the data, as windowing functions aren't yet supported. But before we continued down the path of more optimization, we decided to do something more impactful.
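The ordering-dependent case mentioned above (balance on a user's 3rd spin) is a natural fit for a single sorted pass. Here is a rough sketch of the reducer side of such a Hadoop streaming job, assuming tab-separated records already sorted by user (as Hadoop streaming guarantees); the field layout and function name are hypothetical, not Plumbee's actual job:

```python
from itertools import groupby

def balance_at_nth_spin(lines, n=3):
    """One pass over 'user_id<TAB>balance' records grouped by user.
    Returns the balance each user had on their n-th spin, skipping
    users with fewer than n spins. Replaces a Hive multi-self-join."""
    results = {}
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for user, group in groupby(rows, key=lambda r: r[0]):
        spins = list(group)
        if len(spins) >= n:
            results[user] = int(spins[n - 1][1])
    return results

# In a real streaming job this would be wired to stdin/stdout, e.g.:
#   for user, bal in balance_at_nth_spin(sys.stdin).items():
#       print(f"{user}\t{bal}")
```

Because the input arrives sorted, the answer falls out of one linear scan instead of the multiple passes the self-join approach needs.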
  24. Even with spot instances, the analysts doing all their work via Hive was proving very expensive, and they had to wait a long time for queries to finish. Additionally, working with Hive is not very convenient, as it supports only a subset of SQL and doesn't play very nicely with SQL clients. The solution: introduce a relational database as a data mart to store some of our aggregates.
  25. You still want access to all the data, but most of the data that gets used all the time is not that big. We use our data mart to access the 15 GB of our most heavily used aggregates, and Hive on the rest of our 20 TB of data. The RDBMS handles the "smaller", structured data and is more user-friendly; Hive handles the big, unstructured data and is less user-friendly.
  26. We initially chose Infobright as the relational database for our data mart, because (1) it is designed for OLAP (it is a columnar store), (2) it runs as an engine within MySQL, which we are familiar with since we use it in our transactional systems, and (3) the Community Edition is free. We ran Infobright on an EC2 instance backed by EBS, since it is not supported on RDS, which meant we had to manage it ourselves. It was very easy (and free) to get it up and import the data we wanted, so we used it for a few months. But ICE is very crippled: you cannot create tables, and queries run only on a single core. Just last week we finished replacing ICE with Amazon's Redshift. The advantages of Redshift: we don't have to manage it, and it isn't crippled, so queries run faster.
  27. So what kind of results do analytics bring us? This is a graph of our monthly revenue on Facebook since launch. As you can see, the trend is powerfully upward, which would not have been possible without the insights and tuning that our analytics have brought us.