Presentation from Tableau Customer Conference 2013 on building a real time reporting/analytics platform. Topics discussed include definitions of big data and real time, technology choices and rationale, use cases for real time big data, architecture, and pitfalls to avoid.
1. Billions of Rows, Millions of Insights
Right Now
Developing a Landscape for Real Time Information
2. Who is Spil Games?
• 180 million monthly and 12 million daily players
• More than one billion gameplays monthly
• Active in every country of the world (even Vatican City!)
One of the largest casual gaming companies on the planet
Local EVERYWHERE
Titles
We are a platform first, but also a publisher and a developer
3. The paradigm is shifting
The traditional warehouse:
• Highly consistent
• Highly connectable
• Inflexible
• Slow
The Data Lake:
• Flexible
• Fast
• Going to get wet
You always need both.
Traditionally, we define data based on what we expect; with streaming data, we capture first and define later.
4. Defining BIG Data
The Four Vs
Velocity
Variety
Volume
Veracity
Small Data = BIG Data?
Real Time
ETL, Events, Excel
Drinking from the Firehose
Heuristics
VALUE: The Only V that Matters
5. Defining the VELOCITY and VARIETY
Traditional ETL:
• Once a day
• Once a week
• Delayed
“Real Time”:
• Faster than human perception
• <200 milliseconds
“In Time”: information is available fast enough to influence decisions
• While in the shop/on the site (minutes)
• While the query runs (seconds)
• While the page loads (milliseconds)
The Velocity Continuum
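The “In Time” buckets above can be read as latency tiers. A small illustrative sketch in Python; the numeric thresholds are assumptions drawn from the slide’s examples (<200 ms for page loads, seconds for queries, minutes for a site visit), not fixed definitions:

```python
# Classify a data latency by which decision it can still influence.
# Thresholds are illustrative assumptions, not definitions from the talk.

def in_time_bucket(latency_ms: float) -> str:
    if latency_ms < 200:                 # faster than human perception
        return "while the page loads"
    if latency_ms < 60_000:              # up to about a minute
        return "while the query runs"
    if latency_ms < 30 * 60_000:         # within a session on the site
        return "while in the shop / on the site"
    return "traditional ETL territory"

print(in_time_bucket(150))
print(in_time_bucket(5_000))
print(in_time_bucket(600_000))
```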
6. Deriving the VALUE at Spil
Informing Decisions:
• Day-to-day business reporting
• Analytical reporting for self-service analysis
• Business analytics for advising decisions
• Descriptive models to explain our business
• Customer Lifetime Value
• Marketing ROI
Making Decisions:
• Customer content recommendations
• Email campaign targeting
• Site learning and optimization
• System monitoring and alerting
7. Why Real Time Reporting Matters
Value of Reporting
Real time reporting is a paradigm-shifting component of our cloud-based big data strategy!
I need to see everything happening RIGHT NOW
System Monitoring
Product Changes
In-Time Customer Support
8. Real Time Systems Requirements
| Requirement | Rationale | Our Experience |
| --- | --- | --- |
| Scalable with fast loads | Must handle intraday variable load | Load swings up to 300% during the day |
| Fast join performance | Synthesizing traditional ETL data and real time events on the fly | Denormalization is great but volume-expensive; 3NF is BAD |
| Resilient | Real time means as few buffers as possible | Tableau extracts can slow the process too much |
| Good query optimizer | Minor inefficiencies translate to expensive performance hits | The best MySQL engine is still too slow for BIG data aggregation |
| Concurrent loading and querying | No offline processing for real time data | ETLs running at the same time as queries up to 20% of the time |
Solution: C-Store Databases
9. C-Stores and Fast Dashboards
• C-Stores persist each column independently and
allow column compression
• Queries retrieve data only from needed columns
Example: 7 billion rows, 25 columns, 10 bytes/column = 1.6 TB table
Query: SELECT A, SUM(D) FROM table WHERE C >= X GROUP BY A;
Row store: 1.6 TB of data scanned
Column store (30% compression): <195 GB of data scanned
The result: dashboards can run directly on large tables
Dashboard on a 7-billion-row table with two joins: <20 seconds to refresh
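The arithmetic behind the slide’s example can be checked with a quick sketch. One assumption is flagged in the code: “30% compression” is read here as a 30% size reduction on the scanned columns.

```python
# Back-of-the-envelope scan sizes for the slide's example:
# 7 billion rows, 25 columns, 10 bytes/column, and a query
# (SELECT A, SUM(D) ... WHERE C >= X) touching 3 of the 25 columns.
# Assumption: "30% compression" means a 30% reduction in size.

ROWS = 7_000_000_000
COLUMNS = 25
BYTES_PER_COLUMN = 10
TOUCHED_COLUMNS = 3          # A, C, and D
COMPRESSION_SAVING = 0.30

def tib(n_bytes: float) -> float:
    return n_bytes / 2**40

def gib(n_bytes: float) -> float:
    return n_bytes / 2**30

row_store_scan = ROWS * COLUMNS * BYTES_PER_COLUMN          # whole rows read
column_store_scan = ROWS * TOUCHED_COLUMNS * BYTES_PER_COLUMN
column_store_compressed = column_store_scan * (1 - COMPRESSION_SAVING)

print(f"row store scan:    {tib(row_store_scan):.2f} TiB")
print(f"column store scan: {gib(column_store_scan):.0f} GiB raw, "
      f"{gib(column_store_compressed):.0f} GiB compressed")
```

A row store reads all 25 columns (~1.6 TiB); the column store reads only the 3 referenced columns, and compression brings that under the slide’s <195 GB figure.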
11. How much data do we handle?
Ingestion:
• Through Map/Reduce: 1.2 billion events/day (150 million rows/day into DWH)
• Through ETL: 100-200 million rows/day into DWH
Persistence:
• Map/Reduce: 20 billion rows
• Vertica: 45 billion rows
• Long-term storage: all of 2013’s events
Usage:
• Predictive models: >500 million scores per day
• ETLs to production DBs: >10 models
• Reporting: 150 dashboards, 80 data sources
• Queries: >2,000 per day
12. Data Flow for Event Data
JSON event data is generated by the client:

{ "token": "BAEDIDtxmZoAWAEA",
  "sessionId": 1358331540132,
  "visitorId": 515876866411417,
  "pageInSession": 3,
  "environment": "stg",
  "eventList": [{
    "eventCategory": "displayAds",
    "eventAction": "fetch",
    "eventLabel": "Miniclip,leaderboard,160x60,SE,2.9",
    "eventValue": 1,  // the depth in the daisy chaining
    "pageInSession": 2,
    "timing": 1730
  }]
}

Data is structured in Map/Reduce and put into flat files:

| Visitor | Session | Page | Timing | Type | Action | Source | Value |
| 123 | 456 | 3 | 1730 | DisplayAd | Fetch | Miniclip | 2 |
Data is loaded into Vertica for Reporting + Analysis
Tableau queries directly from fact tables
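The flattening step in this flow can be sketched in Python. This is a hypothetical illustration, not Spil’s actual Map/Reduce job: field names come from the example payload, and the mapping to the table’s Type/Source/Value columns is an assumption (e.g. taking Source as the first field of eventLabel).

```python
import json

# Hypothetical sketch of the "structure in Map/Reduce" step: flatten one
# raw JSON event into one flat row per eventList entry, roughly matching
# the slide's columns (Visitor, Session, Page, Timing, Type, Action,
# Source, Value). Field names follow the slide's example payload.

RAW = '''{"token": "BAEDIDtxmZoAWAEA",
          "sessionId": 1358331540132,
          "visitorId": 515876866411417,
          "pageInSession": 3,
          "environment": "stg",
          "eventList": [{"eventCategory": "displayAds",
                         "eventAction": "fetch",
                         "eventLabel": "Miniclip,leaderboard,160x60,SE,2.9",
                         "eventValue": 1,
                         "pageInSession": 2,
                         "timing": 1730}]}'''

def flatten(event: dict) -> list[tuple]:
    """One flat, tab-friendly row per sub-event, ready for bulk load."""
    rows = []
    for e in event["eventList"]:
        source = e["eventLabel"].split(",")[0]   # e.g. "Miniclip" (assumption)
        rows.append((event["visitorId"], event["sessionId"],
                     e.get("pageInSession", event["pageInSession"]),
                     e["timing"], e["eventCategory"], e["eventAction"],
                     source, e["eventValue"]))
    return rows

for row in flatten(json.loads(RAW)):
    print("\t".join(str(v) for v in row))
```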
13. Why we chose our tech
• Affordable
• Highly available and resilient
• Extremely fast development due to SQL
• Excellent query performance = lazy optimization
• Right price
• Easy (and fun!) development
• Excellent library availability
• Industry standard for Map/Reduce
• Cheap storage of “data lake”
• Easy integration with existing tech
14. What we’ve learned along the way
• Denormalize like crazy (or cheat with pre-join projections)
• Map/Reduce doesn’t like “real” time (try Storm)
• Network is the first limit you hit
• Let Tableau write the SQL, but optimize the projections
• Tableau’s caching is inflexible; scripting can solve it (kind of)
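The denormalization lesson above can be sketched: instead of joining a dimension at query time, copy its attributes into each fact row at load time so reporting needs no join. A minimal Python illustration; the table and field names are hypothetical:

```python
# Minimal illustration of denormalizing at load time: attach a game
# dimension's attributes to each event row before it hits the warehouse,
# so reporting queries need no join. All names here are hypothetical.

game_dim = {
    101: {"game_title": "Bubble Shooter", "category": "puzzle"},
    102: {"game_title": "Uphill Rush",    "category": "racing"},
}

raw_events = [
    {"visitor": 123, "game_id": 101, "timing": 1730},
    {"visitor": 456, "game_id": 102, "timing": 910},
]

def denormalize(events, dim):
    """Pre-join: copy dimension columns into every fact row at load time."""
    out = []
    for e in events:
        row = dict(e)
        row.update(dim.get(e["game_id"],
                           {"game_title": "unknown", "category": "unknown"}))
        out.append(row)
    return out

for row in denormalize(raw_events, game_dim):
    print(row)
```

The trade-off is the one the slide names: the fact table grows (volume-expensive), but query-time joins disappear; pre-join projections achieve a similar effect inside the database.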