Analyze Big Data for Consumer Applications with Looker BI and Amazon Redshift
Customizing the customer experience based on user behavior is a constant challenge for today’s consumer apps. Business intelligence helps analyze and model large amounts of data. Looker offers a modern approach to BI leveraging AWS that’s fast, agile, and easy to manage. Join this webinar to learn how MessageMe, which provides emotionally engaging messaging apps to consumers, leverages Looker business intelligence software and the Amazon Redshift data warehouse service to analyze billions of rows of customer data in seconds.
Webinar topics include:
• How MessageMe turns billions of rows of customer data stored in Amazon Redshift into actionable insights
• How Looker connects directly to Amazon Redshift in just a few clicks, enabling MessageMe to build modern big data analytics in the cloud
Who should attend:
• Information or Solution Architects, Data Analysts, BI Directors, DBAs, Development Leads, Developers, or Technical IT Leaders.
Presenters:
• Justin Rosenthal, CTO, MessageMe
• Keenan Rice, VP, Marketing & Alliances, Looker
• Tina Adams, Senior Product Manager, AWS
3. Webinar Overview
Submit Your Questions using the Q&A tool.
A copy of today’s presentation will be made available on:
AWS SlideShare Channel: http://www.slideshare.net/AmazonWebServices/
AWS Webinar Channel on YouTube: http://www.youtube.com/channel/UCTnPlVzJI-ccQXlxjSvJmw
5. What We’ll Cover
Overview of Amazon Redshift data warehouse
How Looker integrates with Amazon Redshift to enable big data analytics in the cloud
How MessageMe turns application metrics stored in Amazon Redshift into actionable insights with Looker BI
Q&A
6. Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Tina Adams | tinaadam@amazon.com
Senior Product Manager
7. We set out to build…
A fast and powerful, petabyte-scale data warehouse that is:
A Lot Faster
A Lot Cheaper
A Lot Simpler
Amazon Redshift
8. Data warehousing done the AWS way
Deploy
• Easy to provision
• Pay as you go, no upfront costs
• Fast, cheap, easy to use
• SQL
9. Common Customer Use Cases
Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business; provision in minutes
Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
12. Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon S3, DynamoDB, EMR/HDFS/SSH
– Kinesis integration
• Hardware optimized for data processing
• Scale while remaining online from a single node to a 100 node 1.6 PB cluster
[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion; Backup; Restore]
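The leader/compute split on this slide can be illustrated with a toy scatter-gather aggregation. This is only a sketch of the general pattern, not Redshift's actual implementation: the function names are hypothetical, and threads stand in for compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_partial(rows):
    """A compute node aggregates only its own slice of the data."""
    total = 0
    count = 0
    for value in rows:
        total += value
        count += 1
    return total, count


def leader_average(slices):
    """The leader fans the query out, then merges the partial results."""
    if not slices:
        return None
    # One worker per slice stands in for one compute node (assumption).
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(compute_node_partial, slices))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    # Three slices of one column, each held by a different "node".
    print(leader_average([[1, 2, 3], [4, 5], [6]]))
```

Each node returns a small (sum, count) pair rather than raw rows, so the leader only has to merge partial results.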
13. Amazon Redshift is priced to let you analyze all your data
                      Effective Hourly Price   Effective Hourly   Effective Annual
                      (single node)            Price per TB       Price per TB
On-Demand             $0.850                   $0.425             $3,723
1 Year Reservation    $0.500                   $0.250             $2,190
3 Year Reservation    $0.228                   $0.114             $999
Simple Pricing
• Number of Nodes x Cost per Hour
• No charge for Leader Node
• No upfront costs
• Pay as you go
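The per-TB figures follow from the single-node hourly prices. The sketch below reproduces the arithmetic, assuming 2 TB of storage per node (inferred from $0.850 / $0.425, not stated on the slide) and 8,760 hours per year. The function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

# Effective hourly price for a single node, as listed on the slide.
NODE_PRICE_PER_HOUR = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}

# Assumption: 2 TB per node, inferred from the ratio of the first two columns.
TB_PER_NODE = 2


def price_per_tb_hour(node_price):
    """Effective hourly price per TB."""
    return node_price / TB_PER_NODE


def price_per_tb_year(node_price):
    """Effective annual price per TB."""
    return price_per_tb_hour(node_price) * HOURS_PER_YEAR


def cluster_cost_per_hour(nodes, node_price):
    """Simple pricing: number of nodes x cost per hour (no leader node charge)."""
    return nodes * node_price


if __name__ == "__main__":
    for plan, price in NODE_PRICE_PER_HOUR.items():
        per_hour = price_per_tb_hour(price)
        per_year = price_per_tb_year(price)
        print(f"{plan}: ${per_hour:.3f}/TB/hour, ${per_year:,.0f}/TB/year")
```

Rounded to whole dollars, the three-year reservation works out to the "less than $1,000/TB/Year" figure quoted on the title slide.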
14. Amazon Redshift has security built-in
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM/CloudHSM
• No direct access to compute nodes
• Amazon VPC support
• SOC1/2/3, PCI level 1, and others coming soon
[Diagram labels: Customer VPC; Internal Security Group; JDBC/ODBC; 10 GigE (HPC); Ingestion; Backup; Restore]
18. The New Data Landscape
The Missed Innovation Cycle
The Next Generation
Innovative Customers
MessageMe Intro
19. New Data Landscape
[Diagram labels: Ridiculous Quantities of Event & Business Data; Business Data; ETL; New MPP Databases; Data Warehouse; Data Modeling; Limited data discovery; Data Analysts; New Breed of Data Experts; Business Users; New Curious Generation; Expect Immediate Results]
20. Missed Innovation Cycle
BI is a relic of the old (expensive) data landscape
[Diagram labels: Event & Business Application Data; New MPP databases; Traditional BI; BI Software; Cubes / Simple models; Heavy desktop apps; No direct data access; No reusability; One-time-use queries; Back to handcoding SQL; Data Analysts; New Breed of Data Experts; Business Users; New Curious Generation; Expect Immediate Results]
21. Looker: The Next Generation
Modern analytics, built for the new data landscape
[Diagram labels: Load; Query; Transform; Data Analysts; Agile Modeling; Flexible Delivery; BI Software; Web Based App; Business Users; High-Resolution Discovery; Sharing & Collaboration]
22. Looker Inside
• Near real-time access to your Redshift data
• Exploit the computing power of the AWS cloud and Redshift
• No need to re-architect or cube data
[Diagram labels: Load; Query; Transform; Data Analysts; Agile Modeling; Flexible Delivery; BI Software; Web Based App; Business Users; High-Resolution Discovery; Sharing & Collaboration]
23. Looker Intelligence
• Extend the power of your data analysts
• Fold data as complex as necessary without any database effort
• Use Git for agile team development
[Diagram labels: Copy; Query; Transform; Data Analysts; Agile Modeling; Flexible Delivery; BI Software; Web Based App; Business Users; High-Resolution Discovery; Sharing & Collaboration]
24. Looker Everywhere
• Powerful data discovery for anyone
• Share, save, and collaborate
• Access all the data, in an interactive web application
[Diagram labels: Copy; Transform; Query; Data Analysts; Agile Modeling; Flexible Delivery; BI Software; Web Based App; Business Users; High-Resolution Discovery; Sharing & Collaboration]
28. Powering Analytics @ MessageMe
1. Redshift + Looker
2. Example Looker Report & Model
3. MessageMe Data Storage
4. Analytics Strategies
5. DynamoDB → Redshift
29. Redshift + Looker
Empower your team to answer their own questions.
• What types of Stickers are sent the most?
• How do event/holiday-themed packs perform?
• Which SMS provider is most cost-effective?
Internal dashboards and Looker link-sharing are commonplace.
Looker makes the data accessible and Redshift makes it fast.
32. Data Storage: Why Redshift?
At Launch:
• DynamoDB for all application data
• MySQL for all statistics data
RDS Config (March 2013)
• Master: db.m1.xlarge (15GB)
• Slave: db.m1.xlarge (15GB)
RDS Config (April 2013)
• Master: db.m1.xlarge (15GB)
• Slave: db.m2.4xlarge (68GB)
90% of writes were via LOAD DATA INFILE, so write IOPS were not a problem.
However, index sizes were growing quickly…
33. Data Storage: Why Redshift?
MySQL Status (April 2013)
              event             message
Engine        InnoDB            InnoDB
Index Width   48 Bytes / Row    32 Bytes / Row
Row Count     ~3 Billion        ~2 Billion
Index Size    144 GB            64 GB
Slave: db.m2.4xlarge (68GB)
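The index sizes on this slide are index width times row count. A quick check of the arithmetic, using the slide's approximate row counts and decimal gigabytes, shows how far the combined indexes exceed the slave's 68 GB of memory. Variable and function names here are illustrative.

```python
GB = 10**9  # decimal gigabytes, matching the slide's round numbers

# Figures from the slide; row counts are approximate.
TABLES = {
    "event": {"index_bytes_per_row": 48, "rows": 3_000_000_000},
    "message": {"index_bytes_per_row": 32, "rows": 2_000_000_000},
}

SLAVE_RAM_GB = 68  # db.m2.4xlarge


def index_size_gb(index_bytes_per_row, rows):
    """Index size in GB = index width per row x number of rows."""
    return index_bytes_per_row * rows / GB


total_index_gb = sum(index_size_gb(**stats) for stats in TABLES.values())

if __name__ == "__main__":
    for name, stats in TABLES.items():
        print(f"{name}: {index_size_gb(**stats):.0f} GB of index")
    ratio = total_index_gb / SLAVE_RAM_GB
    print(f"total: {total_index_gb:.0f} GB, about {ratio:.1f}x the slave's RAM")
```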
34. Data Storage: Why Redshift?
We could put data in, but we couldn’t get it back out!
Possible Solutions
1. Summarize
• PRO: Data compression
• CON: Data loss
2. Shard
• PRO: No data loss
• CON: Difficult to query
3. Redshift?
35. Data Storage: Current System
Redshift (90%)
• Append-only tables
• Delayed, bulk inserts OK
Examples:
• `event`
• `message`
• `user_demographic`
MySQL (10%)
• Inline inserts
• Non-negotiable uniqueness requirements (ON DUPLICATE KEY UPDATE)
Examples:
• `purchase`
• `user_demographic`
36. Analytics Strategies w/ Billions of Rows
Deep-dive queries w/ row-level specifics
• You get this out-of-the-box with Redshift
vs.
Super fast top-line metrics, aggregates
• How do we get these, really fast?
1. Summarization
2. Cached Derived Tables
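As a rough illustration of summarization, the sketch below rolls row-level events up into per-day counts. The event names and the (day, event_type, user_id) row shape are hypothetical, not MessageMe's schema.

```python
from collections import Counter


def summarize_daily(events):
    """Collapse (day, event_type, user_id) rows into per-day, per-type counts.

    The summary is much smaller than the raw rows, but the user_id detail
    is gone: this is the compression vs. data loss trade-off.
    """
    counts = Counter((day, event_type) for day, event_type, _user_id in events)
    return dict(counts)


if __name__ == "__main__":
    raw_events = [
        ("2013-04-01", "sticker_sent", 1),
        ("2013-04-01", "sticker_sent", 2),
        ("2013-04-01", "sms_sent", 1),
        ("2013-04-02", "sticker_sent", 1),
    ]
    for key, n in sorted(summarize_daily(raw_events).items()):
        print(key, n)
```

Top-line metrics can then be read from the small summary instead of scanning billions of raw rows.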
38. Analytics Strategies: Cached Derived Tables
Some important queries will be complex and demand row-specific data.
Summarizing is not an option. What to do?
…build Cached Derived Tables
• Turn long-running, complex queries into flat tables
39. Analytics Strategies: Cached Derived Tables
Example: Retention by Cohort
SELECT
  …
INTO TABLE `sm_retention_day`
FROM (
  SELECT
    …
  FROM `user`
  JOIN `message`
  JOIN `user_source`
), (
  SELECT
    …
  FROM `user`
  JOIN `user_source`
)
Columns of `sm_retention_day`: `join_day`, `nday`, `country`, `os_family`, `os_version`, `traffic_source`, `active_users`, `signups`
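A minimal, runnable sketch of the same idea follows. It uses SQLite in place of Redshift, so `CREATE TABLE ... AS SELECT` stands in for `SELECT ... INTO`. The input columns (`id`, `join_day`, `user_id`, `sent_day`) are assumptions, and the slide's country, OS, and traffic-source dimensions are left out to keep it short.

```python
import sqlite3

# Rebuild the flat table from the expensive joins; dashboards then read the
# flat table only. Column names on "user" and message are hypothetical.
BUILD_RETENTION_SQL = """
CREATE TABLE sm_retention_day AS
SELECT a.join_day AS join_day,
       a.nday AS nday,
       a.active_users AS active_users,
       s.signups AS signups
FROM (
    SELECT u.join_day AS join_day,
           CAST(julianday(m.sent_day) - julianday(u.join_day) AS INTEGER) AS nday,
           COUNT(DISTINCT u.id) AS active_users
    FROM "user" u
    JOIN message m ON m.user_id = u.id
    GROUP BY u.join_day, nday
) a
JOIN (
    SELECT join_day, COUNT(*) AS signups
    FROM "user"
    GROUP BY join_day
) s ON s.join_day = a.join_day
"""


def rebuild_retention_table(conn):
    """Drop and recreate the cached derived table (run on a schedule)."""
    conn.execute("DROP TABLE IF EXISTS sm_retention_day")
    conn.execute(BUILD_RETENTION_SQL)


def read_retention(conn):
    """Cheap read against the flat table; no joins at query time."""
    return conn.execute(
        "SELECT join_day, nday, active_users, signups "
        "FROM sm_retention_day ORDER BY join_day, nday"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "user" (id INTEGER, join_day TEXT)')
    conn.execute("CREATE TABLE message (user_id INTEGER, sent_day TEXT)")
    conn.executemany(
        'INSERT INTO "user" VALUES (?, ?)',
        [(1, "2014-01-01"), (2, "2014-01-01"), (3, "2014-01-02")],
    )
    conn.executemany(
        "INSERT INTO message VALUES (?, ?)",
        [(1, "2014-01-01"), (1, "2014-01-02"), (1, "2014-01-02"),
         (2, "2014-01-01"), (3, "2014-01-03")],
    )
    rebuild_retention_table(conn)
    return read_retention(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Each output row is (join_day, nday, active_users, signups), so retention for a cohort is active_users / signups at a given nday.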
40. DynamoDB → Redshift
• Stats tables are homogeneous and compact
• Application data can be heterogeneous and heavy
– Mixture of numbers, strings, binary, etc.
How many users signed up this week with a .edu email address?
COPY dynamodb://user
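A hedged sketch of how such a question might be answered once the DynamoDB data is in Redshift. The COPY text follows Redshift's documented `COPY ... FROM 'dynamodb://...'` form with a placeholder credentials string. The `email` and `created_at` columns are assumptions, and the count query runs against SQLite here purely so the example is self-contained.

```python
import sqlite3


def build_copy_statement(redshift_table, dynamodb_table, read_ratio=50):
    """Return the text of a Redshift COPY from a DynamoDB table.

    '<aws-credentials>' is a placeholder. READRATIO caps the percentage of
    the DynamoDB table's provisioned read throughput the load may consume.
    """
    return (
        f"COPY \"{redshift_table}\" FROM 'dynamodb://{dynamodb_table}' "
        f"CREDENTIALS '<aws-credentials>' READRATIO {read_ratio};"
    )


# Hypothetical columns: email, created_at (ISO date string).
# '?' is SQLite's parameter marker; a Redshift driver may use another style.
EDU_SIGNUPS_SQL = """
    SELECT COUNT(*) FROM "user"
    WHERE email LIKE '%.edu' AND created_at >= ?
"""


def count_edu_signups(conn, week_start):
    """Count users who signed up on or after week_start with a .edu address."""
    return conn.execute(EDU_SIGNUPS_SQL, (week_start,)).fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "user" (email TEXT, created_at TEXT)')
    conn.executemany(
        'INSERT INTO "user" VALUES (?, ?)',
        [
            ("a@mit.edu", "2014-02-03"),
            ("b@gmail.com", "2014-02-04"),
            ("c@ucla.edu", "2014-01-20"),       # .edu, but before this week
            ("d@education.com", "2014-02-05"),  # does not end in .edu
        ],
    )
    return count_edu_signups(conn, "2014-02-03")


if __name__ == "__main__":
    print(build_copy_statement("user", "user"))
    print(demo())
```

COPY from DynamoDB matches attributes to Redshift columns by name, so only the attributes that have a matching column in the target table are loaded.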