Scale Apps and Automate Data with AWS

Scaling your Application for Growth using Automation
November 14, 2013
Ken Leung - Euclid Analytics
Greg Narain - Chute

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Online Analytics for the Offline World
E-Commerce

Physical Stores

How Euclid Works
We use Wi-Fi technology to turn in-store behavior into actionable insights
• Shopper carrying smartphone walks by or into store
• Wi-Fi AP detects smartphone MAC addresses (XX:XX:XX:XX:XX:XX)
• Euclid analyzes data for trends and insights
• Insights on customer acquisition, engagement and retention
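
The flow above can be sketched in a few lines. This is a minimal illustration, assuming each detection arrives as a sensor ID plus a MAC address; the `DeviceCounter` class and its method names are hypothetical and are not taken from Euclid's code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: count the distinct devices seen by each sensor.
class DeviceCounter {
    // Maps sensor ID -> set of MAC addresses seen by that sensor.
    private final Map<String, Set<String>> seen = new HashMap<>();

    // Record one detection; repeated sightings of the same device count once.
    void record(String sensorId, String mac) {
        seen.computeIfAbsent(sensorId, k -> new HashSet<>())
            .add(mac.toLowerCase(Locale.ROOT));
    }

    // Number of unique devices a sensor has detected so far.
    int uniqueDevices(String sensorId) {
        Set<String> macs = seen.get(sensorId);
        return macs == null ? 0 : macs.size();
    }
}
```

MAC addresses are lower-cased before storage so that the same device reported in different letter case is not counted twice.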

Market Leader in Real World Analytics
• First to develop proprietary Wi-Fi based analytics
– Most advanced data analytics capabilities and experience in retail environments
– Backed by tier 1 investors: Series A led by NEA, Series B led by Benchmark Capital
• World-class executive team
– Co-founder of Google Analytics, founding team of ShopperTrak
– Executive experience from Google, SAP, Ariba and Tibco
• Experience with the world’s leading retailers
– Specialty retail, QSR, department store, big box, automotive, malls and more
• Largest data scale and rapidly accelerating adoption
– Recording >5B events per day
– Dataset with >100M unique devices (shoppers)
– Gartner Cool Vendor 2012; Idea Innovation Award Winner: Business Technology 2012
• Market leadership recognized by:

Euclid is a Data Company
As of October 2013, the Euclid Network:
• Covers over 600 shopping centers, malls, and street locations
• Processes 50 TB of raw data
• Collects over 30 GB of raw data daily

Acquire Data
• Reliable
• Durable
• Scalable

Process Data
• Efficient
• Flexible
• Scalable
• Versatile

Deliver Data
• Richness
• Sophistication
• Value

Euclid’s Challenges
Common Challenges
• Scaling
• Performance
• Cost effectiveness
• Removing the technical barriers for innovation
• “Failing fast”

Unique Challenges
• Recomputing the entire history of Euclid data!
– Need fast results
– Need a lot of computational power, sometimes more than 100x regular daily compute needs

Euclid’s Use of AWS
Euclid started with AWS from Day One
- Amazon EC2, Amazon RDS, Amazon EMR, Amazon S3
- AWS Elastic Beanstalk
- Amazon Redshift
Heroku from Amazon Partner Network (APN)

Data Acquisition
Elastic Beanstalk
- Multi-AZ, multi-region
- Load balancing, auto scaling
- Monitoring, notification
- Deployment Management
- Amazon EBS-backed volume for failover data recovery
- Log rotation to Amazon S3 (99.999999999% durability)
All built-in.

Data Acquisition - code
<%@ page import="java.io.*,java.util.*,com.euclid.spongebob.server.*" %><%
    // Sensor credentials are read from the servlet context.
    Properties sensorCredentials =
        (Properties) this.getServletContext().getAttribute("sensor_credentials");
    String sensor_id = request.getParameter("sensor_id");
    String credential = request.getParameter("credential");
    String body = request.getParameter("body");

    // Reject requests from unknown sensors or with a wrong credential.
    if (sensor_id == null || !sensorCredentials.containsKey(sensor_id) ||
            !sensorCredentials.getProperty(sensor_id).equals(credential)) {
        response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
        return;
    }

    // Log the payload as-is; no processing happens in the request path.
    java.util.logging.Logger logger = java.util.logging.Logger.getLogger("spongebob");
    logger.log(java.util.logging.Level.INFO, body);
    response.setStatus(HttpServletResponse.SC_OK);
%>

Data Acquisition - Principles
• Log to Amazon EBS volume – high I/O performance
• As “dumb” as possible: reliable
• Fork data from disk to
– Amazon S3 for batch processing
– Kafka messaging service for real-time processing
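
The "fork from disk" principle can be illustrated as a simple fan-out. This is a sketch only: `LogSink`, `MemorySink` and `LogForker` are hypothetical names, and the in-memory sinks stand in for a real Amazon S3 uploader and Kafka producer so that the example runs without either service. Continuing past a failed sink is also an assumption, based on the records remaining on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical destination for log records (e.g. an S3 uploader or a Kafka producer).
interface LogSink {
    void accept(String record);
}

// In-memory stand-in for a real sink, so the sketch runs without AWS or Kafka.
class MemorySink implements LogSink {
    final List<String> records = new ArrayList<>();

    @Override
    public void accept(String record) {
        records.add(record);
    }
}

// Forwards every record read from local disk to all configured sinks.
class LogForker {
    private final List<LogSink> sinks;

    LogForker(LogSink... sinks) {
        this.sinks = Arrays.asList(sinks);
    }

    // Delivers each line to each sink and returns the number of failed deliveries.
    // Assumption: a failing sink should not block the others, because the
    // record is still on disk and could be re-sent.
    int fork(List<String> lines) {
        int failures = 0;
        for (String line : lines) {
            for (LogSink sink : sinks) {
                try {
                    sink.accept(line);
                } catch (RuntimeException e) {
                    failures++;
                }
            }
        }
        return failures;
    }
}
```

Keeping the collector itself limited to writing to disk, and doing the fan-out as a separate step, matches the "as dumb as possible" principle: the request path stays trivial while batch and real-time consumers are fed independently.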

Data Acquisition – System Monitor
• Low latency
• Low CPU utilization

Data Processing - Pipeline
[Diagram: Raw Data → MapReduce (EMR) → Product (dashboard, insights) and R&D (analytics)]

Pipeline – Dual Purposes
Two worlds, one platform
• Big Data Engineering – NoSQL
– Pig Latin with Amazon EMR (Java, Python UDFs)
– Workflows (Jenkins), shell scripting

• Analytics, Analysts, Business – SQL
– Excel
– Tableau
– Maybe some Python, etc.

Pipeline - Architecture
[Architecture diagram, not reproduced. Two stores: Amazon S3 and a SQL DB (MySQL, Redshift). Labels: Raw Data, Meta Data, 3rd Party Data, Aggr. Level 1 through Aggr. Level n, Models/Algorithms, MapReduce, Direct DB Load, Some Raw Data, SQL, MySQL. Outputs: Product (dashboard, insights), Analytics, R&D.]

SQL: MySQL, Amazon Redshift, both by AWS
• Started with MySQL; Amazon Redshift preview in Jan 2013
• MySQL 1 TB limit vs. Amazon Redshift PB scale
• Performance: night and day
– E.g., count distinct over 100M rows: 5 hours in MySQL, 2 minutes in Amazon Redshift (about 150x faster)

• Amazon Redshift: killer data warehouse
– Low cost
– No DBA!
– Easy integration

Pipeline - Monitoring
• System monitoring provided by AWS
• Workflow monitoring with Jenkins
– Failure notification
– Dependency management

• Data quality (including acquisition) monitoring
– Also utilize Jenkins
– Scripts that check data at various stages
– Each script as a job in the Jenkins workflow
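
A data-quality check of the kind described above might look like the following. This is a sketch under stated assumptions: the `StageCheck` class, the stage name, and the row-count and null-rate thresholds are all hypothetical, and the talk does not say which language the real scripts used. The only Jenkins behavior relied on is that a build step that exits non-zero fails the job.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical data-quality check for one pipeline stage.
class StageCheck {
    // Returns a list of problems; an empty list means the stage passed.
    static List<String> check(String stage, long rowCount, long nullCount,
                              long minRows, double maxNullRate) {
        List<String> problems = new ArrayList<>();
        if (rowCount < minRows) {
            problems.add(stage + ": only " + rowCount
                + " rows, expected at least " + minRows);
        }
        // Avoid dividing by zero when the stage produced no rows.
        double nullRate = rowCount == 0 ? 0.0 : (double) nullCount / rowCount;
        if (nullRate > maxNullRate) {
            problems.add(stage + ": null rate " + nullRate
                + " exceeds " + maxNullRate);
        }
        return problems;
    }

    // Exit code for the process: non-zero makes the Jenkins build step fail.
    static int exitCode(List<String> problems) {
        return problems.isEmpty() ? 0 : 1;
    }
}
```

A wrapper `main` would print the problems and call `System.exit(StageCheck.exitCode(problems))`, so a failed check stops downstream jobs in the dependency graph and triggers the failure notification.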

Pipeline - Workflow
Part of the Jenkins Dependency Graph

AWS Benefits
• “Apps not Ops” – Euclid does not have/need an Ops team
• Scale up and down on demand
• Pay as we go
• Agile (innovations, time-to-market)

Chute
1. Data
2. Automation
3. Uptime
4. Monitoring

Data
● Real time analytics is hard
● Hadoop!
○ Sqoop imports SQL data to HDFS
○ Clojure
○ Scalding (github.com/twitter/scalding)
● Elasticsearch, Logstash
○ parse logs to track activity for customers
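
The idea of parsing logs to track per-customer activity can be sketched as follows. In Chute's stack this parsing is done by Logstash; the Java below only illustrates the concept. The `ActivityCounter` class and the `customer=<id> action=<name>` line format are assumptions made for this example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: count log events per customer.
class ActivityCounter {
    // Assumed line format: "... customer=<id> action=<name> ..."
    private static final Pattern LINE =
        Pattern.compile("customer=(\\w+)\\s+action=(\\w+)");

    // Returns customer ID -> number of matching log lines.
    // Lines that do not match the assumed format are skipped.
    static Map<String, Integer> countByCustomer(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```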

[Architecture diagram, not reproduced. Components: Hadoop cluster or EMR; sharded Postgres; Sqoop; server; HDFS; S3; N EC2 instances (varnish, logstash); plugin front ends; Kibana; ELB; Redis cluster; Elasticsearch; events server (nginx, logstash); API.]

Automation through DevOps
● Chute has 100 servers
○ Configured many manually
○ 82? of 100 now managed by Chef
● Whirr
● Sqoop and cron to automate data import
● Route 53 with Chef for URLs

Uptime
● Architect applications to scale horizontally
○ AWS launches servers on demand
○ Spot and Reserved pricing
● Keep services running with Chef
○ Chef makes it easy to wrap programs as a service on AWS

Monitoring
● New Relic
○ server resource monitoring
○ application monitoring
● Logstash + Kibana
○ Elasticsearch backend
○ Redis (cluster)
○ can monitor server logs

Please give us your feedback on this presentation

CPN209
As a thank you, we will select prize winners daily for completed surveys!
