Since Amazon Redshift launched last year, it has been adopted by a wide variety of companies for data warehousing. In this session, learn how customers NASDAQ, HauteLook, and Roundarch Isobar are taking advantage of Amazon Redshift for three unique use cases: enterprise, big data, and SaaS. Learn about their implementations and how they made data analysis faster, cheaper, and easier with Amazon Redshift.
3. Amazon Redshift architecture
• Leader Node
  – JDBC/ODBC SQL endpoint
  – Stores metadata
  – Coordinates query execution
• Compute Nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Load, backup, restore via Amazon S3; parallel load from Amazon DynamoDB
  – Connected over 10 GigE (HPC)
• Single-node version available
• Ingestion, backup, and restore all flow through Amazon S3
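The leader/compute split above can be illustrated with a toy sketch: the leader fans a query out, each compute node aggregates its local slice in parallel, and the leader merges the partial results. This is illustrative Python only; names like `compute_partial` are invented, not Redshift internals.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model of the leader/compute split (illustrative only).
# Each "compute node" holds a local slice of the table; the leader
# coordinates execution and merges partial aggregates.

slices = [
    [("US", 10), ("EU", 5)],   # node 1's local rows: (region, sales)
    [("US", 7), ("APAC", 3)],  # node 2's local rows
]

def compute_partial(rows):
    """Runs on one compute node: aggregate its local slice."""
    out = {}
    for region, sales in rows:
        out[region] = out.get(region, 0) + sales
    return out

def leader_query(slices):
    """Leader: fan the query out in parallel, then merge partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(compute_partial, slices))
    merged = {}
    for p in partials:
        for region, total in p.items():
            merged[region] = merged.get(region, 0) + total
    return merged

print(leader_query(slices))  # {'US': 17, 'EU': 5, 'APAC': 3}
```

The same shape explains why loads and backups scale: each compute node talks to Amazon S3 for its own slice independently.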
4. Amazon Redshift is priced to let you analyze all your data

                        Price per hour         Effective hourly   Effective annual
                        (HS1.XL single node)   price per TB       price per TB
On-Demand               $0.850                 $0.425             $3,723
1-Year Reservation      $0.500                 $0.250             $2,190
3-Year Reservation      $0.228                 $0.114             $999

Simple pricing:
• Number of nodes x cost per hour
• No charge for leader node
• No upfront costs
• Pay as you go
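The "effective price per TB" columns follow from the HS1.XL node's local storage: divide the hourly node price by the node's capacity, then multiply by 8,760 hours per year. A quick sketch (the 2 TB per-node capacity is the HS1.XL spec assumed here):

```python
# Derive the effective per-TB prices from the per-node hourly rates.
# Assumes the HS1.XL node provides 2 TB of local storage.
HOURS_PER_YEAR = 24 * 365  # 8,760
NODE_STORAGE_TB = 2

for label, hourly in [("On-Demand", 0.850),
                      ("1-Year Reservation", 0.500),
                      ("3-Year Reservation", 0.228)]:
    per_tb_hour = hourly / NODE_STORAGE_TB
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    print(f"{label}: ${per_tb_hour:.3f}/TB/hour, ${per_tb_year:,.0f}/TB/year")
```

Running this reproduces the table: $0.425/$3,723, $0.250/$2,190, and $0.114/~$999.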
6. Where innovation meets action
• Our technology is used to power more than 70 marketplaces in 50 countries
• We own and operate 26 markets, 3 clearinghouses, and 5 central securities depositories
• Our global platform can handle more than 1 million messages/second at a median speed of sub-55 microseconds
• We power 1 in 10 of the world's securities transactions
• More than 5,500 structured products are tied to our global indexes, with a notional value of at least $1 trillion
• We list ~3,300 global companies worth $6 trillion in market cap, representing diverse industries and many of the world's most well-known and innovative brands
7. What I do
New data and analytics platforms to store and serve data to internal and external customers.
8. The Challenge
• Archiving market data
  – A classic "Big Data" problem
• Power surveillance and business intelligence/analytics
• Minimize cost
  – Not only infrastructure, but development/IT labor costs too
• Empower the business for self-service
9. SIP Total Monthly Message Volumes (OPRA, UQDF, and CQS)
Market data is big data. Charts courtesy of the Financial Information Forum
(redistribution without permission from FIF prohibited, email: fifinfo@fif.com).

Total monthly message volume:

Date     OPRA             CQS             UQDF           Avg daily (OPRA)  Avg daily (UQDF+CQS combined)
Aug-12   80,600,107,361   8,241,554,280   2,317,804,321  3,504,352,494     459,102,548
Sep-12   77,303,404,427   7,452,279,225   1,948,330,199  4,068,600,233     494,768,917
Oct-12   98,407,788,187   7,452,279,225   1,016,336,632  4,686,085,152     403,267,422
Nov-12   104,739,265,089  9,552,313,807   2,148,867,295  4,987,584,052     557,199,100
Dec-12   81,363,853,339   8,052,399,165   2,017,355,401  4,068,192,667     503,487,728
Jan-13   82,227,243,377   7,474,101,082   2,099,233,536  3,915,583,018     455,873,077
Feb-13   87,207,025,489   7,531,093,813   1,969,123,978  4,589,843,447     500,011,463
Mar-13   93,573,969,245   7,896,498,260   2,010,832,630  4,678,698,462     495,366,545
Apr-13   123,865,614,055  9,805,224,566   2,447,109,450  5,630,255,184     556,924,273
May-13   134,587,099,561  9,430,865,048   2,400,946,680  6,117,595,435     537,809,624
Jun-13   162,771,803,250  11,062,086,463  2,601,863,331  8,138,590,163     683,197,490
Jul-13   120,920,111,089  8,266,215,553   2,142,134,920  5,496,368,686     473,106,840
Aug-13   136,237,441,349  9,079,813,726   2,188,338,764  6,192,610,970     512,188,750

[Chart: NASDAQ Exchange daily peak messages, Jan-13 to Sep-13, y-axis 0 to 600,000,000]
Annual change: OPRA +69%, CQS +10%, UQDF −6%
10. Our legacy solution
• On-premises MPP database
  – Relatively expensive, finite storage
  – Required periodic additional expenses to add more storage
  – Ongoing IT (administrative) human costs
• Legacy BI tool
  – Requires developer involvement for new data sources, reports, dashboards, etc.
11. New Solution: Amazon Redshift
• Cost effective
  – Redshift is 43% of the cost of our legacy solution, assuming equal storage capacities
  – And that doesn't include ongoing IT costs!
• Performance
  – Easily outperforms our legacy BI/DB solution
  – Inserts 550K rows/second on a 2-node 8XL cluster
• Elastic
  – Add additional capacity on demand; easy to grow our cluster
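Ingest rates like the 550K rows/second figure above depend on Redshift's parallel COPY, which loads fastest when the input is pre-split into multiple files so every slice loads simultaneously. A sketch of that pre-splitting step (file names and slice count are illustrative, not NASDAQ's actual pipeline):

```python
import os
import tempfile

def split_for_parallel_load(src_path, out_dir, num_slices):
    """Split a flat file into num_slices roughly equal parts so a
    parallel loader (e.g. Redshift COPY from an S3 prefix) can assign
    one part per slice. Line-oriented; keeps records intact."""
    with open(src_path) as src:
        lines = src.readlines()
    chunk = -(-len(lines) // num_slices)  # ceiling division
    parts = []
    for i in range(num_slices):
        part_path = os.path.join(out_dir, f"part_{i:04d}")
        with open(part_path, "w") as out:
            out.writelines(lines[i * chunk:(i + 1) * chunk])
        parts.append(part_path)
    return parts

# Usage: a 10-line file split 4 ways yields parts of 3, 3, 3, and 1 lines.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "events.csv")
    with open(src, "w") as f:
        f.writelines(f"row{i}\n" for i in range(10))
    parts = split_for_parallel_load(src, d, 4)
    print([sum(1 for _ in open(p)) for p in parts])  # [3, 3, 3, 1]
```

In production the parts would be gzipped and written to S3 under one prefix, then loaded with a single COPY.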
12. New Solution: Pentaho BI/ETL
• Amazon Redshift partner
  – http://aws.amazon.com/redshift/partners/pentaho/
• Self-service
  – Tools empower BI users to integrate new data sources and create their own analytics, dashboards, and reports without requiring development involvement
• Cost effective
13. Net Result
• The new solution is cheaper, faster, and offers capabilities our business didn't have before
  – Empowers our business users to explore data like they never could before
  – Reduces IT and development bottlenecks
  – Improves margin (reduces expense and supports business decisions to grow revenue)
15. Who am I? Kevin Diamond
• CTO of HauteLook, a Nordstrom Company
• Oversee all technology, infrastructure, data,
engineering, etc.
• Major focus on great customer experience and
the analytics to provide it
16. What is HauteLook?
• Private sale, members-only limited-time sale events
• Premium fashion and lifestyle brands at exclusive prices of
50-75% off
• Over 20 new sale events begin each morning at 8am PST
• Over 14 million members
• Acquired by Nordstrom in 2011
17. Why a Data Warehouse?
• Centralized storage of multiple data sources
• Singular reporting consistency for all departments
• Data model that supports analytics not transactions
• Operational reports vs. analytical reports
– Real-time vs. previous day
18. Why Amazon Redshift?
• Looked at some competitors:
  – Ranged from $ to $$$
  – All required software, implementation, and BIG hardware
• Skipped the RFP
• Jumped into the public beta of Amazon Redshift and never looked back
19. How We Implemented Amazon Redshift
• ETL from MySQL and Microsoft SQL Server into AWS across a Direct Connect line, storing on S3
• Also used S3 to dump flat files (iTunes Connect data, web analytics dumps, log files, etc.)
• Used AWS Data Pipeline to execute Sqoop and Hadoop running on EC2 to load data into Amazon Redshift
• Redshift data model based on a star schema, which looks something like …
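As a generic illustration of the star-schema shape mentioned above (hypothetical table names, not HauteLook's actual model), a fact table surrounded by dimension tables might look like this; SQLite stands in for Redshift here for portability, and Redshift would additionally pick DISTKEY/SORTKEY columns:

```python
import sqlite3

# Generic star schema: one fact table referencing dimension tables.
# Hypothetical names for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, cal_date TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE dim_member  (member_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_order_line (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    member_key INTEGER REFERENCES dim_member(member_key),
    qty INTEGER, revenue REAL
);
""")
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(20131101, "2013-11-01"), (20131102, "2013-11-02")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "BrandA"), (2, "BrandB")])
conn.executemany("INSERT INTO dim_member VALUES (?, ?)",
                 [(10, "West"), (11, "East")])
conn.executemany("INSERT INTO fact_order_line VALUES (?, ?, ?, ?, ?)",
                 [(20131101, 1, 10, 2, 50.0),
                  (20131101, 2, 11, 1, 80.0),
                  (20131102, 1, 10, 3, 75.0)])

# The kind of rollup a star schema serves: revenue by brand.
rows = conn.execute("""
    SELECT p.brand, SUM(f.revenue)
    FROM fact_order_line f JOIN dim_product p USING (product_key)
    GROUP BY p.brand ORDER BY p.brand
""").fetchall()
print(rows)  # [('BrandA', 125.0), ('BrandB', 80.0)]
```

The point of the shape: analytical queries always join the narrow fact table to small dimensions, which columnar stores like Redshift scan and aggregate very efficiently.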
21. Usage with Business Intelligence
• Already selected a BI Tool
• Had difficulty deploying in the cloud
• But worked great on-premises
• Easily tied into Amazon Redshift using ODBC Drivers
• BUT, metadata for reports had to live in MSSQL
• Ported many SSIS/SSRS reports over
– But only the analytical reports!
23. Amazon Redshift Instances
• We use a little under 2 TB
• First tried 2 big 8XL instances (in passive failover mode) to get great performance
• Cost us $$$
• Then we tested a cluster of 6 XL instances
• It performed better and allowed more query concurrency, in all but a handful of cases that really needed the 8XL power
• Cost us $
• Duh! That's why we distribute everything else!!
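The economics of the two cluster shapes can be sketched from the pricing slide's $0.85/hour XL on-demand rate; the 8XL rate is assumed here to be 8x the XL rate (based on the 8:1 capacity ratio, not a quoted price):

```python
# Rough cost comparison of the two cluster shapes discussed above.
# XL on-demand rate ($0.85/hr) is from the pricing slide; the 8XL rate
# is ASSUMED to be 8x that (capacity ratio), not a quoted price.
XL_HOURLY = 0.85
XL8_HOURLY = 8 * XL_HOURLY

big_cluster = 2 * XL8_HOURLY    # 2 x 8XL, passive failover pair
small_cluster = 6 * XL_HOURLY   # 6 x XL cluster

print(f"2x8XL: ${big_cluster:.2f}/hr, 6xXL: ${small_cluster:.2f}/hr")
print(f"6xXL runs at {small_cluster / big_cluster:.0%} of the 2x8XL cost")
```

Under that assumption the 6-XL cluster costs roughly a third as much per hour while also spreading queries across more nodes.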
24. Some Firsthand Experience
• ETL was the hardest part
• Amazon Redshift performs awesome
• Someone needs to make a great client SQL tool
• MicroStrategy works great on it (just wish it loved running in EC2)
• Saving a ton, thanks to:
  – No hardware costs
  – No maintenance/overhead (rack + power)
  – Annual costs equivalent to just the annual maintenance of some of the cheaper on-premises DW options
25. Conclusion / Last Advice
• Only use 8XL instances if you need >2 TB of space
  – Otherwise distribute across a bunch of XL nodes
• Buy reserved instances (we still need to do this!) since you will likely have this always on
• Although we haven't yet, the idea of a flexible scale-up/down DW is crazy awesome; maybe during Holiday we will
• We probably could have used Elastic MapReduce instead of our own Hadoop, but weren't sure how it would play with Sqoop
• Almost all BI tools play with Amazon Redshift now, so choose what is right for your business, and make sure it works in EC2 before just putting it there
• Communication between AWS and your DC is easy and fast, but I recommend Direct Connect
• It passed our rigorous information security standards, but we run it in a VPC
27. roundarch isobar
Our services across bought, owned, and earned media:

Strategies: We digitally transform business processes and disrupt industries
• Audience insight
• Research: competitive, segmentation, persona development, heuristics
• Business planning: competitive & industry analysis, business cases, maturity models, roadmaps
• Strategies: brand, interactive, multichannel, social, content

Campaigns: We create, measure, and optimize digitally focused campaigns
• Communications planning
• Creative: advertising, visual design, content creation, studio production
• Optimization: analytics, monitoring, SEO, MVT, media ROI analysis

Experiences: We produce joyful experiences that inspire consumer interaction
• Requirements and specifications: content analysis and specs, functional requirements, functional specifications
• User experience design: information architecture, taxonomy and metadata, interaction design, mobile

Platforms: We design and build flexible and scalable technology solutions
• Content management, search, portals, mobile, front-end technology, internet-enabled devices/wearables, social apps, web services, security, big data, hosting

Products: We invent digital products that generate new revenue streams
• Digital products, digital product extensions, brand as a service
28. We have served the U.S. Air Force since 2001, building their enterprise portal and many mission-critical applications
U.S. Air Force
Key metrics for our USAF work include:
• 900,000+ registered users
• Portal availability over 99.9% of the time
• 700,000+ PK-E users
• 28 production enterprise services
• Response time worldwide: 3 seconds for 80% of all pages
• Over 300 applications available
• Over 1.2 million logins/week
• Public-facing and secure private instances (NIPR & SIPR)
• 124,000 unique daily users
• 4-5+ million pages daily (40-70 Mbit/sec)
• Portal support for over 5,000 "Communities of Interest"
29. Transforming in-stadium operations through a touch-screen command center
New York Jets
Our executive touch-screen environment provides real-time stadium
and game data, allowing the Jets owner, Woody Johnson, to monitor
the fan experience during game time and make operational
decisions that help maximize sales. The command center provides
summary-level and drill-down views of stadium operations such as
tickets, parking and concessions. It also creates predictive
algorithms that help identify pinch points and open revenue
opportunities.
“We brought the big picture close enough to
identify new, better ways to do business.”
30. Through a joint venture with Copia Capital, we created a new product offering for William Blair
William Blair | Investment Research Management System
• Facilitates collaboration between portfolio managers and analysts
• Provides a holistic view of a company/stock
  – What is everything our organization knows about AAPL?
• Digitizes PDF/Excel tools and reports to enable rich, dynamic interactions
• Simplifies content creation; e.g., comments, recommendation reports, document upload
• Rich charting and visualization of analytics
Technology:
• JavaScript, HTML5, CSS3
• jQuery, JavaScriptMVC, Less
• JSON web services
• Java, Spring, JPA, MongoDB
• User comment: "We love how fast it is!"
31. What is the focus of your CMO today?
Optimize marketing spend across all channels (bought, earned, and owned)
33. marketing effectiveness stages
Platform components: DLP, Scorecard, Sonar, AMNET, Compass
Stages: Scorecard → Learn → Analyze → Optimize (real-time and non-real-time)
• Centralized cross-channel big data platform
• Standardized cross-channel reporting tools
• Discovery tools to identify channel optimization opportunities
• Modeling solutions
• Channel experience enhancements
• Improved media buying, planning & reporting functions
• Real-time integration into DSP
• A/B-testing-based micro-segment adjustments
34. So what have we accomplished?
Built a marketing analytics platform, Radar: a scalable multi-tenant SaaS platform on Amazon that takes in 200+ feeds (1 TB/week) of varying frequency, granularity, and classification, enabling in-time analytics, reporting, and optimization for multiple clients with customized metrics, with first launch in 3 months.
36. scorecard logical architecture
Data sources by channel, feeding the Scorecard app and detailed analytic reports:
• Display: Google DFA
• Paid search: Google, Bing, Marin
• Organic search: Google, Bing
• Digital video: custom
• Site metrics: Google, Omniture
• Sales: TBD
• TV, radio, print, OOH: DDS
• Earned social: Facebook, Twitter, competitive, custom
• Paid social: Facebook, Twitter
Consumers: media team, planners, client team, and client stakeholders
37. data sources
Data volume: voluminous data across digital, CRM, and research sources
• Surveys
• Demographics
• Campaigns
• Search
• Mobile
• Attribution
• Site
• Social
• Display
Variety and granularity:
• Cookie level
• UGC
• Geospatial
• Weather
• Sales
• Competitive
39. ETL
• Extract: source files (radio, display ads, search, social feeds) are landed on Amazon S3, with Amazon Glacier for archive
• Transform: Pig on Amazon EMR (Hadoop) cleanses, standardizes, and validates the data
• Load: the COPY command loads Pig output from S3 into Amazon Redshift
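The load step above relies on Redshift's COPY from S3. A small sketch that assembles such a statement; the bucket, table, and IAM role names are placeholders, the GZIP/DELIMITER options are common choices rather than what the team necessarily used, and note that IAM_ROLE is current Redshift syntax (at the time of this talk, COPY took a CREDENTIALS clause instead):

```python
def build_copy_statement(table, s3_prefix, iam_role, delimiter="|"):
    """Assemble a Redshift COPY statement that loads every file under
    an S3 prefix in parallel. Authenticates via an IAM role ARN."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"DELIMITER '{delimiter}' GZIP;"
    )

# Placeholder names for illustration only.
sql = build_copy_statement(
    table="display_ads_clean",
    s3_prefix="s3://example-bucket/pig-output/display_ads/",
    iam_role="arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

Because the FROM clause is a prefix rather than a single file, every compute slice pulls its share of the Pig output files concurrently.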
40. data warehouse
• Amazon Redshift handles humongous aggregation quickly
• Cheap, fast, easily scalable
• ODBC and JDBC access for BI / ad-hoc analysis (Tableau, other BI tools, analysts)
41. aggregation
• Mapping: join performance data (radio, display ads, search, social) with metadata (product, campaign)
• Multi-step aggregation in Amazon Redshift using SQL (views, clicks, CTR, CPC, etc.)
• Load aggregates into MySQL (RDS) for sub-second web response
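The join-then-aggregate flow above can be sketched end to end. Here SQLite stands in for Redshift and the final table stands in for the MySQL serving store; table and column names are invented for illustration:

```python
import sqlite3

# SQLite stands in for Redshift; invented names for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ad_events (ad_id INTEGER, views INTEGER, clicks INTEGER, spend REAL);
CREATE TABLE ad_meta   (ad_id INTEGER, campaign TEXT);
""")
db.executemany("INSERT INTO ad_events VALUES (?, ?, ?, ?)",
               [(1, 1000, 20, 10.0), (2, 500, 5, 4.0), (3, 2000, 50, 25.0)])
db.executemany("INSERT INTO ad_meta VALUES (?, ?)",
               [(1, "spring_sale"), (2, "spring_sale"), (3, "brand")])

# Step 1: join performance data with campaign metadata.
# Step 2: aggregate and derive CTR (clicks/views) and CPC (spend/clicks).
# Step 3: materialize into a small serving table (the MySQL stand-in).
db.execute("""
CREATE TABLE campaign_aggregates AS
SELECT m.campaign,
       SUM(e.views) AS views,
       SUM(e.clicks) AS clicks,
       1.0 * SUM(e.clicks) / SUM(e.views) AS ctr,
       SUM(e.spend) / SUM(e.clicks) AS cpc
FROM ad_events e JOIN ad_meta m USING (ad_id)
GROUP BY m.campaign
""")
rows = db.execute(
    "SELECT campaign, views, clicks, round(ctr, 4), round(cpc, 3) "
    "FROM campaign_aggregates ORDER BY campaign").fetchall()
print(rows)
```

The heavy scan-and-group work stays in the warehouse; only the tiny aggregate table is copied out, which is what makes sub-second web responses possible.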
42. data workflow
• Jenkins: client+channel ETL job control dashboard
• Ruby: provisioning, job flow, data intake/extract
• Amazon DynamoDB: state management
• Amazon EMR clusters: on-demand, job-initiated
• Flow: S3 → Hadoop (EMR) → Redshift → MySQL (RDS)
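Job-state management like the DynamoDB usage above usually hinges on conditional transitions, so a crashed or duplicate worker cannot move a job backwards. A minimal in-memory sketch (a dict stands in for DynamoDB; the real system would get the same compare-and-set guarantee from conditional writes):

```python
# In-memory sketch of ETL job-state management. A dict stands in for
# DynamoDB; production code would use conditional writes for the same
# compare-and-set guarantee.
VALID_TRANSITIONS = {
    "pending": {"extracting"},
    "extracting": {"transforming", "failed"},
    "transforming": {"loading", "failed"},
    "loading": {"done", "failed"},
}

jobs = {}

def transition(job_id, expected, new_state):
    """Move a job to new_state only if it is currently in `expected`
    and the transition is legal. Returns True on success."""
    current = jobs.get(job_id, "pending")
    if current != expected or new_state not in VALID_TRANSITIONS.get(current, set()):
        return False
    jobs[job_id] = new_state
    return True

print(transition("job-1", "pending", "extracting"))       # True
print(transition("job-1", "extracting", "transforming"))  # True
print(transition("job-1", "pending", "extracting"))       # False: stale worker
```

With state centralized this way, Jenkins jobs and on-demand EMR clusters can each pick up, retry, or skip work without stepping on one another.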
43. SaaS dashboard
• Designed for redundancy: hardware and location
• Multi-tenant: client1.com, client2.com
• Managed services: DNS, load balancing, Elastic Beanstalk on EC2, ElastiCache, MySQL RDS
• Automated stack provisioning for clients
44. AWS advantages
• Innovate quickly with reduced risk
• Faster time to market (Java, Ruby, Python)
• Lower operational overhead: we supply developers and DevOps, Amazon handles AWS ops
• Highly scalable
45. learnings
• Metadata is more important than the data
• Design for scalability upfront
• Always explore better ways to aggregate
• Cost management is very important
• Build agile: perform early end-to-end validation on a smaller dataset
• Separate data visualization, data cleansing, storage, and data aggregation
• Be smart about implementing data aggregation routines across multiple granularities
46. Please give us your feedback on this presentation (DAT205)
As a thank-you, we will select prize winners daily for completed surveys!