Learn how HBX, the digital learning initiative of Harvard Business School, became data-centric to deliver an innovative online business learning experience that improves student outcomes, teaching process and staff effectiveness while promoting continuous innovation across teams.
In this webinar, you’ll find out how Informatica and Amazon Redshift helped HBX deliver a solution to:
Rapidly and automatically integrate and unify multiple siloed data sources into a trusted cloud data warehouse.
Accelerate reporting, dashboarding, and self-service analytics for data-informed decisions, ongoing agile experimentation, and business enhancement, and much more.
In addition, learn how AWS and Informatica can help you deliver your own agile analytics initiative and use the power of scalable cloud data warehousing environments to fuel all of your data-centric initiatives.
HBX: Harvard Business School's Digital Education Goes Data-Centric with Amazon Redshift + Informatica
1. HBX: Harvard Business School’s Digital Education Goes Data-Centric with Amazon Redshift + Informatica
2. Today’s Presenters
David Potes, Manager, Solutions Architecture, Amazon Web Services
Andrew McIntyre, Solutions Architect, Informatica
Ryan Frazier, Director, Systems Engineering & Operations, HBX
3. Today’s Agenda
• An overview of AWS and AWS Marketplace, with an emphasis on AWS Big Data solutions
• Informatica solution overview
• The HBX success story with AWS and Informatica
• Q&A/Discussion
4. Learning Objectives
1. Becoming a data-centric organization for best business results
2. The benefits of agile analytics with cloud data warehouse
3. Rapidly integrating high volume, disparate data into a trusted source
6. AWS Big Data Portfolio
Collect Store Analyze
Amazon Kinesis Firehose
AWS Direct Connect
AWS Snowball
Amazon Kinesis Analytics
Amazon Kinesis Streams
Amazon S3
Amazon Glacier
Amazon CloudSearch
Amazon RDS, Amazon Aurora
Amazon DynamoDB
Amazon Elasticsearch
Amazon EMR
Amazon EC2
Amazon Redshift
Amazon Machine Learning
Amazon QuickSight
AWS Data Pipeline
AWS Database Migration Service
AWS Glue
Amazon Athena
7. Legacy architectural models lead to dark data
[Chart: Enterprise Data vs. Data in Warehouse]
Very Expensive
Lock-In
Proprietary
Inflexible licensing
8. Traditional Data Warehousing
Business Reporting
Complex pipelines and queries
Secure and Compliant
Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
Japanese Mobile Phone Provider
Powering 100 marketplaces in 50 countries
World’s Largest Children’s Book Publisher
Bulk Loads and Updates
9. Log Analysis
Log & Machine IoT Data
Clickstream Events Data
Time-Series Data
Cheap – Analyze large volumes of data cost-effectively
Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real time analytics
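Micro-batch loading means grouping incoming events into small batches and loading each batch in bulk rather than row by row. The batching step can be sketched in Python as follows. The `micro_batches` helper and the sample events are hypothetical, and this is only an illustration of the idea, not how Kinesis Firehose or Redshift implement it:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size events."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


if __name__ == "__main__":
    clicks = [{"event_id": i} for i in range(7)]  # made-up clickstream events
    for batch in micro_batches(clicks, batch_size=3):
        print(len(batch), "events ready for one bulk load")
```

In a real pipeline each batch would be handed to a bulk loader; a time-based flush is usually added so that slow streams still load promptly.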
Interactive data analysis and recommendation engine
Ride analytics for pricing and product development
Ad prediction and on-demand analytics
10. Business Applications
Multi-Tenant BI Applications
Back-end services
Analytics as a Service
Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can focus on your business applications
Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline
Infosys Information Platform (IIP)
Analytics-as-a-Service
Product and Consumer Analytics
11. Redshift is used for mission-critical workloads
Financial and management reporting
Payments to suppliers and billing workflows
Web/Mobile clickstream and event analysis
Recommendation and predictive analytics
12. Amazon Redshift is available everywhere AWS is
Dublin
Frankfurt
London
Seoul
Sydney
Tokyo
Singapore
Beijing
Mumbai
Sao Paulo
US East - Virginia
US West - Oregon
US West – Northern California
GovCloud
Columbus Ohio
Montreal
Currently Available
Coming soon
13. Amazon Redshift is fast
“Did I mention that it’s ridiculously fast? We’re using it to provide our analysts with an alternative to Hadoop”
“After investigating Redshift, Snowflake, and BigQuery, we found that Redshift offers top-of-the-line performance at best-in-market price points”
“…[Redshift] performance has blown away everyone here. We generally see 50-100X speedup over Hive”
“We regularly process multibillion row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily”
“We saw a 2X performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement”
“On our previous big data warehouse system, it took around 45 minutes to run a query against a year of data, but that number went down to just 25 seconds using Amazon Redshift”
14. And has gotten faster...
5X Query throughput improvement this year
Memory allocation (launched)
Improved commit and I/O logic (launched)
Queue hopping (launched)
Query monitoring rules (coming soon)
Power start (coming soon)
Short query bias (coming soon)
10X Vacuuming performance improvement
Ensures data is sorted for efficient and fast I/O
Reclaims space from deleted rows
Enhanced vacuum performance leads to better system throughput
Fast
Efficient
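The two jobs of a vacuum listed on this slide, re-sorting data and reclaiming space from deleted rows, can be pictured with a toy in-memory model. The `Row` class and `vacuum` function below are hypothetical names for illustration only. This is a sketch of the concept, not Redshift's actual implementation:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    sort_key: int
    payload: str
    deleted: bool = False  # in this toy model a delete only marks the row


def vacuum(rows: List[Row]) -> List[Row]:
    """Drop rows marked deleted and return the rest ordered by sort_key."""
    live = [row for row in rows if not row.deleted]
    return sorted(live, key=lambda row: row.sort_key)


if __name__ == "__main__":
    table = [Row(3, "c"), Row(1, "a", deleted=True), Row(2, "b")]
    for row in vacuum(table):
        print(row.sort_key, row.payload)
```

After the call the table holds fewer rows and they are in sort-key order, which is what lets range scans skip data they do not need.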
15. Amazon Redshift is easy to use
“With Amazon Redshift and Tableau, anyone in the company can set up any queries they like - from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts had in different areas”
“The doors were blown wide open to create custom dashboards for anyone to instantly go in and see and assess what is going in our ad delivery landscape, something we have never been able to do until now.”
“Provides an easy-to-use mechanism for querying data with quick and uniform response times that analysts can use to run research projects and perform in-depth analysis…We don’t have to pre-allocate resources and can easily scale up to meet demand and then scale down for efficiency”
16. Amazon Redshift is easy to use
Provisioning in minutes
Automatic patching
SQL - Data loading
Backups are built-in
Security is built-in
Compression is built-in
17. Amazon Redshift is cheap
“450,000 online queries 98 percent faster than previous
traditional data center, while reducing infrastructure costs by
80 percent.”
“Annual costs of Redshift are equivalent to just the annual
maintenance of some of the cheaper on-premises options
for data warehouses.”
“Most competing data warehousing solutions would have cost
us up to $1 million a year. By contrast, Amazon Redshift costs
us just $100,000 all-in, representing a total cost savings of
around 90%”
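The 90% figure in the last quote follows directly from the two annual costs it cites. A quick check in Python (the `savings_percent` helper is a hypothetical name used only for this illustration):

```python
def savings_percent(previous_cost: float, new_cost: float) -> float:
    """Return the cost reduction as a percentage of the previous cost."""
    if previous_cost <= 0:
        raise ValueError("previous_cost must be positive")
    return (previous_cost - new_cost) / previous_cost * 100


if __name__ == "__main__":
    # Figures from the quote: up to $1 million/year versus $100,000/year.
    print(f"{savings_percent(1_000_000, 100_000):.0f}% saved")
```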
20. Amazon Redshift
Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year
Available now
Queue hopping
10X VACUUM performance improvement
Node fault tolerance
Enhanced VPC routing
IAM support for LOAD/UNLOAD
Auto compression for CTAS
TimestampTZ datatype
Query Monitoring rules
Coming soon
Automatic and incremental background VACUUM
Short query bias
Power start
IAM Authentication for DB users
Auto compression for new tables
Enhanced JSON & AVRO ingestion performance
24. 25 May 2017
BUILDING A CLOUD BASED
DATA WAREHOUSE
hbx.hbs.ed
RYAN FRAZIER
Director, Systems Engineering & Operations
Harvard Business School/HBX
@rrfrazier
25. Agenda
• About HBX
• HBX Data Management Initiative
• Architecture & Implementation
• Challenges
• Reflections
27. HBX Overview
Newest division at HBS, tasked with reimagining business education for the digital age
First course June 2014 to deliver HBS experience online
Multiple Course Offerings
CORe (Business Analytics, Financial Accounting, and Economics for Managers)
Disruptive Strategy with Clayton Christensen
Leading with Finance
Negotiation Mastery
Managing Your Career Development
The teaching model sets HBX apart from many online learning options and is reflective of the HBS in-person classroom approach
28. HBX Course Platform (AWS)
Mainly asynchronous online business education
Engagement through student interaction in cohorts of ~400
Case-based learning with highly interactive teaching elements and peer help
29. HBX Live Platform
Studio-based virtual classroom
Synchronous audio/video with chat, polls, boards
Up to 60 global students on studio wall, hundreds or more observers
31. Why Create a Data-Driven Culture?
Improve Effectiveness
• Scale data intensive activities like marketing, admissions, & grading
• Use data to test ideas and improve quality of decisions
Enhance Outcomes
• Identify challenging content
• Evaluate and improve interactive content, social engagement & retention
• Proactively support struggling students
Refine Pedagogy
• Evaluate new pedagogical approaches
• Optimize evaluation approaches
• Support pedagogical research activities and innovation
Foster Innovation & Continuous Improvement
• Identify and evaluate innovation opportunities
• Drive continuous improvement
Students Staff Faculty
32. Data Management Program Objectives
Integrate Data Sources into Comprehensive Data Warehouse
Build Reports and Dashboards
Enable Self Service
Ensure Data Quality and Integrity
33. Enablers for Building Data Driven Culture at HBX
Strong Partners
• Use off-shore partner Mindtree to accelerate
• Active engagement of vendors on technology challenges
Education
• Short internal presentations
• Data Analysis Exercise at all-staff team meeting
Program Governance
• Active interest & involvement from Business Areas
• Alignment to organizational priorities
Experimentation
• HBX willingness to try new things
• Helps drive engagement with vendors
35. Core Tool Selection
Data Warehouse: Amazon Redshift + Snowflake
• Chose Redshift for scalability, performance, ease of management
• Aligns with AWS platform/ecosystem focus at HBX
• Easy integration with other AWS services
• Simplified vendor, contract, and cost management
• Leverage existing operations tools for monitoring, alerting, etc.
• Snowflake for JSON data lake and intermediate processing
ETL: Informatica Cloud
• Myriad of pluggable connectors
• S3, Redshift, Salesforce, ServiceNow, ODBC, REST
• Cloud-based architecture w/minimal infrastructure management
• Productivity-focused development tools for rapid implementation
• Extensible offering for future initiatives (MDM, Customer 360)
• Aligned to larger University vendor relationship & investments
Reporting/Analytics: Tableau
36. HBX Data Management by the Numbers
Source Systems
• 4 Major Systems
• 46 databases
• 598 Tables
• 98 Mongo Collections
Data Warehouse
• 3 Redshift Nodes
• 1 Snowflake warehouse
• 18 Schemas
• 434 tables
• 7479 fields
• 1,184,804,787 rows
• 278 GB
Daily ETL Process
• 770 jobs
• 35 MM+ rows
* Updated 5/2017
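For a sense of scale, the figures on this slide can be turned into a few averages. The sketch below is a back-of-envelope reading only: it assumes decimal gigabytes and treats "35 MM+" as exactly 35 million, so the daily figures are lower bounds. The constant and function names are illustrative.

```python
ROWS = 1_184_804_787     # warehouse rows (slide figure)
TABLES = 434
FIELDS = 7_479
SIZE_GB = 278
DAILY_JOBS = 770
DAILY_ROWS = 35_000_000  # slide says "35 MM+", so this is a lower bound


def summarize() -> dict:
    """Derive simple averages from the slide's headline numbers."""
    return {
        "avg_rows_per_table": ROWS / TABLES,
        "avg_fields_per_table": FIELDS / TABLES,
        # Assumes decimal gigabytes (10**9 bytes).
        "avg_bytes_per_row": SIZE_GB * 10**9 / ROWS,
        "avg_rows_per_job": DAILY_ROWS / DAILY_JOBS,
        "daily_rows_share_pct": DAILY_ROWS / ROWS * 100,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}")
```

Averages hide skew: a handful of event tables likely account for most of the rows, so these numbers describe the warehouse as a whole rather than any typical table.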
37. HBX Core Data Ecosystem (at Launch)
Course Platform Ver. A
Course Platform Ver. B
Historical Data
MongoDB
MySQL
Admin System
MySQL
Amazon
Redshift
Tableau
Server
Secure
Agent
Informatica
Cloud
Services
MongoDB
MySQL
MongoDB
MySQL
Reporting
Copy
Metadata
38. HBX Data Ecosystem—Spring 2017
Reporting
Copy
Course Platform
MongoDB MySQL
Historical Data
MongoDB MySQL
Admin
System
MySQL
Salesforce
sync
ServiceNow
Google
Analytics
HBX AWS Acct
Snowflake
Virtual DW
Updated 3/31/2017
Snowflake
Metadata
Services
Snowflake Data
Persistence
Interactive
Reports
External Data Sources
Ad Hoc Query/Reporting
Course Platform
MongoDB data (json)
HBX Live
Hubspot
Ad Hoc Query/Reporting
Local Data Center
Firehose
Tableau
Secure
Agent
Existing
Data Flow
Proposed
Future
Data Flow
Informatica
Cloud
Services
Amazon
Redshift
40. Achieving Technical Objectives
Integrate Data Sources into Comprehensive Data Warehouse: daily load from 4 core systems (Legacy Business System, Salesforce, Course Platform, HBX Live)
Build Reports and Dashboards: 33 production reports, serving all HBX business units
Enable Self Service: >30 monthly active Tableau users (~40% of staff)
Ensure Data Quality and Integrity: 0.5 FTE technical QA staff and robust development/test/release process
41. Reflections on HBX Implementation
Cloud Services Significantly Increase Agility
• Rapid provisioning of services
• Easy, on-demand scaling as business grows
• Cost-effective to support multiple environments
• Out-of-the box flexibility and feature enhancements
Semi-structured data stores can present challenges for reporting
• Think about data structures during development
• Consider your data pipeline and how data will be used
• Rapidly evolving and maturing space
QA & Testing are Critical to Success
• Reports and results only as good as the input
• Plead with your users to report problems and concerns!
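The point above about semi-structured stores comes down to the fact that nested documents do not map directly onto the flat rows that reporting tools expect. A minimal Python sketch of one common approach is to flatten nested fields into dot-separated column names. The `flatten` helper and the sample document are hypothetical, not taken from the HBX pipeline:

```python
from typing import Any, Dict


def flatten(document: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level with dot-separated column names."""
    flat: Dict[str, Any] = {}
    for key, value in document.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value  # lists and scalars are kept as-is
    return flat


if __name__ == "__main__":
    # Made-up document in the shape a course platform might store.
    doc = {"student_id": 42, "progress": {"module": 3, "score": {"quiz": 0.9}}}
    print(flatten(doc))
```

Arrays are deliberately left untouched here. Deciding whether they become child tables, repeated rows, or JSON strings is exactly the kind of data-structure question the slide suggests settling during development.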
42. HBX Business Outcomes
Automation of Manual Business Processes
• Moving from complex, time-consuming, error-prone spreadsheet processes to purpose-built data products
Deeper Insight Into Prospects and Participants
• Ability to bridge top (marketing, leads) and bottom (applicants, registrants, and completers) of funnel: who are our prospects, how do we improve our yield, and how does this relate to outcomes?
Growth of a “Data Driven” Culture
• Increasing use of data to understand our users, identify challenges and opportunities, and drive decisions
• More focus on hypothesis driven experiments and A/B testing
43. What’s Next and Why?
Analytics
• Greater understanding of users
• Research and pedagogy
Additional Data Sources
• Greater understanding of users & systems
• Automate additional processes
Streaming & Machine Data
• Real-time data reporting
• Access additional data
Machine Learning
• Automate additional processes
58. Informatica’s Comprehensive Solution
Intelligent
Data Platform
CLOUD
REAL TIME/
STREAMING
BIG
DATA
TRADITIONAL
DATA
INTEGRATION
BIG DATA
MANAGEMENT
MASTER DATA
MANAGEMENT
DATA
QUALITY
DATA
SECURITY
CLOUD DATA
MANAGEMENT
Products
Solutions
MONITOR AND MANAGE
CONNECTIVITY
COMPUTE
Enterprise Cloud
Data Management
CUSTOMER
360
DATA
GOVERNANCE
REFERENCE
360
INTELLIGENT
DATA LAKE
SECURE@SOURCE
PRODUCT
360
ENTERPRISE
INFORMATION
CATALOG
SUPPLIER
360
(ENTERPRISE UNIFIED METADATA
INTELLIGENCE)
The main goal of this slide is to show platform completeness
Key talking points:
1/ Any big data application has a data acquisition phase, a storage need, and an analytics need
2/ Quick Service overviews. Go fast; especially on ones we talk about later.
Collect
Direct Connect – private, low latency connections between your data centers and ours. Most customers use a pair for redundancy and availability
Import/Export – for moving large volumes of data, FedEx is your highest bandwidth option. With Snowball, we’ll ship you a ruggedized case; you load it up and send it back
Kinesis – real time streaming data; has Streams for custom apps, Firehose for easy Redshift/S3 integration, Analytics for real time SQL
AWS IoT platform – complete suite for IoT devices to make it easy to manage them and get telemetry data into AWS
Store
S3 is the foundation of any big data app on AWS. Scalable, low cost, default landing zone for data ($0.023/GB-Mo and drops from there with scale. That’s $23/TB for a month, less than $300 for a year)
Glacier is the sister service for cold storage like data you need for compliance. Age data into it using lifecycle. $0.004/GB-Mo, or $48 per TB per year!
DynamoDB – NoSQL store; zero admin; JSON + Key Value with single digit millisecond latency. Great for high concurrency reads and writes
Elasticsearch – managed elasticsearch clusters for operational intelligence and search
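The per-terabyte figures quoted for S3 and Glacier above follow from the per-GB monthly prices. A small Python check, assuming decimal terabytes (1,000 GB) and the prices as stated in these notes; current pricing may differ, and `cost_per_tb` is an illustrative helper:

```python
GB_PER_TB = 1_000  # assumes decimal terabytes, as the rounded figures imply


def cost_per_tb(price_per_gb_month: float, months: int = 1) -> float:
    """Cost in dollars of storing one terabyte for the given number of months."""
    return price_per_gb_month * GB_PER_TB * months


if __name__ == "__main__":
    print(f"S3 per TB-month:     ${cost_per_tb(0.023):.2f}")
    print(f"S3 per TB-year:      ${cost_per_tb(0.023, months=12):.2f}")
    print(f"Glacier per TB-year: ${cost_per_tb(0.004, months=12):.2f}")
```

The S3 yearly figure comes out at $276, consistent with the "less than $300 for a year" statement, and the Glacier figure matches the $48 per TB per year quoted.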
Analyze
EMR for fully managed dynamic clusters for running Hadoop/Spark/Presto/HBase
Athena for interactive queries on S3 Data using Standard SQL with no infrastructure to manage
Redshift – fully managed, petabyte scale DW for $1,000/TB/year
ML – fully managed machine learning
EC2 – run anything you want that runs on Linux or Windows
QuickSight for fast, cost-effective BI
Lambda – serverless compute for event driven computing
And rounding all this out, we have DMS for migrating databases and replicating OLTP to Redshift and Data Pipeline for scheduling and orchestration
Fully Managed – We provision, backup, patch and monitor so you can focus on your data
Fast – Massively Parallel Processing and columnar architecture for fast queries and parallel loads
Nasdaq security – Ingests 5B rows/trading day, analyzes orders, trades and quotes to protect traders, report activity and develop their marketplaces
NTT Docomo - Redshift is NTT Docomo's primary analytics platform for data science and marketing and logistic analytics. Data is pre-processed on premises and loaded into a massive, multi-petabyte data warehouse on Amazon Redshift, which data scientists use as their primary analytics platform.
Pinterest uses Redshift for interactive data analysis. Redshift is used to store all web event data, which is used for KPIs, recommendations and A/B experimentation.
Lyft uses Redshift for ride analytics across the world (rides/location data) - Through analysis, company engineers estimated that up to 90% of rides during peak times had similar routes. This led to the introduction of Lyft Line – a service that allows customers to save up to 60% by carpooling with others who are going in the same direction.
Yelp has multiple deployments of Redshift with different data sets in use by product management, sales analytics, ads, SeatMe (Point of sale analytics) and many other teams.
Analyzes 0s of millions of ads/day, 250M mobile events/day, ad campaign performance and new feature usage
Accenture Insights Platform (AIP) is a scalable, on-demand, globally available analytics solution running on Amazon Redshift. AIP is Accenture's foundation for its big data offering to deliver analytics applications for healthcare and financial services.
Mission critical customer stories – Grabtaxi and reiterate FINRA, Nasdaq
Summary – in 14 regions, adding 3 more
Optimum was formerly called Cablevision
http://hq.vevo.com/vevo-data-science/
Mention that customers can leverage AWS KMS in addition to their HSMs
Customers moving from using traditional databases like Oracle or MSSQL and MPP Data Warehouses to Redshift
Extend:
Easy to meet business demands; Easy to provision, manage, and maintain
Variety of data formats – AVRO, PARQUET, JSON that are not handled well by traditional data warehouses
Migrate
Cost is prohibitive – cloud data warehouse like Redshift is cheaper, easy to scale as you need
Analyze data without storage constraints
SQL on Hadoop
SQL – it is all SQL
Redshift Spectrum – keep the data on S3 and analyze in real time with Redshift
AWS for core infrastructure
Administrative system (migrating to Salesforce over next year)
Multiple Prod Environments—each new release, ensure stability
Reporting Copy + Archive for Historical Data
ETL
Reviewed several options
Picked Informatica to align with HU, HBS data mgmt architecture
Use Cloud version
Redshift for EDW
Aligns to cloud, AWS
Tableau for Reporting
Here are some sample wins in the AWS ecosystem.
Asurion
Competition: Engine and data systems specific hand coding and 35 point solutions.
Why we won:
Most comprehensive end to end hybrid data management solution.
Extensive connectivity: shielded team from maintaining deep expertise for fast changing technologies and data systems.
Reuse visual mappings across multiple engines and systems, instead of hand-coding for each.
Intuitive and consistent UI for improved productivity.
Automation to enable thousands of data pipelines.
Expected benefits:
Single & unified view of authorized, cleansed & standardized business data.
Data available anytime, anywhere in any format.
Increase ROI and save $150K/year in development cost by using Infa instead of manual coding such as SSIS.
Predictive analytics to improve mobile customer engagement and loyalty.