DataXu sits at the heart of the all-digital world, providing a data platform that manages tens of millions of dollars of digital advertising investments from Global 500 brands. The DataXu data platform evaluates 1.5 million online ad opportunities every second for our customers, allowing them to manage and optimize their marketing investments across all digital channels. DataXu employs a wide range of AWS services to run various workloads at scale for the DataXu data platform: CloudFront, CloudTrail, CloudWatch, Data Pipeline, Direct Connect, DynamoDB, EC2, EMR, Glacier, IAM, Kinesis, RDS, Redshift, Route 53, S3, SNS, SQS, and VPC.
In addition, DataXu uses Qubole Data Service (QDS) to offer a Unified Analytics Interface tool to DataXu customers. Qubole, a member of the APN, provides self-managing big data infrastructure in the cloud that leverages spot pricing for cost efficiency and offers fast performance and, most importantly, a streamlined user interface for ease of use.
Attendees will learn how Qubole's self-managing Hadoop clusters in the AWS Cloud accelerated DataXu’s batch-oriented analysis jobs, and how Qubole's integration with Amazon Redshift enabled DataXu to perform low-latency, interactive analysis. The session also looks at how DataXu opened up QDS access to its customers through the QDS user interface, giving them a single tool for both batch-oriented and interactive analysis. Using the QDS user interface, buyers of the DataXu data service can perform all manner of analysis against the data stored in their Amazon S3 bucket.
Speakers:
Scott Ward
Solutions Architect at Amazon Web Services
Ashish Dubey
Solutions Architect at Qubole
Yekesa Kosuru
VP Engineering at DataXu
Getting to 1.5M Ads/sec: How DataXu manages Big Data
1. Getting to 1.5M Ads per Second
How DataXu Manages Big Data
AWS, DataXu, Qubole
March 30th, 2015
2. Today’s speakers
Yekesa Kosuru, VP of Engineering, DataXu
Ashish Dubey, Solutions Architect, Qubole
Scott Ward, Solutions Architect, AWS
3. Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
4. Housekeeping
• The recording link will be distributed to all registrants via email after
the webinar next week
• Please submit your questions and comments using the Chat with
Presenters box located at the bottom left corner of your screen
5. Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
8. Creating Value from Data Assets
• Recommendations, Collective Intelligence: Machine Learning
• Visualization: Dashboards, Business Intelligence
• Measuring Functionality and Services: Ad Hoc Queries, A/B Testing
• Hypothesis Testing & Predictions: Statistical Analysis
• Learning from Social Media Conversations: Sentiment Analysis
Diagram labels: Social, Big Data
10. Big Data + AWS
Big Data | AWS Cloud
Potentially massive data sets | Massive, virtually unlimited capacity
Iterative, experimental style of data manipulation and analysis | Iterative, experimental style of infrastructure deployment/usage
Frequently not a steady-state workload; peaks and valleys | Efficient with highly variable workloads
Time to results is key | Parallel compute clusters from single data source
Hard to configure/manage | Managed services for data storage and analysis
11. AWS Data Services
Data
• Variety: Structured, Unstructured, Text, Binary
• Volume: Gigabytes, Terabytes, Petabytes
• Velocity: Millisecond, Second, Minute, Hour, Day
Services
• Instance Storage: EC2, EBS
• SQL Stores: Redshift, RDS
• Hadoop: EMR
• NoSQL: DynamoDB
• Stream: Kinesis
• Storage Services: S3, CloudFront, Glacier
• Caching: ElastiCache
• Orchestrate: Data Pipeline
12. Amazon Elastic MapReduce
Hosted Hadoop Framework
• Easy to use and fully managed
• Secure
• Resizable clusters to support processing needs
• Support for EC2 spot instances
• Use many query tools to support analysis of
your data
– Hive, Pig, HBase, Spark, BI Tools, etc.
• EMRFS for an S3-backed data store
• Direct integration with other AWS data stores
– S3, Redshift, DynamoDB
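The EMR features listed on this slide (a managed cluster, Spot support, S3-backed storage) can be illustrated with a minimal sketch that builds the request for boto3's `run_job_flow` call. The cluster name, bucket, instance types, counts, release label, and bid price below are illustrative assumptions, not values from the talk, and the function only builds the request without contacting AWS.

```python
from typing import Any, Dict


def build_emr_request(
    name: str,
    log_bucket: str,
    core_nodes: int = 2,
    spot_task_nodes: int = 4,
    spot_bid_usd: str = "0.10",
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Instance types, counts, release label and bid price are assumptions
    chosen for illustration only.
    """
    instance_groups = [
        # Master and core groups run on-demand so that HDFS and the
        # cluster itself survive Spot interruptions.
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
    ]
    if spot_task_nodes > 0:
        # Task nodes hold no HDFS data, so they are the natural place
        # to use Spot capacity.
        instance_groups.append(
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "BidPrice": spot_bid_usd,
             "InstanceType": "m5.xlarge", "InstanceCount": spot_task_nodes})
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Applications": [{"Name": "Hive"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": instance_groups,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("adhoc-analysis", "example-log-bucket")
# With AWS credentials configured, the request could be submitted with:
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request builder separate from the API call makes the cluster layout easy to review and test before anything is launched.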
13. Amazon EMR Architecture
Master instance group
Core instance group
Task instance group
HDFS
Amazon S3
Amazon Redshift
Amazon DynamoDB
14. EMR Security
• Security groups for master and
slave instances
• Instances launch in your VPC
• Encrypt data in S3
• Control who can access S3 data
• API requests require a signed key
Diagram labels: Master instance group; Core instance group; Task instance group; HDFS; Amazon S3; Amazon Redshift; Amazon DynamoDB
15. Amazon Redshift
Petabyte Scale Data Warehouse
• Fully managed data warehouse solution
• Able to achieve petabyte scale at $1000
per TB per year
• Integrates with existing data warehouse
tools
• Scales through columnar storage and
parallel query execution
• Data load directly from S3
• Integration with Amazon EMR
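Two points on this slide lend themselves to a short worked example: the quoted price of $1000 per TB per year, and loading data directly from S3. The sketch below assumes 1 PB = 1000 TB and uses Redshift's `COPY` command with an IAM role; the table name, S3 prefix, role ARN, and file format are hypothetical placeholders.

```python
def redshift_yearly_cost_usd(terabytes: float,
                             usd_per_tb_year: float = 1000.0) -> float:
    """Yearly cost at the per-TB price quoted on the slide."""
    return terabytes * usd_per_tb_year


def copy_from_s3_sql(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads gzipped CSV from S3.

    The CSV/GZIP format is an assumption for illustration.
    """
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV GZIP;"
    )


# One petabyte (1000 TB) at $1000 per TB per year.
print(redshift_yearly_cost_usd(1000))  # 1000000.0

# Hypothetical table, bucket and role names.
print(copy_from_s3_sql(
    "impressions",
    "s3://example-bucket/impressions/2015/03/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
))
```

At the quoted price, a full petabyte works out to roughly one million dollars per year.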
16. Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3
– Parallel load from Amazon DynamoDB,
Amazon EMR, Amazon S3, HDFS/SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
Diagram labels: 10 GigE (HPC); Ingestion; Backup; Restore; JDBC/ODBC
17. Amazon Redshift Security
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM/CloudHSM
• No direct access to compute nodes
• Amazon VPC support
Diagram labels: 10 GigE (HPC); Ingestion; Backup; Restore; Customer VPC; Internal; Security Group; JDBC/ODBC
19. Agenda
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
20. 2014 Usage Statistics for Qubole on AWS:
• Total QCUH processed in 2014 = 40.6 million
• Total nodes managed in 2014 = 2.5 million
• Total PB processed in 2014 = 519
Users: Operations Analyst, Marketing Ops Analyst, Data Architects, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers
Platform layers: Developer Tools, Service Management, Data Workbench, Cloud Data Platform, BI & DW Systems
• SDK
• API
• Analysis
• Security
• Job Scheduler
• Data Governance
• Analytics templates
• Monitoring
• Support
• Collaboration
• Workflow & Map/Reduce
• Auto Scaling
• Cloud Optimization
• Data Connectors
Hadoop Ecosystem (Apache Open Source)
• YARN
• Presto & Hive
• Spark & Pig
25. Agenda Slide
• AWS: Big Data, Technologies & Techniques for working
productively with Data at any scale
• Qubole: Big Data Delivered as a Service
• DataXu: Leveraging Big Data to Understand & Engage Customers
26. DataXu Introduction
Disruptive on-demand software platform relied upon by the world’s
leading brands
A petabyte scale marketing cloud that enables Fortune 500 brands to
manage data, insight and action to maximize Marketing ROI
The industry’s #1 rated programmatic marketing technology
spun out of MIT by the founders
One of the fastest growing companies in the Inc. 500
27. DataXu Quick Statistics
Big data + Real time decisions
Big Data Processing
• 13 petabytes of data
• 20 terabytes/day consumer data intake
Real-Time Decisioning
• 42 billion decisions per second
• 1,500,000 inbound queries per second
• Dozens of algorithms across mobile, social, native, display, video and TV
Predictive Modeling
• Executing 10,000+ investments simultaneously
• 10M variables considered per investment decision using next gen machine learning
Enterprise-Cloud Infrastructure
• 14 data centers
• 35,000+ CPU cores
Patent portfolio for real-time decision systems
Exclusive license from MIT to Algebra Of Systems IPR
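To put the per-second and per-day figures on this slide on a common scale, the short calculation below converts 1,500,000 inbound queries per second and 20 terabytes/day of intake into daily and per-second equivalents. It assumes decimal units (1 TB = 10^12 bytes) and a constant rate over the day; the bytes-per-query figure is only a rough ratio of the two slide numbers, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

queries_per_second = 1_500_000          # from the slide
intake_bytes_per_day = 20 * 10**12      # 20 TB/day, decimal terabytes

# Queries handled in one day at a constant rate.
queries_per_day = queries_per_second * SECONDS_PER_DAY

# Average intake rate in megabytes per second.
intake_mb_per_second = intake_bytes_per_day / SECONDS_PER_DAY / 10**6

# Rough ratio of collected bytes to inbound queries.
bytes_per_query = intake_bytes_per_day / queries_per_day

print(f"{queries_per_day / 1e9:.1f} billion queries/day")   # 129.6 billion queries/day
print(f"{intake_mb_per_second:.1f} MB/s average intake")    # 231.5 MB/s average intake
print(f"{bytes_per_query:.0f} bytes collected per query")   # 154 bytes collected per query
```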
28. Programmatic buying exploits real time signals to drive greater ROI.
• Analyze: Analyze the attributes available at bidding time (Context, Geo, O.S., Time, Demo, Etc.)
• Appraise: Assess the value of each impression to determine a bid price and the creative to serve
• Optimize: Learn from served impressions to adjust future bidding and creative delivery
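The analyze, appraise, optimize cycle on this slide can be sketched as a toy bidder. This is not DataXu's model: the feature choice (context and geo), the smoothed conversion-rate estimate, and all parameter values are assumptions made only to show how the three steps feed each other.

```python
from collections import defaultdict
from typing import Dict, Tuple

Features = Tuple[str, str]  # hypothetical feature key: (context, geo)


class SimpleBidder:
    """Toy analyze/appraise/optimize loop (illustrative only)."""

    def __init__(self, value_per_conversion: float,
                 prior_rate: float = 0.01,
                 prior_weight: float = 100.0) -> None:
        self.value_per_conversion = value_per_conversion
        self.prior_rate = prior_rate      # assumed baseline conversion rate
        self.prior_weight = prior_weight  # pseudo-impressions backing the prior
        self.served: Dict[Features, int] = defaultdict(int)
        self.converted: Dict[Features, int] = defaultdict(int)

    def analyze(self, request: Dict[str, str]) -> Features:
        """Extract the attributes available at bidding time."""
        return (request.get("context", "unknown"),
                request.get("geo", "unknown"))

    def appraise(self, features: Features) -> float:
        """Bid the expected value: smoothed conversion rate times value."""
        served = self.served[features]
        converted = self.converted[features]
        rate = ((converted + self.prior_rate * self.prior_weight)
                / (served + self.prior_weight))
        return rate * self.value_per_conversion

    def optimize(self, features: Features, did_convert: bool) -> None:
        """Learn from a served impression to adjust future bids."""
        self.served[features] += 1
        if did_convert:
            self.converted[features] += 1


bidder = SimpleBidder(value_per_conversion=5.0)
features = bidder.analyze({"context": "sports", "geo": "US"})
bid_before = bidder.appraise(features)

# Feed back 100 served impressions, 10 of which converted.
for i in range(100):
    bidder.optimize(features, did_convert=(i < 10))
bid_after = bidder.appraise(features)
```

With these assumed numbers the bid for the segment rises after feedback, because the observed conversion rate is higher than the prior.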
29. DataXu in the Cloud
• On-premise and Cloud
• Why Cloud/AWS
– Automation, API driven
– All Data in One Place
– Improved Testability
– Deep Security
– Breadth and Depth of Services
– Costs, Pay As You Go
– Auto Scaling (Scalability, Elasticity)
– Disaster Recovery and Business Continuity
30. DataXu Data Flows in AWS
Stages: Producers, Continuous Processing, Storage, Analytics
Components: CDN, Real Time Bidding, Retargeting Platform, Qubole, Kinesis, S3, Redshift, Machine Learning, Streaming Data Collection
Users: Analysts, Data Scientists, Engineers
31. Why Qubole
Managed Service
• Auto Scaling
• Spot Pricing
• No Opex
• Redundant Clusters
• Data Security
Single Unified Interface
• Rich Unified Experience
• Data Discovery tool
• Query Templates
• Administration and Monitoring
Performance Optimizations
• Overall better performance than other
Hadoop clusters in the cloud
Automation
• Workflow, Scheduler
• SDK
Support
• 24x7 support with deep expertise
33. Qubole Cluster Best Practices
• Use VPC, pick AZs appropriately to match reservations
• Use hybrid spot pricing strategy
• Use tags for better reporting
• Seek Qubole help for cluster tuning
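A hybrid spot pricing strategy can be sketched as a simple node-mix calculation: keep a floor of on-demand nodes and fill a fraction of the remainder with Spot instances. The function below is a hypothetical helper, not a Qubole API; the parameter names, the 75% Spot fraction, and the tag keys are assumptions for illustration.

```python
import math
from typing import Any, Dict, Optional


def plan_hybrid_cluster(total_nodes: int,
                        min_on_demand: int,
                        spot_fraction: float,
                        tags: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Split a cluster between on-demand and Spot nodes.

    min_on_demand nodes are always on-demand; spot_fraction of the
    remaining nodes are requested as Spot, rounded down.
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    if min_on_demand < 0 or total_nodes < min_on_demand:
        raise ValueError("total_nodes must be at least min_on_demand")
    spot = math.floor((total_nodes - min_on_demand) * spot_fraction)
    return {
        "on_demand": total_nodes - spot,
        "spot": spot,
        # Tags make it possible to break down cost reports per cluster.
        "tags": dict(tags or {}),
    }


plan = plan_hybrid_cluster(20, 4, 0.75, {"team": "analytics", "env": "prod"})
print(plan["on_demand"], plan["spot"])  # 8 12
```

The on-demand floor bounds how much capacity can disappear when Spot instances are reclaimed.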
34. Data Security & Privacy
• AWS offers comprehensive data security
• Security & Privacy
– VPC
– IAM Policies, Users, Roles
– S3 Buckets, Bucket Policies & HTTPS
– Security Groups, Whitelist IP CIDR
– Key Management Service & CloudHSM
– Server Side and Client Side Encryption
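As one concrete instance of the "Bucket Policies & HTTPS" item, the sketch below builds an S3 bucket policy that denies any request not made over HTTPS, using the `aws:SecureTransport` condition key. The bucket name is a placeholder, and this is a minimal example rather than DataXu's actual policy.

```python
import json


def https_only_bucket_policy(bucket: str) -> str:
    """Return a JSON bucket policy that rejects non-HTTPS requests."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:SecureTransport is "false" for plain HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


policy_json = https_only_bucket_policy("example-bucket")
print(policy_json)
```

An explicit Deny takes precedence over any Allow in IAM policy evaluation, so the statement applies regardless of other grants on the bucket.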
35. Right tool for right workload
Use Case / Technology
• Large scale ETL
• Interactive Discovery Queries
• Machine Learning/Real time queries
• High Performance DW Queries/Reporting backend
Use Case / Technology
1.5 million ad requests per sec
Billions of impressions per month, Petabytes of data
~10ms round trip average response time, 100ms max
Serving in 50+ countries around the world
Over 20 TB data collected per day
Integrated with over 30 exchanges around the world
No HDFS, there is no reliable way to auto scale
Pretty innovative, using spot
Qubole has put thought into cost effectiveness
Spot pricing and auto scaling
Talk about auto scaling – cost optimization
HDFS – does not make sense in Qubole, don’t rely on,