Getting It Right Exactly Once: Principles for Streaming Architectures
Darryl Smith, Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
September 2016 | Strata+Hadoop World, NY
2. Getting Started
I’m Darryl Smith
• Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
Agenda
• Real-Time And The Need For Streaming
• Adding Real-Time And Streaming To The Data Lake
• Results, Plans, Lessons Learned
• Demonstration
3. Trickle, Flood, or Torrent…
Streaming is about continuous data motion, more than speed or volume
5. The Enterprise Reality
Batch > Real-Time > Streaming
Enterprise Opportunities
Immediate Business Advantage
Website and Mobile
Application Logs
Internet of Things
Sensors
6. The Enterprise Streaming Play
Moving from batch to real-time streams avoids surges, normalizes compute, and drives value
8. Analytics Vision
Drive DellEMC towards a Predictive Enterprise via intelligent data driving agility, increasing revenue and productivity, resulting in a competitive advantage
9. Becoming An Analytical Enterprise
DRIVE COMPETITIVE ADVANTAGE
Need to use new data for competitive advantage
• Volume, Variety and Velocity
NEAR REAL-TIME ANALYTICS
Leverage near-real-time and streaming data sets to optimize predictions
• Make faster, better decisions
COST-EFFECTIVELY SCALE
Cost-effectively scale to improve query and load performance
DATA ACCESS BY BUSINESS
Put the data in the hands of the business
10. Scoping The Business Objectives
Problem Statement
Teams do not have access to maintenance renewal quotes in the timeframes or at the level of quality they need for Tech Refresh and Renewal sales.
Desired Outcome
Implement a cost-effective, real-time solution that improves productivity and gives confidence to produce desired outcomes efficiently.
11. Business Drivers
CURRENT REALITY | VISION FOR THE FUTURE
HIGH TOUCH TACTICAL EXECUTION | LOW TOUCH SELF SERVICE
DATE DRIVEN PROCESSES | BUSINESS VALUE DRIVEN PROCESSES
INEFFICIENCIES & LOST PRODUCTIVITY | INCREASED PRODUCTIVITY
SILOED DATA / LIMITED VIEWS | SINGLE VIEW OF DATA / DATA SCORING
VARIABLE DATA QUALITY | DATA QUALITY & CONFIDENCE
TO REALIZE THIS VISION: IMPLEMENT CALM SOLUTION PHASES AND OPTIMIZE BUSINESS PROCESSES
12. The Need for “CALM”
Customer Asset Lifecycle Management
For enterprise sales
Who need accurate and timely customer information
CALM is a real-time application
Providing up-to-the-moment customer 360° dashboards
Install Base
Pricing
Device Config
Contacts
Contracts
Analytics Contracts
Component
Data
Offers
Scorecard
13. Data Lake Architecture
[Architecture diagram] Layers and labels shown:
• Sources, structured: applications (CRM, ERP, PLM)
• Sources, unstructured: Network, Web, Sensor, Supplier, Social Media, Market
• Ingestion: Real-Time, Micro-Batch, Batch
• Data governance: Apache Ranger, Attivio, Collibra
• Data platform: Spring XD, Greenplum DB, Pivotal HD (Hadoop), GemFire, Cassandra, PostgreSQL, MemSQL
• Execution, Process, Analytics Toolbox
• Storage: HDFS on Isilon, Hadoop on ScaleIO
• Infrastructure: VMware vCloud Suite; VCE Vblock/VxRack | XtremIO | Data Domain
14. Business Data Lake Offerings
Data Ingestion
• Small to Big Data (high-throughput)
• Structured and unstructured data from any source
• Streams and batches
• Secure, multi-tenant, configurable framework
Real-Time Analytics
• Tap into streams for in-memory analytics
• Real-time data insights and decisions
Services
• Data Ingestion to Data Lake
• Data Lake APIs
• Data Alerting
16. Seeking A Fast Database
A complement to the business data lake
17. HammerDB Platform Benchmarks
HammerDB workload testing followed the standard practices of EMC’s Oracle and SQL Server DBA teams.
Workload definition: a mix of 5 transactions, as follows:
• New order: receive a new order from a customer: 45%
• Payment: update the customer balance to record a payment: 43%
• Delivery: deliver orders asynchronously: 4%
• Order status: retrieve the status of customer’s most recent order: 4%
• Stock level: return the status of the warehouse’s inventory: 4%
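To make the mix concrete, here is a minimal Python sketch that draws transaction types with the percentages listed above. It is not HammerDB code; HammerDB drives the real workload itself. The names `WORKLOAD_MIX` and `sample_transactions` are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter

# Transaction mix from the slide, as percentages (they sum to 100).
WORKLOAD_MIX = {
    "new_order": 45,
    "payment": 43,
    "delivery": 4,
    "order_status": 4,
    "stock_level": 4,
}


def sample_transactions(n, seed=0):
    """Draw n transaction types according to WORKLOAD_MIX (hypothetical helper)."""
    rng = random.Random(seed)
    names = list(WORKLOAD_MIX)
    weights = list(WORKLOAD_MIX.values())
    return rng.choices(names, weights=weights, k=n)


if __name__ == "__main__":
    total = 100_000
    counts = Counter(sample_transactions(total))
    for name in WORKLOAD_MIX:
        print(f"{name:12s} {100 * counts[name] / total:5.1f}%")
```

With a large enough sample, the observed shares settle close to the configured percentages, which is a quick sanity check that a driver is issuing the intended mix.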
Testing scenario:
• 100 warehouses, 8 vUsers. Database creation and initial data loading.
• Timed testing. 20 minutes per testing session.
• Scaled the number of virtual users for each testing session from 1 to 44.
No changes were made to the system or database configuration while running the tests.
18. HammerDB Workload Testing
Each test was 16 vCPU x 32 GB RAM
• RedHat 6.4 with Oracle 11g R2
• Windows Core 2012 R2 with SQL Server 2012 Ent Ed.
• RedHat 6.4 with PostgreSQL 9.3.3
20. PostgreSQL vs In-Memory DB
We picked the top 5 queries run by different business functions.
Presented here are the 3 queries whose response times did not meet the SLA.
Query | PostgreSQL | MemSQL
Opportunity (5K) | 5 seconds | 200 ms
Sales Order (170K) | 1-1.5 minutes | 6 seconds
Territory (60K) | 60 seconds | 5 seconds
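The response times in the table translate into speedup factors with simple division. The sketch below works that arithmetic through in Python, using only the figures from the slide. The `TIMINGS` structure and `speedup_range` helper are hypothetical names for this illustration, and the Sales Order entry is kept as a range because the slide reports 1-1.5 minutes.

```python
# Response times from the slide, converted to seconds.
# Each entry: (PostgreSQL low, PostgreSQL high), MemSQL.
TIMINGS = {
    "Opportunity (5K)": ((5.0, 5.0), 0.2),
    "Sales Order (170K)": ((60.0, 90.0), 6.0),
    "Territory (60K)": ((60.0, 60.0), 5.0),
}


def speedup_range(postgres_range, memsql_seconds):
    """Return (low, high) speedup of MemSQL over PostgreSQL."""
    low, high = postgres_range
    return low / memsql_seconds, high / memsql_seconds


if __name__ == "__main__":
    for query, (pg_range, memsql) in TIMINGS.items():
        low, high = speedup_range(pg_range, memsql)
        label = f"{low:.0f}x" if low == high else f"{low:.0f}x to {high:.0f}x"
        print(f"{query:20s} {label}")
```

On these figures, the three queries come out roughly 25x, 10x to 15x, and 12x faster on MemSQL.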
21. Business Data Lake – Ingestion to Fulfillment
[Data flow diagram] Components shown:
• Ingest Manager: Spring XD, Spark, Sqoop
• Hadoop: raw data
• Greenplum Database: analytical data
• Data stages: raw data, processed data, summary data
• Execution Tier: Cassandra, GemFire, MemSQL, PostgreSQL
• Real-Time Tap
• Data Governor
• Consumers: predictive/prescriptive analytics
22. Here Are The Data Flows We Built
Low Velocity
Batch
Real-Time
23. Data Flow Patterns – Low Velocity
[Data flow diagram] Labels shown:
• Raw Data, One-Time, Ingestion
• Analytical [BATCH]: Greenplum Database, Pivotal HD
• Presentation [SPEED/SERVING]: PostgreSQL, MemSQL, Cassandra, GemFire
• Data Service, JDBC, Application
25. Data Flow Patterns – Real Time
[Data flow diagram] Labels shown:
• Real-time, Initial Load, Ingestion
• Analytical [BATCH]: Greenplum Database, Pivotal HD
• Presentation [SPEED/SERVING]: PostgreSQL, MemSQL, Cassandra, GemFire
• Data Service, JDBC, Application
26. Nothing Closer To Real Time Than Streaming
Let’s look at the leading edge
Apache Kafka
Messaging Semantics
• At most once
• At least once
• Exactly once
30. Understanding Streaming Semantics
At most once | At least once | Exactly once
Message pulled once | Message pulled one or more times; processed each time | Message pulled one or more times; processed once
May or may not be received | Receipt guaranteed | Receipt guaranteed
No duplicates | Likely duplicates | No duplicates
Possible missing data | No missing data | No missing data
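The three semantics differ in when the consumer records its position relative to when it writes the result. The Python sketch below simulates that ordering with an in-memory list standing in for a topic and a simulated crash striking between the two steps. It does not use Kafka or any real client library; `consume`, `crash_rate`, and the mode names are hypothetical and chosen for this illustration only.

```python
import random


def consume(messages, mode, crash_rate=0.3, seed=1):
    """Deliver messages to a sink under one delivery mode, with simulated crashes.

    A crash strikes between the two steps of handling a message:
    writing it to the sink and committing the consumer offset.
    """
    rng = random.Random(seed)
    sink = []       # what the downstream store ends up holding
    committed = 0   # offset of the next message to read after a restart
    while committed < len(messages):
        msg = messages[committed]
        crash = rng.random() < crash_rate
        if mode == "at_most_once":
            committed += 1        # commit first...
            if crash:
                continue          # ...crash before the write: message lost
            sink.append(msg)
        elif mode == "at_least_once":
            sink.append(msg)      # write first...
            if crash:
                continue          # ...crash before the commit: message re-read
            committed += 1
        elif mode == "exactly_once":
            if crash:
                continue          # write and commit roll back together
            sink.append(msg)
            committed += 1
        else:
            raise ValueError(f"unknown mode: {mode}")
    return sink


if __name__ == "__main__":
    source = list(range(1000))
    for mode in ("at_most_once", "at_least_once", "exactly_once"):
        result = consume(source, mode)
        print(mode, "delivered:", len(result), "distinct:", len(set(result)))
```

In this model, at-most-once ends with fewer rows than were sent, at-least-once ends with more, and exactly-once ends with the original sequence, because the write and the offset commit succeed or fail as a unit.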
31. Rendering In Real Time
Picking the right business intelligence layer
• Tableau
• Custom Application (CF, D3, Docker)
• Additional Third Party Solutions
33. Business Benefits
DATA QUERYING: Down from 4 hours per quarter to less than 1 minute per year
SIMPLIFIED PROVISIONING: Reduced number of tables/reports required
DATA GOVERNANCE: Provides one version of the truth
TIME TO MARKET: Reduced number of tables/reports required
TOOL AGNOSTIC: Business logic in the DB, not the tool, provides increased flexibility
34. Use Case: Customer Account Profile
Streamlined analytics environment to gain a holistic customer view
[Diagram] Labels shown:
• Data sources: Service Request, Contracts, Installed Base, Bookings, Billings, Prof Services
• EMC Data Lake
• BDL Services, Data Workspaces, Data Ingestion
• 23 business-managed workspaces
35. Customer Asset Lifecycle Management
Platform Roadmap
• Phase 1: Foundational Capabilities/Discovery
• Phase 2: Scale Platform / Automate
• Future Phases: Global standard tool integrations, advanced analytics
[Timeline diagram] Milestones shown: Aug 2015, Oct 2015, 2016, TBD
Labels shown: BDL Platform; Enablement; Collaboration; Acceleration; BAaaS/Tableau; Scalable Platform; Integrated Platform; GBS Renewals; Inside Sales; Additional business groups; In-Memory Capabilities (POC); We are here
36. Business Data Lake Plans
Data Services Roadmap
Security
Planned integration into custom BDL security API for managing Role Based Access Control (RBAC) to the underlying data
37. Lessons Learned – Key Takeaways
EDUCATE
• Educate the business
• Use examples of business impact
ASSESS
• Assess in-house big data skills
• Ensure plan to support the organization for 3-5 years
INFRASTRUCTURE
• Choose the best possible infrastructure
• Make sure your Big Data technology platform can evolve
JOURNEY
• Remember it is a journey
• Look for small wins as well as big wins
38. Lessons Learned: Analytics and Data
Sourcing the right skills, working with a different philosophy, and some new tools will help you meet your analytical goals
TRANSFORM YOUR PEOPLE
• Data science in the organization, IT or both?
• Helping business units take initiative
CHANGE YOUR PROCESSES
• New philosophy to running analytics projects
• How and when to share data
ADAPT YOUR TECHNOLOGY
• Steadily refine toolsets based on needed analysis
• Identify to infrastructure layers
40. Demo Agenda
Showcase exactly-once semantics from Kafka
1: Data set of 200,000 transactions summing to zero
2: CREATE TABLE and CREATE PIPELINE
3: Push to Kafka and confirm exactly-once
4: Validate resiliency and confirm exactly-once
41. Step 1: Data Source
Start with a data set of 200,000 transactions representing money/goods that sum to zero
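One simple way to build such a data set is to pair every debit with an equal credit. The Python sketch below does that; it is an assumption about how a zero-sum set could be generated, not the generator used in the demo. The function name `make_zero_sum_transactions`, the amount range, and the use of integer cents are all hypothetical choices for this illustration.

```python
import random


def make_zero_sum_transactions(n=200_000, seed=42):
    """Return n integer amounts (for example, cents) that sum to exactly zero.

    Each debit is paired with an equal credit, so n must be even.
    Integers are used so the sum is exact, with no floating-point drift.
    """
    if n % 2:
        raise ValueError("n must be even so every debit has a matching credit")
    rng = random.Random(seed)
    amounts = []
    for _ in range(n // 2):
        value = rng.randint(1, 1_000_000)
        amounts.extend((value, -value))
    rng.shuffle(amounts)  # interleave so matching pairs are not adjacent
    return amounts


if __name__ == "__main__":
    data = make_zero_sum_transactions()
    print("rows:", len(data), "sum:", sum(data))
```

The zero sum makes the set self-checking: once every row has been loaded exactly once, the total downstream must be zero and the row count must equal the number of rows sent.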
45. Step 3: Push to Kafka
Push that data set to Kafka
Validate exactly-once delivery by querying MemSQL
• show tables;
• show pipelines;
• select sum(amount) from transactions;
Should be 0 in the demo
• select count(*) from transactions;
Should be 200,000 in the demo
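The two checks above can be wrapped in a small validation routine. The Python sketch below runs the same sum and count queries against an in-memory SQLite database, which is used here only as a stand-in so the example runs without a MemSQL cluster. The table layout, `build_demo_table`, and `validate` are hypothetical names for this illustration.

```python
import sqlite3


def build_demo_table(n=200_000):
    """Create an in-memory table of n amounts that sum to zero (stand-in data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount INTEGER)")
    # Rows alternate +k, -k so the total is zero when n is even.
    rows = [(i, (i // 2 + 1) * (1 if i % 2 == 0 else -1)) for i in range(n)]
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    return conn


def validate(conn, expected_count=200_000):
    """Run the two demo checks: the sum must be 0 and the count must match."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions"
    ).fetchone()
    (count,) = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()
    return {"sum": total, "count": count, "ok": total == 0 and count == expected_count}


if __name__ == "__main__":
    print(validate(build_demo_table()))
```

The count is the stronger check, since any lost or duplicated row changes it. The sum is a cross-check that the rows which did arrive are the right ones.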
47. Step 4: Resiliency
Induce failures to show resiliency during exactly-once workflows
a. randomly_fail_batches.py
b. Restart Kafka and show error count
c. Continue and validate exactly-once semantics
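The idea behind this step can be sketched without a cluster: if each batch of rows and the new read offset are written in one database transaction, a batch that fails midway leaves nothing behind and is simply retried. The Python sketch below illustrates that pattern. It is not the `randomly_fail_batches.py` script from the demo, whose contents are not shown in the slides. SQLite stands in for MemSQL, a Python list stands in for the Kafka topic, and `load_with_failures`, `fail_rate`, and the `offsets` table are hypothetical.

```python
import random
import sqlite3


def load_with_failures(amounts, batch_size=1000, fail_rate=0.2, seed=7):
    """Load amounts in batches, randomly aborting some batches midway.

    Rows and the new offset are written in one transaction, so an aborted
    batch leaves neither behind. Returns (connection, failed_attempts).
    """
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (amount INTEGER)")
    conn.execute("CREATE TABLE offsets (id INTEGER PRIMARY KEY, next_offset INTEGER)")
    conn.execute("INSERT INTO offsets VALUES (1, 0)")
    conn.commit()
    failures = 0
    while True:
        (offset,) = conn.execute(
            "SELECT next_offset FROM offsets WHERE id = 1"
        ).fetchone()
        if offset >= len(amounts):
            return conn, failures
        batch = amounts[offset:offset + batch_size]
        try:
            with conn:  # commits on success, rolls back on exception
                conn.executemany(
                    "INSERT INTO transactions VALUES (?)", [(a,) for a in batch]
                )
                if rng.random() < fail_rate:
                    raise RuntimeError("simulated failure mid-batch")
                conn.execute(
                    "UPDATE offsets SET next_offset = ? WHERE id = 1",
                    (offset + len(batch),),
                )
        except RuntimeError:
            failures += 1  # this attempt's rows were rolled back; retry the batch


if __name__ == "__main__":
    data = [1, -1] * 100_000  # 200,000 amounts summing to zero
    conn, failures = load_with_failures(data)
    total, count = conn.execute(
        "SELECT SUM(amount), COUNT(*) FROM transactions"
    ).fetchone()
    print("failed attempts:", failures, "sum:", total, "count:", count)
```

Even with a nonzero number of failed attempts, the final sum and count match the source data, which is the property the resiliency step of the demo sets out to confirm.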