Data drift, the gradual morphing of data structure and semantics, is a fact of life in enterprise IT. New requirements force schema changes, the meaning of database columns changes over time, and infrastructure upgrades add new fields to log files. Left unchecked, drift in data sources can cause applications and dataflows to fail, with costly downtime and, in the worst case, corruption in downstream data stores.
Cox Automotive comprises more than 25 companies dealing with different aspects of the car ownership lifecycle, with data as the common language they all share. The challenge for Cox was to create an efficient engine for the timely and trustworthy ingest of an unknown but large number of data assets from practically any source. Discover how their big data engineering team overcame data drift and is now populating a data lake, allowing analysts easy access to data from their subsidiary companies and producing new data assets unique to the industry.
2. Speakers
Nathan Swetye
Sr. Manager of Platform Engineering
Cox Automotive
Michael Gay
Lead Technical Architect
Cox Automotive
Pat Patterson
Community Champion
StreamSets
3. Cox Automotive
25 (and growing) companies dealing with the automotive space
Spans the full vehicle ownership lifecycle
Data perceived as the integration point for all companies
4. Enterprise Data DNA
Commercial Customers Across Verticals
150,000 downloads
40 of the Fortune 100
Doubling each quarter
Strong Partner Ecosystem Open Source Success
Mission: empower enterprises to harness their data in motion.
StreamSets Overview
5. StreamSets Data Collector™
Instrumented, open source UI and engine to build any-to-any dataflows.
StreamSets Dataflow Performance Manager (DPM™)
Cloud Service to map, measure and master dataflow operations.
DATAFLOW LIFECYCLE
Developers
Scientists
Architects
StreamSets Enterprise
EVOLVE (Proactive)
REMEDIATE (Reactive)
DEVELOP OPERATE
Operators
Stewards
Architects
6. EFFICIENCY
Intent Driven Flows
Batch & Streaming Ingest
In-stream Sanitization
CONTROL
Fine-grained Stage & Flow Metrics
Drift Handling
Lineage and Impact Analysis Capture
AGILITY
Flexible deployment
Exception Handling
Seamless Evolution
StreamSets Data Collector is a complete IDE for building and executing any-to-any ingest pipelines.
StreamSets Data Collector
7. StreamSets DPM provides a single pane of glass to map, measure and master your dataflow operations.
MASTER
Availability & Accuracy
Proactive Remediation
MEASURE
Any Path
Any Time
MAP
Dataflow Lineage
Live Data Architecture
StreamSets Dataflow Performance Manager (DPM)
8. Data Drift
Change is the New Normal
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Structure Drift
Semantic Drift
Infrastructure Drift
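As a rough illustration of structure drift, the sketch below compares the fields of an incoming record with the fields a pipeline expects. The field names and the detect_structure_drift helper are hypothetical and are not part of any StreamSets API; it only shows the idea of noticing added or missing fields.

```python
from typing import Any, Dict, Set


def detect_structure_drift(expected_fields: Set[str],
                           record: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Compare a record's field names with the fields the pipeline expects."""
    actual_fields = set(record)
    return {
        "added": actual_fields - expected_fields,    # fields the source started sending
        "missing": expected_fields - actual_fields,  # fields the source stopped sending
    }


# Hypothetical log record after an infrastructure upgrade added a field.
expected = {"timestamp", "client_ip", "status"}
record = {
    "timestamp": "2016-06-01T12:00:00Z",
    "client_ip": "10.0.0.1",
    "status": 200,
    "tls_version": "1.2",
}

drift = detect_structure_drift(expected, record)
print(drift)
```

A pipeline could use a result like this to route the record to an error stream or to extend the target schema, rather than failing outright.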
9. SQL on Hadoop (Hive) Y/Y Click Through Rate
80% of analyst time is spent preparing and validating data, while the remaining 20% is actual data analysis
Example: Data Loss and Corrosion
10. Data Drift and Scale
At the micro level, data drift leads to breakage and errors
At the macro level, data drift brings your system to a grinding halt!
11. The Problem of Data Exchange at Scale
Everyone wants each other's data, but it is often difficult to acquire
A tangled mess of data flow
A source of anguish and sorrow
12. The Problem of Data Exchange at Scale
Enter the Data Lake
The central store for valuable data
Mission: Data Lake, not Data Swamp
Data Lake
13. Great. A Data Lake. But how do you Populate it?
Problem: $$ Cost – a Question of Scale
• 25 Companies
• 9+ Source Types, mostly DBs
• 1-Many Schemas per Database
• Many Tables per Schema
Example:
• AutoTrader -> Oracle -> ATM1: ~1600 Tables
14. Great. A Data Lake. But how do you Populate it?
We’ve ingested about that much
18. Cox Automotive’s StreamSets Architecture
Sources: Databases, Amazon S3, Files, FTP
Acquisition: StreamSets (multiple instances)
Ingestion: StreamSets (multiple instances)
Targets: Hadoop Filesystem, Big Data SQL, Amazon S3
Data Pipelines
Separates Acquisition from Ingestion
Dynamic Error Handling
Encrypted Data in Transit
Data standards applied automatically:
• Compression
• File Formats
• Partitioning Schemes
• Row-level Watermarks
• Time-stamping
Ingestion farm scales with demand
Auto-creates schemas en route
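A minimal sketch of how standards such as these might be applied to each batch. Everything here is an assumption for illustration: the _watermark and _ingested_at field names, the year/month/day partition layout, the "listings" table name, and the choice of gzip-compressed newline-delimited JSON are hypothetical, not Cox Automotive's actual conventions or a StreamSets API.

```python
import gzip
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def apply_standards(record: Dict[str, Any], source: str,
                    batch_id: str, now: datetime) -> Dict[str, Any]:
    """Return a copy of the record with a row-level watermark and a timestamp."""
    stamped = dict(record)
    stamped["_watermark"] = f"{source}:{batch_id}"  # which source and batch produced the row
    stamped["_ingested_at"] = now.isoformat()        # when the row was ingested
    return stamped


def partition_path(source: str, table: str, now: datetime) -> str:
    """Build a date-based partition directory (the layout is an assumption)."""
    return f"{source}/{table}/year={now:%Y}/month={now:%m}/day={now:%d}"


def encode_batch(records: List[Dict[str, Any]]) -> bytes:
    """Serialize records as newline-delimited JSON and gzip-compress them."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return gzip.compress(payload.encode("utf-8"))


now = datetime(2016, 6, 28, 9, 30, tzinfo=timezone.utc)
rows = [apply_standards({"vin": "EXAMPLEVIN0000001"}, "autotrader", "batch-0001", now)]
path = partition_path("autotrader", "listings", now)
blob = encode_batch(rows)
print(path, len(blob))
```

Applying the standards in one shared step, rather than per pipeline, is what lets every source land in the lake with the same layout.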
Data comes from a variety of sources
Pipelines are established for each source
Ingestion Back Pressure
Scaling, Secure, load-balanced
Actual ingestion activities
On-premises and Cloud Big Data Systems
Diagram labels: StreamSets RPC, StreamSets (multiple instances), LoadBalancer
25. Where do we go from Here?
• Amazon Web Services
• StreamSets Dataflow Performance Manager
• Acquire/Ingest decision point: Centralized, Federated, or Democratized?
• Quality
• Streamline access to sources
• Change data capture
• Integration with enterprise data catalogs
• Ingestion post-processing
StreamSets was founded in 2015 to address the pain of building and operating data movement.
We were founded by former leaders at Informatica and Cloudera and have key talent with experience at big data open source vendors as well as leading-edge practitioners (Square, FCBK).
Our founders recognized that big data fundamentally broke traditional data integration systems, and that the low-level frameworks that were being used instead, like Sqoop and Flume, were brittle and opaque.
We have seen tremendous open source adoption with over 150,000 downloads in our 15 months since launch. Our solution is general purpose, and we have commercial customers across many industries using us for a broad range of projects, from data warehouse replatforming to specific applications in areas such as IoT, cybersecurity and website personalization.
We look at dataflows as having a life-cycle. First you develop your logic and place it into operation. Over time you encounter problems that need to be remediated, and you evolve your dataflows to take advantage of new functionality, say Spark machine learning, or to support new business needs.
Our StreamSets Enterprise service spans the full life-cycle through two products. First there is StreamSets Data Collector, which is open source data ingestion software. You use it to develop, test and run individual pipelines.
Then for managing complex ingestion projects, we have StreamSets Dataflow Performance Manager. It acts as a single point of management across dozens or hundreds of data collectors.
StreamSets Data Collector – open source software for building and running data pipelines that accelerates time to analysis
Efficiency – a visual IDE to easily connect sources to destinations: light on schema specification, batch & streaming, on edge and in cluster
Agility – data exception handling, dataflow evolution, data infrastructure modernization with minimal downtime
Control – built-in data cleansing plus the ability to adapt to data drift improves downstream data quality; monitor and alert on data KPIs in real time.
Think of DPM as your comprehensive control panel - an operational console for all of your data movement. Your unit of measure here is not a pipeline but a dataflow topology that includes all of the interconnected pipelines that feed back to an application or support a business process. Of course you can also drill down to the individual pipelines within a topology.
We talk about DPM in terms of 3 Ms, map, measure and master your dataflow operations.
Map - you have a live and self-updating map of your dataflow topologies; you can manage releases and track changes in topologies over time.
Measure - Measure and establish baselines for end-to-end and point-in-flow KPIs for data availability and accuracy.
Master - you can master your dataflow operations by creating Data SLAs to detect and remediate violations.
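To make the idea of a Data SLA concrete, here is a small sketch of checking measured KPIs against thresholds. The DataSla class, its fields and the check_sla function are hypothetical names invented for this illustration; they do not describe DPM's actual interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSla:
    name: str
    min_records_per_minute: float  # availability floor
    max_error_ratio: float         # accuracy ceiling


def check_sla(sla: DataSla, records_per_minute: float,
              error_records: int, total_records: int) -> List[str]:
    """Return a description of each SLA violation (empty list if none)."""
    violations = []
    if records_per_minute < sla.min_records_per_minute:
        violations.append(
            f"{sla.name}: throughput {records_per_minute} below {sla.min_records_per_minute}")
    # Avoid dividing by zero when no records were seen in the window.
    error_ratio = error_records / total_records if total_records else 0.0
    if error_ratio > sla.max_error_ratio:
        violations.append(
            f"{sla.name}: error ratio {error_ratio:.3f} above {sla.max_error_ratio}")
    return violations


sla = DataSla("weblogs-to-lake", min_records_per_minute=100, max_error_ratio=0.01)
healthy = check_sla(sla, records_per_minute=250, error_records=2, total_records=1000)
degraded = check_sla(sla, records_per_minute=50, error_records=30, total_records=1000)
print(healthy)
print(degraded)
```

The baselines mentioned under Measure would supply the threshold values; the check itself is just a comparison that can trigger an alert or a remediation step.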
Storing unconstrained and drifting data in Big Data Stores leads to a whole slew of undetected data consistency and correctness problems. This is the ticking time bomb that most enterprises are facing. The Googles, Facebooks and LinkedIns of the world can put armies of people on this. Most enterprises cannot.
Here’s a simple example of a real world customer that Arvind saw at Cloudera...
Log data is put into Hadoop in order to be analyzed with SQL.
Data is coming from a number of different data centers, a few of which upgraded from IPv4 to IPv6.
The manual data ingest process did not take into account the unforeseen IPv6 format for IP addresses. The end result is that the business metric (service request rate) is overstated (a false positive), causing harm to the business.
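One way to guard against this kind of drift is to parse addresses with a library that understands both formats and to set aside anything that does not parse, instead of letting it flow silently into the metric. The sketch below uses Python's standard ipaddress module; the "<client_ip> <rest>" log line layout and the function name are assumptions made for illustration, not the customer's actual format.

```python
import ipaddress
from typing import Iterable, List, Tuple


def split_by_ip_validity(lines: Iterable[str]) -> Tuple[List[dict], List[str]]:
    """Parse '<client_ip> <rest>' lines; send unparseable lines to an error list."""
    good, errors = [], []
    for line in lines:
        parts = line.split(maxsplit=1)
        try:
            ip = ipaddress.ip_address(parts[0])  # accepts both IPv4 and IPv6
        except (ValueError, IndexError):
            errors.append(line)                  # keep the raw line for inspection
            continue
        good.append({
            "client_ip": str(ip),
            "ip_version": ip.version,
            "rest": parts[1] if len(parts) > 1 else "",
        })
    return good, errors


sample = [
    "192.0.2.10 /search",
    "2001:db8::1 /search",
    "not-an-ip /search",
]
good, errors = split_by_ip_validity(sample)
print([g["ip_version"] for g in good])
print(errors)
```

Records in the error list can be counted and alerted on, so a format change shows up as a visible spike in errors rather than as a distorted business metric.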