As the usage of Apache Spark continues to ramp up within the industry, a major challenge has been scaling our development. Too often we find that developers are re-implementing a similar set of cross-cutting concerns, sprinkled with some variance of use-case specific business logic as a concrete Spark App.
3. Agenda
The Why
Background on the problem(s) that drove our need
for Composable Data processing (CDP).
The What
High level walk-through of our CDP design.
The How
Discuss challenges to achieve CDP with
Spark.
The Results
What has been the impact of CDP and where
are we headed.
4. Adobe Experience Platform (AEP) Zen Statement
4
INSIGHTS – ML & QUERY
ACTION
POS
CRM
Product
Usage
Mktg
Automate
IoT
Geo-
Location
Commerce
DATA
Centralize and standardize customer data and content across the
enterprise – powering 360° customer profiles, enabling data science,
and data governance to drive real-time personalized experiences
SEMANTICS & CONTROL
6. Data Landing (aka Siphon)
▪ 1M Batches per Day
▪ 13 Terabytes Per Day
▪ 32 Billion Events Per Day
Customers
Siphon
Data Lake
Solutions
3rd Parties
Producers
▪ Transformation
▪ Validation
▪ Partitioning
▪ Compaction
▪ Writing with
Exactly Once
▪ Lineage Tracking
7. Siphon’s Cross Cutting Features
Producers Data Lake
Queue
Siphon
Siphon
Siphon
Bulkhead1
Siphon
Bulkhead2
Supervisor
Catalog
Streaming
Ingest
Batch
Ingest
8. Siphon’s Data Processing (aka Ingest Pipeline)
Ingest Pipeline
Producers
Data Lake
Siphon
Parse Convert Validate Report
Write
Data
Write
Errors
10. Option A: Path of Least Resistance
Siphon +
Feature X +
Feature Y +
…..
Input Output
▪ Deprioritize Hardening
▪ Overhead due to
Context Switching
▪ Tendency towards
Spaghetti code
▪ Increasingly difficult to
test over time
▪ Increasingly difficult to
maintain over time
11. Option B: “Delegate” the Problem
Siphon
Input Output
Feature X
By Service X
Output X
Feature Y
By Service Y
Output Y
Feature …
…
Output …
▪ Lack of Reuse
▪ Lack of Consistency
▪ Complex to Test E2E
▪ Complex to Monitor
E2E
▪ Complex to maintain
over time
▪ Increased Latency
▪ COGS not tenable
12. Option C: Composable Data Processing
Siphon
Input Output
▪ Scalable Engineering
▪ Modularized Design &
Code
▪ Clear Separation of
Responsibilities
▪ Easier to Test
▪ Easier to Maintain
▪ Maximizes re-use
▪ Minimizes Complexity
▪ Minimize Latency
▪ Minimizes COGs
Feature X
By Team X
Feature Y
By Team Y
Feature …
By Team …
14. Goal
▪ Implement a framework that enables different teams to extend
Siphon’s data ingestion pipeline.
▪ Framework must be:
▪ Efficient
▪ Modular
▪ Pluggable
▪ Composable
▪ Supportable
17. 2. Conversion
Id First Name Last Name birthDate rewardsLe
vel
1 Jared Dunn 1988-11-13 bronze
4 Dinesh 1985-01-05 blah
Id _errors
3 [{"code":"355", "message":"Invalid Date",
"column":"bday"}]
Id Name bday level
1 Jared Dunn 1988-11-13 bronze
3 Monica Hall 1985-02 silver
4 Dinesh 1985-01-05 blah
Input
Pass Fail
18. 3.Validation
Id First Name Last Name birthDate rewardsLe
vel
1 Jared Dunn 1988-11-13 bronze
4 Dinesh 1985-01-05 blah
Input
Id First Name Last Name birthDate rewardsLe
vel
1 Jared Dunn 1988-11-13 bronze
Id _errors
4 [{"code":401","message":"Requied value","column":"lastNa
me"}, {"code":"411","message":"Invalid enum value: blah,
must be one of bronze|silver|gold.", "column":
"rewardsLevel"}]
Pass Fail
19. 4. Persisting the Good
Id First Name Last Name birthDate rewardsLe
vel
1 Jared Dunn 1988-11-13 bronze
Data Lake
20. 5. Quarantining the Bad
Id _errors
2 [{"code":”101","message":”Missing closing
bracket."}]
3 [{"code":"355", "message":"Invalid Date",
"column":"bday"}]
4 [{"code":401","message":"Requied value","colum
n":"lastName"}, {"code":"411","message":"Invalid
enum value: blah, must be one
of bronze|silver|gold.", "column": "rewardsLevel"}]
Id Name bday level
1 Jared Dunn 1988-11-13 bronze
3 Monica Hall 1985-02 silver
4 Dinesh 1985-01-05 blah
Quarantine
Join
Failed Parser Output
21. Weaving It All Together
Plugin Runtime
Parser Converter Validator Data Sink Error Sink
Errors +
Data
Errors +
Data
Errors +
Data Data Errors
Siphon
29. Parsing Errors
Ø Processed only once by SIP at the beginning.
Ø Only applicable for file sources like CSV and JSON
Ø Relies on Spark to capture the parsing errors.
Ø Pass on appropriate read options
Ø CSV
Ø Mode = PERMISSIVE
Ø columnNameOfCorruptRecord = “_corrupt_record”
Ø Parsing error records are captured
Ø By applying predicate on _corrupt_record column.
Ø Good records are passed to plugins for further processing
30. Parsing Errors
{ “name”: ”John”, “age”: 30 }
{ ”name”: ”Mike”, ”age”: 20
JSON: p.json
spark.read.json("p.json").show(false)
CSV: p.csv
name,age
John,30
Mike,20,20
spark.read.schema(csvSchema).options(csvOptions).csv("p.csv").show(false)
No record terminator
Record does not confirm to schema
31. Conversion/Validation errors
Ø SIP invokes the plugins in sequence
Ø Converter Plugin
Ø Both good and error records are collected.
Ø The good records are passed to the next plugin in sequence.
Ø Validate Plugin
Ø Returns the error records.
Ø Process an error record multiple times.
Ø To capture all possible errors for a given record.
Ø Example, both plugin-1 and plugin-2 may find different errors for
one or more columns of same record.
32. Error consolidation (contd ..)
Mapping rule Target_column
first_name || last_name full_name
MAPPINGS
full_name age Row_id
John Vanau 40 1
Michael Shankar -32 3
row_id _errors
2 [[last_name, ERR-100, “Field `last_name` can not be null]]
first_name last_name age Row_id
John Vanau 40 1
Jack NULL 24 2
Michael Shankar -32 3
INPUT_DATA
DATA_WITH_ROW_ID
monotonically_increasing_id()
applying mapping rule
success
error
successful mapping
column_name data_type constraint
full_name String None
age Short age > 0
TARGET _SCHEMA
applying target schema
row_id _errors
3 [[age, ERR-200, “Field `age` cannot be < 0”]]
row_id _errors
3 [[age, ERR-200, “Field `age` cannot be < 0”]]
row_id _errors
2 [[last_name, ERR-100, “Field `last_name` can not be null]]
row_id _errors
2 [[last_name, ERR-100, “Field `last_name` can not be null]]
3 [[age, ERR-200, “Field `age` cannot be < 0”]]
Union
first_name last_nam
e
age _errors
Jack NULL 24 [[last_name, ERR-100, “Field `last_name` can not be
null”]]
Michael Shankar -32 [[age, ERR-200, “Field `age` cannot be < 0”]]
full_name age
John Vanau 40
first_name last_name age Row_id
John Vanau 40 1
Jack NULL 24 2
Michael Shankar -32 3
DATA_WITH_ROW_ID
Join
Anti-Join
final successfinal error
full_name age Row_id
John Vanau 40 1
Michael Shankar -32 3
33. Error Trapping
▪ Most of existing conversion and validations use UDFs.
▪ Nested type conversions use nested UDFs.
▪ Currently not possible to capture errors from nested UDFs.
▪ Custom expression used to trap errors.
▪ Captures input column value, error code and error text in case of error.
▪ Captures output column value upon successful conversion/validation.
Custom Expression
39. Benefits
Cross-Cutting Data Processing
▪ Scalable Engineering
▪ Separation of
Responsibilities
▪ More Readable Code
▪ More Testable Code
▪ Easier to Maintain
▪ More ETL Features
▪ More Validation
Features
▪ More Error Reporting
Features
▪ Minimize Latency (from
10 min to 10 sec)
▪ Re-Use
▪ 50% or More Storage
Savings
▪ 50% or More Compute
Savings
Cross-Cutting
Data Processing