SDM is a distributed, reliable, and highly available data lake ingestion framework that provides data processing, archival, and reconciliation, together with effective change-based history management for batch and streaming data. It is metadata-driven and provides automated schema evolution. The SDM platform is built entirely on open-source software, making it both extensible and robust. Data management, schema evolution, and archival are achieved through Apache NiFi's built-in capabilities and extensions via custom processors and controller services. The end-of-day construct is generated through an Apache Spark job.
Types of Data:
1. Batch
a. Full dump
b. Incremental
c. Hybrid (Daily incremental + Weekly/Monthly full dump)
2. Near real-time
a. CDC-Kafka
b. JMS-Kafka
3. Extractions
a. Incremental, based on a Change Data Capture tool (IBM InfoSphere CDC)
b. Sqoop
c. JDBC/ODBC
4. Manual File Upload
a. Excel
Types of Process:
1. File validation
a. File integrity (header, trailer, data checksum)
b. File de-duplication
c. New-line and non-printable control character handling
2. Structural validation (Row validation)
a. Fixed width
b. Delimited
c. XML
d. JSON
e. Excel (Single/Multi tab)
f. Datatype validation
g. Constraint validation – null, primary key, and full-row de-duplication
3. Defaulting
a. Condition based
b. Special data-type handling (mainframe systems)
4. Operational assurance
a. Row count logging
b. Reconciliation with source
c. File/Record rejections with reasons
5. Lineage tracking
a. A row-id is generated for every record and referenced against the source file all the way to the processed layer (see the sketch below).
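As an illustration of the row-id lineage above, here is a minimal Spark sketch in Scala. The paths and column names ("src_file", "row_id") are assumptions for illustration, not the framework's actual names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, monotonically_increasing_id}

object RowIdLineage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-id-lineage").getOrCreate()

    // Tag each record with its source file and a unique row-id so it can be
    // traced from the raw file all the way to the processed layer.
    val staged = spark.read.option("header", "true").csv("/lake/raw/accounts/") // hypothetical path
      .withColumn("src_file", input_file_name())
      .withColumn("row_id", monotonically_increasing_id())

    staged.write.format("avro").save("/lake/staged/accounts/") // staged layer is Avro
  }
}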
Storage formats:
1. Raw – Archival
2. Avro – Staged
3. ORC – Processed (see the sketch below)
Benefits
1. Metadata-driven
2. Extensible
3. Scalable
4. Flexible
Plans
Current State:
1. Custom-built ingestion framework leveraging standard open-source software from Apache.
2. Data from 100+ source systems is ingested into the Hadoop data lake using the framework.
Plan:
1. Open-sourcing the framework for general consumption.
2. A metadata management UI/API that would serve as a glossary of the data available in the data lake, with search capabilities.
3. Operational and Exception reporting.
4. Centralized data retention within the framework.
5. Health monitoring and alerting.
6. Provenance data maintenance in Atlas.
Speaker
Arun Manivannan, Senior Data Engineer, Standard Chartered Bank
3. Data in SCB context
» Hundreds of applications (~180 data lake source apps)
» Variety of applications - web, batch, mainframes
» Variety of consumption patterns
» Multi-regulatory guided data storage
» 50+ countries
5. History
<2012 – Traditional group-wise data warehouses
2012 – Enterprise Data Management on Teradata
2014 – EDM on Hadoop
Until a year ago – Legacy ingestion framework
Now – Unified ingestion platform
9. Essential pre-processing
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
1. Column and row count validation
2. Embedded new-line removal and special-character replacement
3. Datatype validation
4. Data transformation
5. Value defaulting
13. Generation 2 Awesomeness
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
1. Faster development cycle for new applications; easier to reason about the flow with NiFi's visual flow representation.
2. Consistent tooling for preprocessing, ops data management, error reporting, and archival.
3. Significantly faster processing through Spark.
4. ORC performs well for most of our consumption patterns, supporting both predicate and projection push-down.
5. Security and managed concurrency.
17. Types of Data
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
» Master (e.g. customer data)
» Transaction (e.g. banking transactions)
» Transaction data as Master (e.g. editable transactions)
18. Types of Data
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
Master: data accumulated until yesterday + today's changes → today's EOD snapshot, split into History and Current.
Transactional: Monday's snapshot data, Tuesday's snapshot data, Wednesday's snapshot data… accumulate side by side (see the write-strategy sketch below).
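One way to read this slide is as two write strategies. The following is a minimal Spark sketch under assumed paths and column names; the real construct step is a Spark job, but these signatures are illustrative:

import org.apache.spark.sql.{DataFrame, SaveMode}

object SnapshotWriter {
  // Master data: yesterday's accumulated data plus today's changes are
  // rebuilt into a single end-of-day snapshot (the "Current" side).
  def writeMasterEod(eodSnapshot: DataFrame): Unit =
    eodSnapshot.write.mode(SaveMode.Overwrite).orc("/lake/processed/customer/current/")

  // Transactional data: each day's snapshot is appended as its own immutable
  // partition, so Monday, Tuesday, and Wednesday sit side by side.
  def writeTransactional(daily: DataFrame, businessDate: String): Unit =
    daily.write.mode(SaveMode.Append).orc(s"/lake/processed/transactions/business_date=$businessDate/")
}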
19. Frequency
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
» Streaming
» Hourly incremental
» Daily
» Weekly
20. Frequency - Hourly Incremental
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
Data accumulated until Tuesday + Wednesday's hourly changes → Wednesday's snapshot data; each hour's changes fold into the running day's snapshot.
21. Data formats
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
» Output of Change Data Capture systems (delimited)
» Plain delimited
» Fixed width
» Avro messages
» XML
» Spreadsheets
22. Data formats - CDC Delimited
TIMESTAMP|TRANSACTION_ID|OPERATION_TYPE|USER_ID|<DATAFIELDVALUE1>|<DATAFIELDVALUE2>|…
2017-07-20 22:38:04.00000|426664065479|B|DWUSER|OFFICE TELEPHONE NO|21234567890
The first four fields are the CDC columns; the remaining fields are application columns. OPERATION_TYPE is one of I, B, A, or D (a parsing sketch follows).
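A minimal Scala sketch of parsing one such record; the case class and field handling are assumptions based only on the layout above:

// The first four fields are CDC columns; everything after is application data.
final case class CdcRecord(
  timestamp: String,
  transactionId: String,
  operationType: String, // one of I (insert), B (before image), A (after image), D (delete)
  userId: String,
  dataFields: Seq[String]
)

object CdcParser {
  def parse(line: String): CdcRecord = {
    val parts = line.split('|').toSeq
    CdcRecord(parts(0), parts(1), parts(2), parts(3), parts.drop(4))
  }
}

Parsing the example line above yields operationType "B" with two application fields ("OFFICE TELEPHONE NO", "21234567890").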
23. CDC Delimited - Snapshot building
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
Data accumulated until yesterday + today's changes → snapshot building → History and Snapshot.
D records move to History; the I record, or the latest A record per key, moves to the Snapshot (see the sketch below).
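A hedged Spark sketch of that rule, assuming hypothetical column names ("op_type", "record_key", "cdc_ts") and that the remaining columns match the snapshot schema:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object SnapshotBuilder {
  // Returns (new snapshot, history additions) following the slide's rule:
  // D records move to History; the I or latest A per key wins in the snapshot.
  def build(yesterday: DataFrame, todaysChanges: DataFrame): (DataFrame, DataFrame) = {
    val history = todaysChanges.filter(col("op_type") === "D")

    val latestUpserts = todaysChanges
      .filter(col("op_type").isin("I", "A"))
      .withColumn("rn", row_number().over(
        Window.partitionBy("record_key").orderBy(col("cdc_ts").desc)))
      .filter(col("rn") === 1)
      .drop("rn", "op_type", "cdc_ts") // assume the rest matches the snapshot schema

    // Keep yesterday's rows whose keys were not touched today, then add the upserts.
    val untouched = yesterday.join(
      todaysChanges.select("record_key").distinct(), Seq("record_key"), "left_anti")

    (untouched.unionByName(latestUpserts), history)
  }
}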
24. Data formats and Frequency
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
TYPES OF DATA: Master (e.g. customer data); Transaction (e.g. banking transactions); Transaction data as Master (e.g. editable transactions).
DATA FORMATS: Output of Change Data Capture systems (delimited); Plain delimited; Fixed width; Avro messages; XML; Spreadsheets.
FREQUENCY: Streaming; Hourly incremental; Daily; Weekly.
25. Data formats - Delimited with Header/Trailer
H3
TRANS_ID_1|USER_ID1|100.00
TRANS_ID_2|USER_ID2|3.00
TRANS_ID_3|USER_ID1|20.00
T123.00
» The header and trailer carry meta information about the data in the file: the header holds the row count of the file (H3 → 3 rows) and the trailer a sum or rolling checksum of column(s) in that file (T123.00 → 100.00 + 3.00 + 20.00).
» There are many varieties of header and trailer formats.
* Header and trailer information is used during reconciliation (see the sketch below).
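A minimal Scala sketch of reconciliation against a file like the one above, assuming the single-letter H/T prefixes and a plain sum over the last column (real feeds vary, as the slide notes):

import scala.io.Source

object TrailerReconciler {
  def reconcile(path: String): Boolean = {
    val lines = Source.fromFile(path).getLines().toVector

    val expectedRows = lines.head.stripPrefix("H").trim.toInt       // H3      -> 3 rows
    val expectedSum  = BigDecimal(lines.last.stripPrefix("T").trim) // T123.00 -> 123.00

    val data      = lines.drop(1).dropRight(1)
    val actualSum = data.map(_.split('|').last).map(BigDecimal(_)).sum

    // A mismatch in either the row count or the checksum rejects the whole file.
    data.size == expectedRows && actualSum == expectedSum
  }
}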
27. The final piece to the puzzle
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
PROCESSING, among the others already discussed:
» Schema evolution (see the sketch below)
» Cascading re-runs
» Full-dump override
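As a sketch of the schema-evolution point: an Avro staged layer can tolerate an added column when read through Spark by supplying a reader schema with a defaulted field. The schema, field names, and path are illustrative; "avroSchema" is the spark-avro reader-schema option:

import org.apache.spark.sql.SparkSession

object SchemaEvolutionRead {
  // Reader schema with a new, defaulted field: older Avro files that lack
  // "segment" are still readable, and the field comes back as null.
  val readerSchema: String =
    """{
      |  "type": "record", "name": "Account",
      |  "fields": [
      |    {"name": "account_id", "type": "string"},
      |    {"name": "branch",     "type": "string"},
      |    {"name": "segment",    "type": ["null", "string"], "default": null}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("schema-evolution").getOrCreate()
    spark.read.format("avro")
      .option("avroSchema", readerSchema)
      .load("/lake/staged/accounts/") // hypothetical path
      .printSchema()
  }
}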
29. To summarise
1. Our code is (still) manageable and extensible.
2. The volume, the variety, and the interesting combinations of data presented us with some very interesting problems to solve.
3. And… we intend to open-source the framework for the general audience.
4. With HDF and HDP, we were able to abstract away cross-cutting concerns such as security and concurrency.