SDM is a distributed, reliable, and highly available data lake ingestion framework that provides data processing, archival, and reconciliation, together with effective change-based history management for batch and streaming data. It is metadata-driven and provides automated schema evolution. The SDM platform is built entirely on open-source software, making it both extensible and robust. Data management, schema evolution, and archival are achieved through Apache NiFi's built-in capabilities and extensions via custom processors and controller services. The end-of-day construct is generated through an Apache Spark job.
Types of Data:
1. Batch
a. Full dump
b. Incremental
c. Hybrid (Daily incremental + Weekly/Monthly full dump)
2. Near real-time
a. CDC-Kafka
b. JMS-Kafka
3. Extractions
a. Incremental, based on a Change Data Capture tool (IBM InfoSphere CDC)
b. Sqoop
c. JDBC/ODBC
4. Manual File Upload
a. Excel
Types of Process:
1. File validation
a. File integrity (header, trailer, data checksum)
b. File de-duplication
c. New-line and non-printable control character handling
2. Structural validation (Row validation)
a. Fixed width
b. Delimited
c. XML
d. JSON
e. Excel (Single/Multi tab)
f. Datatype validation
g. Constraint validation – null, primary key, and full-row de-duplication
3. Defaulting
a. Condition based
b. Special data-type handling (mainframe systems)
4. Operational assurance
a. Row count logging
b. Reconciliation with source
c. File/Record rejections with reasons
5. Lineage tracking
a. A row-id is generated for every record and referenced against the source file all the way to the processed layer (see the sketch below).
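As an illustration of the row-id lineage above, here is a minimal Spark sketch in Scala. The paths and column names ("src_file", "row_id") are assumptions for illustration, not the framework's actual names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, monotonically_increasing_id}

object RowIdLineage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-id-lineage").getOrCreate()

    // Tag each record with its source file and a unique row-id so it can be
    // traced from the raw file all the way to the processed layer.
    val staged = spark.read.option("header", "true").csv("/lake/raw/accounts/") // hypothetical path
      .withColumn("src_file", input_file_name())
      .withColumn("row_id", monotonically_increasing_id())

    staged.write.format("avro").save("/lake/staged/accounts/") // staged layer is Avro
  }
}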
Storage formats:
1. Raw – Archival
2. Avro – Staged
3. ORC – Processed (see the sketch below)
Benefits
1. Metadata-driven
2. Extensible
3. Scalable
4. Flexible
Plans
Current State:
1. Custom-built ingestion framework leveraging standard open-source software from Apache.
2. Data from 100+ source systems is ingested into the Hadoop data lake using the framework.
Plan:
1. Open-sourcing the framework for general consumption.
2. A metadata management UI/API that would serve as a glossary of the data available in the data lake, with search capabilities.
3. Operational and Exception reporting.
4. Centralized data retention within the framework.
5. Health monitoring and alerting.
6. Provenance data maintenance in Atlas.
Speaker
Arun Manivannan, Senior Data Engineer, Standard Chartered Bank
3. Data in SCB context
» Hundreds of applications (~180 data lake source apps)
» Variety of applications - web, batch, mainframes
» Variety of consumption patterns
» Multi-regulatory guided data storage
» 50+ countries
5. History
<2012 – Traditional group-wise data warehouses
2012 – Enterprise Data Management on Teradata
2014 – EDM on Hadoop
Until a year ago – Legacy ingestion framework
Now – Unified ingestion platform
9. Essential pre-processing
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
1. Column and row count validation
2. Embedded new-line removal and special-character replacement
3. Datatype validation
4. Data transformation
5. Value defaulting
13. Generation 2 Awesomeness
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
1. Faster development cycle for new applications; easier to reason about the flow with NiFi's visual flow representation.
2. Consistent tooling for preprocessing, ops data management, error reporting, and archival.
3. Significantly faster processing through Spark.
4. ORC performs well for most of our consumption patterns, supporting both predicate and projection push-down.
5. Security and managed concurrency.
17. Types of Data
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
» Master (e.g. customer data)
» Transaction (e.g. banking transactions)
» Transaction data as Master (e.g. editable transactions)
18. Types of Data
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
Master: data accumulated until yesterday + today's changes → today's EOD snapshot, split into History and Current.
Transactional: Monday's snapshot data, Tuesday's snapshot data, Wednesday's snapshot data… accumulate side by side (see the write-strategy sketch below).
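One way to read this slide is as two write strategies. The following is a minimal Spark sketch under assumed paths and column names; the real construct step is a Spark job, but these signatures are illustrative:

import org.apache.spark.sql.{DataFrame, SaveMode}

object SnapshotWriter {
  // Master data: yesterday's accumulated data plus today's changes are
  // rebuilt into a single end-of-day snapshot (the "Current" side).
  def writeMasterEod(eodSnapshot: DataFrame): Unit =
    eodSnapshot.write.mode(SaveMode.Overwrite).orc("/lake/processed/customer/current/")

  // Transactional data: each day's snapshot is appended as its own immutable
  // partition, so Monday, Tuesday, and Wednesday sit side by side.
  def writeTransactional(daily: DataFrame, businessDate: String): Unit =
    daily.write.mode(SaveMode.Append).orc(s"/lake/processed/transactions/business_date=$businessDate/")
}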
19. Frequency
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
» Streaming
» Hourly incremental
» Daily
» Weekly
20. Frequency - Hourly Incremental
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
Data accumulated until Tuesday + Wednesday's hourly changes → Wednesday's snapshot data; each hour's changes fold into the running day's snapshot.
21. Data formats
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
» Output of Change Data Capture systems (delimited)
» Plain delimited
» Fixed width
» Avro messages
» XML
» Spreadsheets
22. Data formats - CDC Delimited
TIMESTAMP|TRANSACTION_ID|OPERATION_TYPE|USER_ID|<DATAFIELDVALUE1>|<DATAFIELDVALUE2>|…
2017-07-20 22:38:04.00000|426664065479|B|DWUSER|OFFICE TELEPHONE NO|21234567890
The first four fields are the CDC columns; the remaining fields are application columns. OPERATION_TYPE is one of I, B, A, or D (a parsing sketch follows).
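A minimal Scala sketch of parsing one such record; the case class and field handling are assumptions based only on the layout above:

// The first four fields are CDC columns; everything after is application data.
final case class CdcRecord(
  timestamp: String,
  transactionId: String,
  operationType: String, // one of I (insert), B (before image), A (after image), D (delete)
  userId: String,
  dataFields: Seq[String]
)

object CdcParser {
  def parse(line: String): CdcRecord = {
    val parts = line.split('|').toSeq
    CdcRecord(parts(0), parts(1), parts(2), parts(3), parts.drop(4))
  }
}

Parsing the example line above yields operationType "B" with two application fields ("OFFICE TELEPHONE NO", "21234567890").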
23. CDC Delimited - Snapshot building
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
Data accumulated until yesterday + today's changes → snapshot building → History and Snapshot.
D records move to History; the I record, or the latest A record per key, moves to the Snapshot (see the sketch below).
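A hedged Spark sketch of that rule, assuming hypothetical column names ("op_type", "record_key", "cdc_ts") and that the remaining columns match the snapshot schema:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object SnapshotBuilder {
  // Returns (new snapshot, history additions) following the slide's rule:
  // D records move to History; the I or latest A per key wins in the snapshot.
  def build(yesterday: DataFrame, todaysChanges: DataFrame): (DataFrame, DataFrame) = {
    val history = todaysChanges.filter(col("op_type") === "D")

    val latestUpserts = todaysChanges
      .filter(col("op_type").isin("I", "A"))
      .withColumn("rn", row_number().over(
        Window.partitionBy("record_key").orderBy(col("cdc_ts").desc)))
      .filter(col("rn") === 1)
      .drop("rn", "op_type", "cdc_ts") // assume the rest matches the snapshot schema

    // Keep yesterday's rows whose keys were not touched today, then add the upserts.
    val untouched = yesterday.join(
      todaysChanges.select("record_key").distinct(), Seq("record_key"), "left_anti")

    (untouched.unionByName(latestUpserts), history)
  }
}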
24. Data formats and Frequency
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
TYPES OF DATA: Master (e.g. customer data); Transaction (e.g. banking transactions); Transaction data as Master (e.g. editable transactions).
DATA FORMATS: Output of Change Data Capture systems (delimited); Plain delimited; Fixed width; Avro messages; XML; Spreadsheets.
FREQUENCY: Streaming; Hourly incremental; Daily; Weekly.
25. Data formats - Delimited with Header/Trailer
H3
TRANS_ID_1|USER_ID1|100.00
TRANS_ID_2|USER_ID2|3.00
TRANS_ID_3|USER_ID1|20.00
T123.00
» The header and trailer carry meta information about the data in the file: the header holds the row count of the file (H3 → 3 rows) and the trailer a sum or rolling checksum of column(s) in that file (T123.00 → 100.00 + 3.00 + 20.00).
» There are many varieties of header and trailer formats.
* Header and trailer information is used during reconciliation (see the sketch below).
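A minimal Scala sketch of reconciliation against a file like the one above, assuming the single-letter H/T prefixes and a plain sum over the last column (real feeds vary, as the slide notes):

import scala.io.Source

object TrailerReconciler {
  def reconcile(path: String): Boolean = {
    val lines = Source.fromFile(path).getLines().toVector

    val expectedRows = lines.head.stripPrefix("H").trim.toInt       // H3      -> 3 rows
    val expectedSum  = BigDecimal(lines.last.stripPrefix("T").trim) // T123.00 -> 123.00

    val data      = lines.drop(1).dropRight(1)
    val actualSum = data.map(_.split('|').last).map(BigDecimal(_)).sum

    // A mismatch in either the row count or the checksum rejects the whole file.
    data.size == expectedRows && actualSum == expectedSum
  }
}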
27. The final piece to the puzzle
RECEIVE → CLEANSE → VALIDATE → CONSTRUCT → RECORD
PROCESSING, among the others already discussed:
» Schema evolution (see the sketch below)
» Cascading re-runs
» Full-dump override
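As a sketch of the schema-evolution point: an Avro staged layer can tolerate an added column when read through Spark by supplying a reader schema with a defaulted field. The schema, field names, and path are illustrative; "avroSchema" is the spark-avro reader-schema option:

import org.apache.spark.sql.SparkSession

object SchemaEvolutionRead {
  // Reader schema with a new, defaulted field: older Avro files that lack
  // "segment" are still readable, and the field comes back as null.
  val readerSchema: String =
    """{
      |  "type": "record", "name": "Account",
      |  "fields": [
      |    {"name": "account_id", "type": "string"},
      |    {"name": "branch",     "type": "string"},
      |    {"name": "segment",    "type": ["null", "string"], "default": null}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("schema-evolution").getOrCreate()
    spark.read.format("avro")
      .option("avroSchema", readerSchema)
      .load("/lake/staged/accounts/") // hypothetical path
      .printSchema()
  }
}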
29. To summarise
1. Our code is (still) manageable and extensible.
2. The volume, the variety, and the interesting combinations of data presented us with some very interesting problems to solve.
3. And… we intend to open-source the framework for the general audience.
4. With HDF and HDP, we were able to abstract away cross-cutting concerns such as security and concurrency.