Slide 2
BNY Mellon
• $30.6 trillion in assets under custody and/or administration*
• $1.7 trillion in assets under management*
• 100+ markets across the world*
* All figures as of March 31, 2017
Slide 3
BNY Mellon Technology at a Glance
• 11,000+ employees in over 50 cities
• $195M retail accounts serviced
• 200,000 professionals access our services daily
• 8,000 Java virtual machines in production
• 1.6 billion Digital Pulse events/month from 100+ sources
Slide 4
Background
• In response to the financial crisis of 2007-2008, significant changes were made to financial regulation.
• DATA management has become a focus in the past 10 years. This effort is primarily in response to some of those requirements.
• The main idea was Data Integration across the enterprise, and managing lineage and quality.
Slide 8
[Diagram: four sources and three consumers wired together through a tangle of point-to-point ETL jobs, databases, Hadoop, and caches.]
It is a lot messier than this... it is more like a hairball.
Slide 9
[Diagram: the same four sources and three consumers, now connected only through a central Distribution Hub.]
Slide 10
The idea was... a Distribution Hub
• Sources and consumers are not connected directly.
• Decoupling lets consumers and sources evolve independently.
• The middle layer evolves with technology.
Slide 11
Our Vision:
• Be the trusted, go-to data provider.
• Centralize:
  - Transformation and enrichment logic
  - Security
• Make it easy to manage data lineage.
• Monitor to ensure elasticity and consistent performance.
Slide 12
Our Challenges:
• 1000s of systems with different and diverse data structures.
• Needed a flexible schema-on-read approach.
• Given the size, we didn't know what the end state would look like.
Slide 13
What we really needed was...
• A platform for this distribution hub
• One that evolves with changing technology and requirements
• A self-service, business-friendly model
• Our own technology, not a vendor tool
• ZERO data LOSS
• Reconcilable
• Centralized data lineage
• The basis for an enterprise Data Dictionary
Slide 15
Micro-services
• Framework built as microservices.
• LEGO blocks – allowing us to use these blocks to transform and morph the platform, making no assumptions about the future.
Slide 16
Our platform uses Kafka extensively in both message- and file-based paradigms.
Let's look at how we used Kafka to achieve reliability, performance, and economy.
Slide 17
Our Scaling Strategy - 3 Dimensions of Scaling*
• Functional Decomposition - MICROSERVICES
• Horizontal Scaling - INTERNAL CLOUD
• Data Splitting - SHARDING
* The Art of Scalability – Martin Abbott and Michael Fisher
Slide 18
Our Scaling Strategy - 3 Dimensions of Scaling

Functional Decomposition - MICROSERVICES
• Separation of work by responsibility.
• Based on microservices, where each service is specialized for a task. Each microservice is exposed via the API store, resulting in more reuse.
• Tasks that need more CPU can be scaled separately without scaling the entire infrastructure.

Horizontal Scaling - INTERNAL CLOUD
• Cloning of services or data so that work can be easily distributed across instances with absolutely no bias.
• Implemented by scaling out on the BNY Mellon Cloud.
• Functional decomposition is required for easy horizontal scaling.

Data Splitting - SHARDING
• As data grows, handling scale in a horizontally scaled environment gets harder.
• For horizontal scaling, the data must be split so that the memory requirement for each node stays consistent.
• Data is split across a set of servers; each server deals with only a subset of the data, which improves memory management and transaction scalability. A minimal sketch of this routing follows.
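To make the data-splitting idea concrete, here is a minimal, hypothetical routing sketch in Java (hash-modulo assignment and the node list are illustrative assumptions, not the deck's actual scheme):

// Hypothetical shard routing; the deck does not describe the real algorithm.
import java.util.List;

public class ShardRouter {
    private final List<String> shardNodes; // e.g. ["node-a", "node-b", "node-c"]

    public ShardRouter(List<String> shardNodes) {
        this.shardNodes = shardNodes;
    }

    // Route a record to a shard by hashing its key: each node only ever
    // sees its own subset of the data, keeping per-node memory bounded.
    public String nodeFor(String recordKey) {
        int shard = Math.floorMod(recordKey.hashCode(), shardNodes.size());
        return shardNodes.get(shard);
    }
}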
Slide 20
Monolith to Micro-service
We started with a monolith: a single Inbound/Outbound pipeline running Extraction, Validation, Staging, Enrichment, and Distribution.
[Diagram: the monolithic pipeline decomposed into separate Extraction, Validation, Staging, Enrichment, and Distribution microservices, with RULES driving the Inbound and Outbound flows.]
Slide 21
In both cases, we used Vertica to stage our data.
[Pipeline: Extraction → Validation → Staging → Enrichment → Distribution, with Vertica at the Staging step.]
Slide 22
Vertica – SQL Analytics Platform
Blazingly fast and scalable. Runs on Hadoop, on premise, and in the cloud.
Slide 23
Batch-Based Processing for Files
• Traditional file processing.
• Started with Hadoop, then moved to traditional batch-based processing.
• Batch-based – simple but not scalable.
• Resulted in missed SLAs and was very resource-intensive.
[Pipeline: Extraction → Validation → Staging → Enrichment → Distribution]
Slide 24
We integrated all microservices using Kafka and maintained state using OFFSETS, so we knew exactly where to restart if things failed. A minimal consumer sketch follows.
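A minimal sketch of this pattern using the standard Apache Kafka Java client; the topic name, group id, and broker address are placeholders:

// Hypothetical worker consuming file-split instructions.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SplitWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "file-split-workers");
        props.put("enable.auto.commit", "false"); // commit only after the work succeeds
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("file-splits"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // validate/stage the split, etc.
                }
                // Commit offsets only after successful processing: after a crash,
                // the group resumes from the last committed offset, so nothing is lost.
                consumer.commitSync();
            }
        }
    }

    private static void process(String splitInstruction) { /* placeholder */ }
}

Because offsets are committed only after the work succeeds, a crashed worker resumes from the last committed offset rather than losing or reprocessing data silently.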
Slide 25
Second Dimension - Horizontal Scaling
• Kafka is the basis for integrating all microservices.
• Partitions in Kafka allowed separate paths for different files, so larger files would not impact smaller files.
• File splits run concurrently; for services that need synchronous behavior, we used request-response.
• We maintain state using OFFSETS, so we knew exactly where to restart if things failed.
• Each PARTITION contains the file-split instructions related to one file; splits for a specific file are processed concurrently.
• Kafka syncs across multiple regions, and the active-active cluster setup ensures there is no Single Point of Failure.
A sketch of keying split messages by file follows.
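A hedged sketch of publishing split instructions keyed by file, so that all of one file's splits hash to the same partition and a large file never blocks a small one on another path (the topic name, file ID, and message format are illustrative assumptions, not the deck's actual scheme):

// Hypothetical publisher of file-split instructions.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SplitPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder
        props.put("acks", "all"); // wait for full replication: no loss on broker failure
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String fileId = "positions_20170331.csv"; // placeholder
            for (int split = 0; split < 5; split++) {
                // Keying by file ID hashes every split of this file to the same
                // partition, giving each file its own path through the hub.
                producer.send(new ProducerRecord<>("file-splits", fileId,
                        fileId + "|split-" + split));
            }
        }
    }
}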
Slide 26
Third Dimension - Data Sharding
• Each file is split into smaller files; the splits for a specific file are processed concurrently.
• Instead of JDBC inserts, we use the COPY construct in Vertica.
• Larger files are broken into smaller files, and those jobs can be sent to a cluster of servers.
• We could now transfer files in parallel, then reconcile the pieces in Vertica.
[Diagram: a client file split into pieces 1-5, fanned out through Kafka to the cluster, loaded into Vertica, and reconciled.]
A sketch of one such COPY load follows.
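A minimal sketch of loading one split with Vertica's COPY ... FROM LOCAL via JDBC; the table name, file paths, and credentials are placeholders:

// Hypothetical loader for a single file split.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SplitLoader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica-host:5433/db", "dbadmin", "secret");
             Statement stmt = conn.createStatement()) {
            // COPY ... FROM LOCAL streams the split from the client to the cluster
            // in one bulk operation, avoiding row-by-row JDBC inserts entirely.
            // REJECTED DATA captures bad rows for later reconciliation instead
            // of failing the whole load.
            stmt.execute(
                "COPY staging.positions FROM LOCAL '/data/splits/positions_part_3.csv' "
              + "DELIMITER ',' REJECTED DATA '/data/rejects/positions_part_3.rej'");
        }
    }
}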
Slide 27
Resolved Concerns
• The COPY statement in Vertica bulk-loads data into an HPE Vertica database.
• Kafka maintained the offsets, which meant a failure did not require us to restart from scratch.
• Idempotency was a key gain: the client could simply call again, and the system would know where to restart from.
• Network timeouts and similar issues were no longer a problem.
• Restarting a job – we could now resume even when it failed partway through.
Slide 28
Ability to Fine-Tune Our Engine
Instances, split size, and memory were three dimensions we could play with, based on the availability of hardware, as sketched below.
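As a purely hypothetical illustration of those three knobs (the property names and default values are invented for this sketch; the deck does not name its actual settings):

// Invented tuning knobs, read from system properties for illustration.
public final class EngineTuning {
    // Number of parallel worker instances to run.
    static final int INSTANCES = Integer.getInteger("hub.instances", 8);
    // Target size of each file split, in bytes (64 MB here).
    static final long SPLIT_SIZE_BYTES = Long.getLong("hub.splitSizeBytes", 64L << 20);
    // Memory budget per worker, in megabytes.
    static final long WORKER_MEMORY_MB = Long.getLong("hub.workerMemoryMb", 2048L);
}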
Slide 29
Real Time for Messages
• STORM for managing the streams.
• Kafka as a queue.
• Vertica to stage the data and create outbounds.
[Pipeline: Extraction → Validation → Staging → Enrichment → Distribution]
A minimal topology sketch follows.
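A minimal topology sketch, assuming the storm-kafka-client spout; the topic, broker address, parallelism hints, and the bolt's logic are placeholders:

// Hypothetical Storm topology: Kafka in, enrichment bolt out.
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class MessageTopology {
    // Trivial bolt standing in for validation/enrichment logic.
    public static class EnrichBolt extends BaseBasicBolt {
        @Override public void execute(Tuple input, BasicOutputCollector collector) {
            String value = input.getStringByField("value"); // the Kafka record value
            // ... validate/enrich here, then hand off for staging in Vertica ...
        }
        @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        KafkaSpoutConfig<String, String> spoutConfig =
            KafkaSpoutConfig.builder("broker:9092", "inbound-messages").build(); // placeholders
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
        builder.setBolt("enrich", new EnrichBolt(), 4).shuffleGrouping("kafka-spout");
        StormSubmitter.submitTopology("message-hub", new Config(), builder.createTopology());
    }
}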
Slide 30
Typical Real-time Streaming Architecture
[Diagram: a chain of Storm topologies connected by Kafka topics – Kafka feeding a topology, which feeds Kafka, which feeds the next topology, and so on.]
Slide 31
Vertica-Kafka Integration
• Kafka is designed for a streaming use case.
• In Vertica, a streaming effect can be achieved by running a series of COPY statements.
• BUT – that process can become tedious and complex.
• So we used the Kafka integration to load data into the Vertica database, as sketched below.
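A sketch of one such streaming micro-batch, assuming Vertica's Kafka integration (the KafkaSource source and KafkaJSONParser parser) is installed; the topic, table, and connection details are placeholders:

// Hypothetical micro-batch load from Kafka into Vertica over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StreamingLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica-host:5433/db", "dbadmin", "secret");
             Statement stmt = conn.createStatement()) {
            // One micro-batch: read about ten seconds' worth of messages from
            // partition 0 of the topic (offset -2 = earliest available) and
            // parse each message as a JSON record into the staging table.
            stmt.execute(
                "COPY staging.events "
              + "SOURCE KafkaSource(stream='digital-pulse|0|-2', "
              + "brokers='broker:9092', duration=INTERVAL '10 seconds') "
              + "PARSER KafkaJSONParser()");
        }
    }
}

Vertica also ships a scheduler that can run such micro-batches continuously, which is what removes the tedium of hand-running COPY statements.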
Slide 33
Resource Consumption: 70-80% Gain
Percent difference between JDBC and Kafka Excavator.
Slide 34
This will be an ever-evolving, dynamic project. We have changed, evolved, and redesigned so that it is easy for us to keep evolving and stay forward-looking.
Thanks
Speaker Notes
We have thousands of systems with different and diverse data structures.
We needed a flexible schema-on-read approach.
One cannot create a global schema to accommodate all the diversity.
Given the size, we really didn't know what our end state would look like.
What we really needed was to...
Build a platform for this distribution hub that evolves with changing technology and requirements.
At the same time, it provides a self-service model that is business friendly.
And that is the only way to be cost effective.
We needed our own technology, not a vendor tool...
A true platform...
where business users can build their own custom transformation rules...
which responds to all changing trends and thinking.
ZERO data LOSS.
And be reconcilable. And...
Serve as a centralized data hub between sources and consumers, enriching, transforming, and delivering data across the company.
It is the brains behind everything around data movement.
It enriches, transforms, and delivers data.
And this is what we built --
This framework is built as a collection of microservices.
Like LEGO blocks - allowing us to morph this product over time, making no assumptions about the future.
What were our pillars for functional decomposition?
At the highest level - Inbound and Outbound.
Sources and consumers are not connected directly - no point-to-point connections.
Decoupling lets consumers and sources evolve; they have different volume and performance requirements.
The middle layer evolves with technology without ever changing the business rules or the interfaces.
Vertica is a blazingly fast SQL analytics platform based on a columnar, MPP architecture. It supports advanced analytics and machine learning functions. Vertica is an ACID-compliant database that supports an ANSI SQL interface for querying your data.
Kafka is the basis for integrating all microservices.
Partitions in Kafka allowed us to build separate paths for different files, so larger files would not impact smaller files.
File splits run concurrently, as each split is handled by a group of nodes dedicated to that partition.
We used Kafka to load balance across multiple REGIONS.
Load balancing came out of the box.
We created separate paths for larger and smaller files.
To increase concurrency, we split large files into multiple smaller files and processed them in parallel.
And we did that again using...
Kafka partitions, which maintain the state. Clusters listen on partitions.
The COPY statement in Vertica bulk-loads data into an HPE Vertica database. One can initiate loading one or more files or pipes on a cluster host or on a client system (using the COPY LOCAL option).
This helped us with some of the classic database issues.
For instance, throughput when loading into RDBMS databases is a general concern in the ETL space and often creates bottlenecks and backpressure.
And Vertica COPY was very efficient!
Once everything was processed, we reconciled and merged all the files.
Kafka maintained the offsets, which meant any failure would not require us to restart.
Idempotency was a key gain, as the client could just call again and the system would know where to restart from.
Earlier, a really large file meant issues with network timeouts and other problems outside our control.
Restarting the job would mean more delay...
Now we could restart even when it failed partway through.
Ability to fine-tune our engine -
Instances, split size, and memory were three dimensions we could play with based on the availability of the hardware.
Kafka is designed for a streaming use case (high volumes of data with low latency). In Vertica, one can achieve a streaming effect by running a series of COPY statements, each of which loads a small amount of data into your database.
However, this process can become tedious and complex. Instead, we used the Kafka integration feature to automatically load data into the Vertica database as it streams through Kafka.