Slide 2
BNY Mellon
• $30.6 trillion in assets under custody and/or administration*
• $1.7 trillion in assets under management*
• 100+ markets across the world*
* All figures as of March 31, 2017
Slide 3
BNY Mellon Technology at a Glance
• 11,000+ employees in over 50 cities
• $195M retail accounts serviced
• 200,000 professionals access our services daily
• 8,000 Java virtual machines in production
• 1.6 billion Digital Pulse events/month from 100+ sources
Slide 4
Background
• In response to the financial crisis of 2007-2008, significant changes were made to financial regulation.
• DATA management has become a focus in the past 10 years. This effort is primarily in response to some of those requirements.
• The main idea was Data Integration across the enterprise, and managing lineage and quality.
Slide 8
[Diagram: four sources and three consumers wired together through a tangle of point-to-point ETL jobs, databases, Hadoop, and caches.]
It is a lot messier than this... it is more like a hairball.
Slide 9
[Diagram: the same four sources and three consumers, now connected only through a central Distribution Hub.]
Slide 10
The idea was... a Distribution Hub
• Sources and consumers are not connected directly.
• Decoupling lets consumers and sources evolve independently.
• The middle layer evolves with technology.
Slide 11
Our Vision:
• Be the trusted, go-to data provider.
• Centralize:
  - Transformation and enrichment logic
  - Security
• Make it easy to manage data lineage.
• Monitor to ensure elasticity and consistent performance.
Slide 12
Our Challenges:
• 1000s of systems with different and diverse data structures.
• Needed a flexible schema-on-read approach.
• Given the size, we didn't know what the end state would look like.
Slide 13
What we really needed was...
• A platform for this distribution hub
• One that evolves with changing technology and requirements
• A self-service, business-friendly model
• Our own technology, not a vendor tool
• ZERO data LOSS
• Reconcilable
• Centralized data lineage
• The basis for an enterprise Data Dictionary
Slide 15
Micro-services
• Framework built as microservices.
• LEGO blocks – allowing us to use these blocks to transform and morph the platform, making no assumptions about the future.
Slide 16
Our platform uses Kafka extensively in both message- and file-based paradigms.
Let's look at how we used Kafka to achieve reliability, performance, and economy.
Slide 17
Our Scaling Strategy - 3 Dimensions of Scaling*
• Functional Decomposition - MICROSERVICES
• Horizontal Scaling - INTERNAL CLOUD
• Data Splitting - SHARDING
* The Art of Scalability – Martin Abbott and Michael Fisher
Slide 18
Our Scaling Strategy - 3 Dimensions of Scaling

Functional Decomposition - MICROSERVICES
• Separation of work by responsibility.
• Based on microservices, where each service is specialized for a task. Each microservice is exposed via the API store, resulting in more reuse.
• Tasks that need more CPU can be scaled separately without scaling the entire infrastructure.

Horizontal Scaling - INTERNAL CLOUD
• Cloning of services or data so that work can be easily distributed across instances with absolutely no bias.
• Implemented by scaling out on the BNY Mellon Cloud.
• Functional decomposition is required for easy horizontal scaling.

Data Splitting - SHARDING
• As data grows, handling scale in a horizontally scaled environment gets harder.
• For horizontal scaling, the data must be split so that the memory requirement for each node stays consistent.
• Data is split across a set of servers; each server deals with only a subset of the data, which improves memory management and transaction scalability. A minimal sketch of this routing follows.
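To make the data-splitting idea concrete, here is a minimal, hypothetical routing sketch in Java (hash-modulo assignment and the node list are illustrative assumptions, not the deck's actual scheme):

// Hypothetical shard routing; the deck does not describe the real algorithm.
import java.util.List;

public class ShardRouter {
    private final List<String> shardNodes; // e.g. ["node-a", "node-b", "node-c"]

    public ShardRouter(List<String> shardNodes) {
        this.shardNodes = shardNodes;
    }

    // Route a record to a shard by hashing its key: each node only ever
    // sees its own subset of the data, keeping per-node memory bounded.
    public String nodeFor(String recordKey) {
        int shard = Math.floorMod(recordKey.hashCode(), shardNodes.size());
        return shardNodes.get(shard);
    }
}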
Slide 20
Monolith to Micro-service
We started with a monolith: a single Inbound/Outbound pipeline running Extraction, Validation, Staging, Enrichment, and Distribution.
[Diagram: the monolithic pipeline decomposed into separate Extraction, Validation, Staging, Enrichment, and Distribution microservices, with RULES driving the Inbound and Outbound flows.]
Slide 21
In both cases, we used Vertica to stage our data.
[Pipeline: Extraction → Validation → Staging → Enrichment → Distribution, with Vertica at the Staging step.]
Slide 22
Vertica – SQL Analytics Platform
Blazingly fast and scalable. Runs on Hadoop, on premise, and in the cloud.
Slide 23
Batch-Based Processing for Files
• Traditional file processing.
• Started with Hadoop, then moved to traditional batch-based processing.
• Batch-based – simple but not scalable.
• Resulted in missed SLAs and was very resource-intensive.
[Pipeline: Extraction → Validation → Staging → Enrichment → Distribution]
Slide 24
We integrated all microservices using Kafka and maintained state using OFFSETS, so we knew exactly where to restart if things failed. A minimal consumer sketch follows.
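A minimal sketch of this pattern using the standard Apache Kafka Java client; the topic name, group id, and broker address are placeholders:

// Hypothetical worker consuming file-split instructions.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SplitWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "file-split-workers");
        props.put("enable.auto.commit", "false"); // commit only after the work succeeds
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("file-splits"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // validate/stage the split, etc.
                }
                // Commit offsets only after successful processing: after a crash,
                // the group resumes from the last committed offset, so nothing is lost.
                consumer.commitSync();
            }
        }
    }

    private static void process(String splitInstruction) { /* placeholder */ }
}

Because offsets are committed only after the work succeeds, a crashed worker resumes from the last committed offset rather than losing or reprocessing data silently.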
Slide 25
Second Dimension - Horizontal Scaling
• Kafka is the basis for integrating all microservices.
• Partitions in Kafka allowed separate paths for different files, so larger files would not impact smaller files.
• File splits run concurrently; for services that need synchronous behavior, we used request-response.
• We maintain state using OFFSETS, so we knew exactly where to restart if things failed.
• Each PARTITION contains the file-split instructions related to one file; splits for a specific file are processed concurrently.
• Kafka syncs across multiple regions, and the active-active cluster setup ensures there is no Single Point of Failure.
A sketch of keying split messages by file follows.
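A hedged sketch of publishing split instructions keyed by file, so that all of one file's splits hash to the same partition and a large file never blocks a small one on another path (the topic name, file ID, and message format are illustrative assumptions, not the deck's actual scheme):

// Hypothetical publisher of file-split instructions.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SplitPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder
        props.put("acks", "all"); // wait for full replication: no loss on broker failure
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String fileId = "positions_20170331.csv"; // placeholder
            for (int split = 0; split < 5; split++) {
                // Keying by file ID hashes every split of this file to the same
                // partition, giving each file its own path through the hub.
                producer.send(new ProducerRecord<>("file-splits", fileId,
                        fileId + "|split-" + split));
            }
        }
    }
}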
Slide 26
Third Dimension - Data Sharding
• Each file is split into smaller files; the splits for a specific file are processed concurrently.
• Instead of JDBC inserts, we use the COPY construct in Vertica.
• Larger files are broken into smaller files, and those jobs can be sent to a cluster of servers.
• We could now transfer files in parallel, then reconcile the pieces in Vertica.
[Diagram: a client file split into pieces 1-5, fanned out through Kafka to the cluster, loaded into Vertica, and reconciled.]
A sketch of one such COPY load follows.
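A minimal sketch of loading one split with Vertica's COPY ... FROM LOCAL via JDBC; the table name, file paths, and credentials are placeholders:

// Hypothetical loader for a single file split.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SplitLoader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica-host:5433/db", "dbadmin", "secret");
             Statement stmt = conn.createStatement()) {
            // COPY ... FROM LOCAL streams the split from the client to the cluster
            // in one bulk operation, avoiding row-by-row JDBC inserts entirely.
            // REJECTED DATA captures bad rows for later reconciliation instead
            // of failing the whole load.
            stmt.execute(
                "COPY staging.positions FROM LOCAL '/data/splits/positions_part_3.csv' "
              + "DELIMITER ',' REJECTED DATA '/data/rejects/positions_part_3.rej'");
        }
    }
}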
Slide 27
Resolved Concerns
• The COPY statement in Vertica bulk-loads data into an HPE Vertica database.
• Kafka maintained the offsets, which meant a failure did not require us to restart from scratch.
• Idempotency was a key gain: the client could simply call again, and the system would know where to restart from.
• Network timeouts and similar issues were no longer a problem.
• Restarting a job – we could now resume even when it failed partway through.
Slide 28
Ability to Fine-Tune Our Engine
Instances, split size, and memory were three dimensions we could play with, based on the availability of hardware, as sketched below.
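As a purely hypothetical illustration of those three knobs (the property names and default values are invented for this sketch; the deck does not name its actual settings):

// Invented tuning knobs, read from system properties for illustration.
public final class EngineTuning {
    // Number of parallel worker instances to run.
    static final int INSTANCES = Integer.getInteger("hub.instances", 8);
    // Target size of each file split, in bytes (64 MB here).
    static final long SPLIT_SIZE_BYTES = Long.getLong("hub.splitSizeBytes", 64L << 20);
    // Memory budget per worker, in megabytes.
    static final long WORKER_MEMORY_MB = Long.getLong("hub.workerMemoryMb", 2048L);
}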
Slide 29
Real Time for Messages
• STORM for managing the streams.
• Kafka as a queue.
• Vertica to stage the data and create outbounds.
[Pipeline: Extraction → Validation → Staging → Enrichment → Distribution]
A minimal topology sketch follows.
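A minimal topology sketch, assuming the storm-kafka-client spout; the topic, broker address, parallelism hints, and the bolt's logic are placeholders:

// Hypothetical Storm topology: Kafka in, enrichment bolt out.
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class MessageTopology {
    // Trivial bolt standing in for validation/enrichment logic.
    public static class EnrichBolt extends BaseBasicBolt {
        @Override public void execute(Tuple input, BasicOutputCollector collector) {
            String value = input.getStringByField("value"); // the Kafka record value
            // ... validate/enrich here, then hand off for staging in Vertica ...
        }
        @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        KafkaSpoutConfig<String, String> spoutConfig =
            KafkaSpoutConfig.builder("broker:9092", "inbound-messages").build(); // placeholders
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
        builder.setBolt("enrich", new EnrichBolt(), 4).shuffleGrouping("kafka-spout");
        StormSubmitter.submitTopology("message-hub", new Config(), builder.createTopology());
    }
}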
Slide 30
Typical Real-time Streaming Architecture
[Diagram: a chain of Storm topologies connected by Kafka topics – Kafka feeding a topology, which feeds Kafka, which feeds the next topology, and so on.]
Slide 31
Vertica-Kafka Integration
• Kafka is designed for a streaming use case.
• In Vertica, a streaming effect can be achieved by running a series of COPY statements.
• BUT – that process can become tedious and complex.
• So we used the Kafka integration to load data into the Vertica database, as sketched below.
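A sketch of one such streaming micro-batch, assuming Vertica's Kafka integration (the KafkaSource source and KafkaJSONParser parser) is installed; the topic, table, and connection details are placeholders:

// Hypothetical micro-batch load from Kafka into Vertica over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StreamingLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:vertica://vertica-host:5433/db", "dbadmin", "secret");
             Statement stmt = conn.createStatement()) {
            // One micro-batch: read about ten seconds' worth of messages from
            // partition 0 of the topic (offset -2 = earliest available) and
            // parse each message as a JSON record into the staging table.
            stmt.execute(
                "COPY staging.events "
              + "SOURCE KafkaSource(stream='digital-pulse|0|-2', "
              + "brokers='broker:9092', duration=INTERVAL '10 seconds') "
              + "PARSER KafkaJSONParser()");
        }
    }
}

Vertica also ships a scheduler that can run such micro-batches continuously, which is what removes the tedium of hand-running COPY statements.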
Slide 33
Resource Consumption: 70-80% Gain
Percent difference between JDBC and Kafka Excavator.
Slide 34
This will be an ever-evolving, dynamic project. We have changed, evolved, and redesigned so that it is easy for us to keep evolving and stay forward-looking.
Thanks
Speaker Notes
We have thousands of systems with different and diverse data structures.
We needed a flexible schema-on-read approach.
One cannot create a global schema to accommodate all the diversity.
Given the size, we really didn't know what our end state would look like.
What we really needed was to...
Build a platform for this distribution hub that evolves with changing technology and requirements.
At the same time, it provides a self-service model that is business friendly.
And that is the only way to be cost effective.
We needed our own technology, not a vendor tool...
A true platform...
where business users can build their own custom transformation rules...
which responds to all changing trends and thinking.
ZERO data LOSS.
And be reconcilable. And...
Serve as a centralized data hub between sources and consumers, enriching, transforming, and delivering data across the company.
It is the brains behind everything around data movement.
It enriches, transforms, and delivers data.
And this is what we built --
This framework is built as a collection of microservices.
Like LEGO blocks - allowing us to morph this product over time, making no assumptions about the future.
What were our pillars for functional decomposition?
At the highest level - Inbound and Outbound.
Sources and consumers are not connected directly - no point-to-point connections.
Decoupling lets consumers and sources evolve; they have different volume and performance requirements.
The middle layer evolves with technology without ever changing the business rules or the interfaces.
Vertica is a blazingly fast SQL analytics platform based on a columnar, MPP architecture. It supports advanced analytics and machine learning functions. Vertica is an ACID-compliant database that supports an ANSI SQL interface for querying your data.
Kafka is the basis for integrating all microservices.
Partitions in Kafka allowed us to build separate paths for different files, so larger files would not impact smaller files.
File splits run concurrently, as each split is handled by a group of nodes dedicated to that partition.
We used Kafka to load balance across multiple REGIONS.
Load balancing came out of the box.
We created separate paths for larger and smaller files.
To increase concurrency, we split large files into multiple smaller files and processed them in parallel.
And we did that again using...
Kafka partitions, which maintain the state. Clusters listen on partitions.
The COPY statement in Vertica bulk-loads data into an HPE Vertica database. One can initiate loading one or more files or pipes on a cluster host or on a client system (using the COPY LOCAL option).
This helped us with some of the classic database issues.
For instance, throughput when loading into RDBMS databases is a general concern in the ETL space and often creates bottlenecks and backpressure.
And Vertica COPY was very efficient!
Once everything was processed, we reconciled and merged all the files.
Kafka maintained the offsets, which meant any failure would not require us to restart.
Idempotency was a key gain, as the client could just call again and the system would know where to restart from.
Earlier, a really large file meant issues with network timeouts and other problems outside our control.
Restarting the job would mean more delay...
Now we could restart even when it failed partway through.
Ability to fine-tune our engine -
Instances, split size, and memory were three dimensions we could play with based on the availability of the hardware.
Kafka is designed for a streaming use case (high volumes of data with low latency). In Vertica, one can achieve a streaming effect by running a series of COPY statements, each of which loads a small amount of data into your database.
However, this process can become tedious and complex. Instead, we used the Kafka integration feature to automatically load data into the Vertica database as it streams through Kafka.