Hadoop is becoming a standard platform for building critical financial applications such as risk reporting, trading and fraud detection. These applications require demanding SLAs (service-level agreements) in terms of RPO (Recovery Point Objective) and RTO (Recovery Time Objective). To achieve these SLAs, organizations need to build a disaster recovery plan that covers several layers, from the infrastructure through the platform and the applications up to the clients. In this talk, we will present the different architecture blueprints for disaster recovery as well as their corresponding SLA objectives. Then, we will focus on the stretch cluster solution that Crédit Agricole CIB is using in production. We will discuss the solution's advantages and drawbacks and the impact of this approach on the global architecture. Finally, we will explain in detail how to configure and deploy this solution and how to integrate each layer (storage layer, processing layer...) into the architecture.
Disaster Recovery Experience at CACIB: Hardening Hadoop for Critical Financial Applications
1. Page 1
DataWorks Summit - Breakout Session
Disaster Recovery Experience at CACIB:
Hardening Hadoop for Critical Financial Applications
March 21st, 2019
Abdelkrim HADJIDJ – Cloudera
Mohamed Mehdi BEN AISSA – CA-GIP
2. Page 2
Speakers
Mohamed Mehdi BEN AISSA
Big Data Technical Architect at CA-GIP
Big Data Infrastructure Technical Owner for CA-CIB
Abdelkrim HADJIDJ
Solution Engineer at Cloudera
3. Page 3
Agenda
Big Data at CA-GIP & CA-CIB
Disaster Recovery Strategies
Stretch Cluster : Architecture & Configuration
Questions & Answers
5. Page 5
Big Data at CA-GIP & CA-CIB
Key figures:
• CA-GIP: created in 2019, operates 80% of the CA Group infrastructure, 1500 collaborators, 17 sites in France
• CA-CIB: the world's n°13 bank (in 2017, measured by Tier One Capital), 8000 collaborators, 36 locations around the world
• Big Data platform: 15 Big Data experts across infrastructure Build & Run (a Big Data Build team and a Big Data Run team), 8 PB of Big Data storage, 36 TB of memory, 4000 cores
6. Page 6
Big Data at CA-GIP & CA-CIB : Use Cases
• Risk Management
• Decision Making
• Cash Management
• Regulations
7. Page 7
Big Data at CA-GIP & CA-CIB : Principal Use Cases
Risk Management / Regulation
• Aims to replace the current market risk ecosystem and phase out the legacy system (over 10 applications to decommission), providing the bank with a golden source of deal & risk indicators across business lines and worldwide
• Addresses ongoing and future regulations (LBF/Volcker rules, FRTB, BCBS239, Initial Margin, Stress EBA/AQR…)
• 3 PB of data in production to date
Cash Management Transformation
• Strategic program for CA-CIB's new business
• Real-time transaction processing
• Redesign of the payment information system for CA-CIB and international deployment
• Target: 800 million transactions/day (8 TB/day)
Platform capabilities involved: Data Lake, Real-Time Processing
8. Page 8
Big Data at CA-GIP & CA-CIB : Service Offer Architecture
Architecture diagram: DATA SOURCES (records, documents, files, messages, streams) feed an INGESTION layer (batch mode, stream mode), a STORAGE & MESSAGING layer (data storage, messaging), a PROCESSING layer (batch processing, stream processing) and an ACCESS layer (data query (SQL), NoSQL database, indexed data, OLAP) serving the APPLICATIONS (App 1, App 2, … App n). Scheduling, security, monitoring & administration, dataviz and data governance are cross-cutting concerns. Data progresses from RAW DATA to ENHANCED DATA to OPTIMIZED DATA through the layers.
9. Page 9
Big Data at CA-GIP & CA-CIB : Service Level Agreements
Disaster Recovery
• Resiliency
• Service availability 24/7
• Zero data loss
Performance
• Distributed systems
• Scalability
• Data locality
• In-memory processing
Security
• Authentication
• Authorization
• Data protection
• Audit
11. Page 11
Disaster Recovery vs Backup vs Archive
Disaster Recovery (DR)
• Protects against the complete outage of a data center (e.g. natural disaster)
• Disaster recovery includes replication, but also incorporates failover and failback
• The disaster recovery site can be an on-premises or cloud cluster
Backup / Restore
• Protects against logical errors (e.g. accidental deletion, corruption of data, etc.)
• Incremental/full backup mechanisms are required to restore data from a previous Point-In-Time (PIT) version. This usually involves a snapshot mechanism for PIT protection (a minimal sketch follows below).
• Backups/snapshots are kept for a relatively short time (from days to months)
Archive
• A single static copy of data for long-term preservation (several years)
• This is required by some regulations
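As an illustration of the PIT snapshot mechanism mentioned above, here is a minimal sketch (not from the talk) driven from an edge node with the hdfs CLI; the data path and the schedule are assumptions.

```python
# Hypothetical point-in-time protection with HDFS snapshots (illustrative only).
import subprocess
from datetime import datetime, timezone

DATA_DIR = "/data/risk"  # assumed path, for illustration

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# One-time: mark the directory as snapshottable (requires HDFS admin rights).
run(["hdfs", "dfsadmin", "-allowSnapshot", DATA_DIR])

# Periodic (e.g. from cron before each batch load): create a named snapshot.
stamp = datetime.now(timezone.utc).strftime("pit-%Y%m%d-%H%M%S")
run(["hdfs", "dfs", "-createSnapshot", DATA_DIR, stamp])

# Restore after a logical error: copy the old version back out of .snapshot.
run(["hdfs", "dfs", "-cp",
     f"{DATA_DIR}/.snapshot/{stamp}/table1", f"{DATA_DIR}/table1.restored"])
```

Snapshots are cheap copy-on-write metadata, which is why they are the usual PIT building block; note they do not protect against the loss of the cluster itself, which is the DR scenario above.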
12. Page 12
Objective of a Disaster Recovery plan
• SLA (Service-Level Agreement): defines particular aspects of the service (quality, availability, responsibilities):
• RTO (Recovery Time Objective): acceptable service interruption, measured in time
• RPO (Recovery Point Objective): maximum acceptable amount of data loss, measured in time
Goals: minimize service interruption (RTO), minimize data loss (RPO), reduce costs (€), guarantee consistency, optimize performance.
13. Page 13
DR options
Diagram: three DR options, each shown as nodes spread across data centers:
• Dual ingest: data is written to both DC1 and DC2 independently (low RPO/RTO)
• Mirroring: data lands in DC1 and is replicated to DC2 (high RPO/RTO)
• Multiple DC: a single cluster spans DC1, DC2 and DC3 (low RPO/RTO)
14. Page 14
Dual ingest
Diagram: data sources go through a Global Traffic Manager that routes each feed (pub-sub / streaming / batch) to both the PROD cluster and the DR cluster; synchronicity checks / checksums compare the two clusters, and end applications/users reach each site through a Local Traffic Manager (a Kafka-based sketch of this pattern follows below).
• Significant investment
• Might meet RPO = 0 (in sync)
• Active/active site
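To make the pattern concrete, here is a hypothetical dual-ingest sketch using kafka-python (the talk names NiFi and Kafka as enablers; broker addresses, topic and payload are made up):

```python
# Hypothetical dual-ingest sketch: publish every record to both sites.
from kafka import KafkaProducer

# Assumed bootstrap addresses for the two Kafka clusters (illustrative only).
prod = KafkaProducer(bootstrap_servers="kafka-prod:9092", acks="all")
dr = KafkaProducer(bootstrap_servers="kafka-dr:9092", acks="all")

def dual_ingest(topic: str, record: bytes) -> None:
    """Send the same record to PROD and DR, then block until both acknowledge."""
    futures = [prod.send(topic, record), dr.send(topic, record)]
    for f in futures:
        f.get(timeout=30)  # raises if either site fails to acknowledge in time

dual_ingest("payments", b'{"tx_id": 42, "amount": 1000}')
```

Waiting for both acknowledgements is what keeps RPO close to 0; a real deployment also needs a policy for when one site is down (buffer, alert, or degrade to single-site mode), which is part of the routing logic in the diagram above.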
15. Page 15
Dual ingest pros and cons
Pros
• Very low RPO/RTO (almost 0)
• Dual run makes failover and failback easier
• Easy to implement from an infrastructure standpoint; tools like NiFi or Kafka make implementation easier
• Helps detect application bugs/errors (except ML)
Cons
• Requires two clusters, preferably with identical resources
• Requires dual configuration injection (and automation)
• Impact on applications makes implementation complex (e.g. self-service)
• Requires a cluster diff implementation (sketched below)
• Data export should be run only once
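The "cluster diff" requirement above can be approximated by comparing HDFS file checksums across the two sites. A hypothetical sketch (NameNode addresses and paths are assumptions):

```python
# Hypothetical cluster-diff check: compare HDFS checksums between PROD and DR.
# Assumes an edge node with the hdfs CLI and network access to both clusters.
# Note: checksums are only comparable if block size and checksum settings match.
import subprocess

def hdfs_checksum(namenode: str, path: str) -> str:
    out = subprocess.run(
        ["hdfs", "dfs", "-checksum", f"hdfs://{namenode}{path}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()[-1]  # last field of the output line is the checksum

def diff(paths):
    """Return the paths whose contents differ between the two sites."""
    return [p for p in paths
            if hdfs_checksum("nn-prod:8020", p) != hdfs_checksum("nn-dr:8020", p)]

print(diff(["/data/risk/deals/part-00000"]))  # assumed path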
16. Page 16
Mirroring
Diagram: data sources feed raw data ingest (pub-sub / streaming / batch routing) into the PROD cluster, and replicated data flows from the PROD cluster to the DR cluster; a Global Traffic Manager routes end applications/users to each site through Local Traffic Managers.
• Can meet RPO = 1 h to 24 hrs
• Active/passive site
17. Page 17
Mirroring pros and cons
Pros
• Loose requirements, easy to implement
• Big Data technologies are designed for this architecture
• Better performance (throughput, network, latency)
• Can support other use cases (isolation, geo-locality, legal, etc.)
Cons
• Requires two clusters
• High RPO: potential data loss (asynchronous replication) that could be recovered from the source
• Requires a replication layer (a DistCp-based sketch follows below)
• Need to define fail-over/fail-back logic and processes that go beyond just data
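The replication layer mentioned in the cons is typically DistCp-based for HDFS data; a minimal scheduled-job sketch, assuming hypothetical NameNode addresses and paths:

```python
# Hypothetical mirroring job: incremental DistCp from PROD to DR, meant to be
# scheduled (cron/Oozie). Addresses and paths are assumptions.
import subprocess

SRC = "hdfs://nn-prod:8020/data/risk"
DST = "hdfs://nn-dr:8020/data/risk"

subprocess.run([
    "hadoop", "distcp",
    "-update",  # copy only files that are new or changed since the last run
    "-delete",  # propagate deletions so DR stays a faithful mirror
    "-p",       # preserve ownership/permissions (so authorization lines up on DR)
    SRC, DST,
], check=True)
```

The interval at which this job runs is effectively the RPO floor: anything ingested since the last successful run is lost if PROD disappears.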
18. Page 18
Things to consider for mirroring
• Applications (Spark jobs, Hive queries, Zeppelin notebooks, etc.)
• Data (HDFS files, Hive tables, Kafka msgs, etc.)
• Infrastructure (network, hardware, etc.)
• Configurations (OS, binaries, Ambari, agents, RPM, etc.)
• Process (SLAs, business continuity, dev, etc.)
• Metadata (Atlas, Ranger, topics, etc.)
• Client configurations (BI tools, HBase client, REST API, etc.)
• Infrastructure services (LDAP, AD, LB, etc.)
20. Page 20
What RPO can we realistically target?
We can achieve shorter replication intervals and a better RPO (e.g. 10 min), but this depends on several parameters:
• Data: data volume, data bursts, # of partitions/files/tables, insert vs update ratio
• Infrastructure: internal/external bandwidth, latency, dedicated/shared (day/time), CPU **
• Software: synchronicity*, incremental replication, latency (snapshots, compression, encryption, integrity)
* Synchronous: very low RPOs by throttling writes (impact on performance)
** Asynchronous: RPO = F( max(data_generation_rate), available_bandwidth ) (a worked example follows below)
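A worked example of the asynchronous formula above, with made-up numbers:

```python
# Worked example of RPO = F( max(data_generation_rate), available_bandwidth ):
# when a burst generates data faster than the inter-DC link can drain it,
# the replication backlog (hence the achievable RPO) grows. Numbers are made up.
burst_rate_tb_per_h = 8.0   # hypothetical peak ingest rate during a burst
burst_duration_h = 3.0
link_gbps = 10.0            # assumed dedicated inter-DC bandwidth

link_tb_per_h = link_gbps / 8 * 3600 / 1000        # 10 Gb/s is about 4.5 TB/h
backlog_tb = max(0.0, (burst_rate_tb_per_h - link_tb_per_h) * burst_duration_h)
drain_h = backlog_tb / link_tb_per_h

print(f"backlog after burst: {backlog_tb:.1f} TB")      # 10.5 TB
print(f"worst-case replication lag: {drain_h:.1f} h")   # about 2.3 h added to the RPO
```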
21. Page 21
Spanning Multiple Data Centers
Diagram: one cluster spanning DC1 (DataNodes, NN1, ZK1, JN1) and DC2 (DataNodes, NN2, ZK2, JN2), with DC3 as a witness site hosting ZK3 and JN3; data sources ingest raw data and end applications/users connect through a Traffic Manager.
• Restricted to data centers within a geographic region (a few km apart)
• Strong constraints: 3 DCs, single-digit ms latency, guaranteed bandwidth *
• Multi-DC is not native in Hadoop
* https://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_metal.pdf
22. Page 22
Multiple Data Centers pros and cons
Pros
• Better RPO (synchronous replication)
• Cheaper: it's just one cluster
• Simpler for applications
• No need for fail-over/fail-back
Cons
• Strong constraints: 3 nearby DCs, single-digit ms latency, guaranteed bandwidth *
• Advanced configurations: replica placement strategy, YARN labels, etc. (a rack-topology sketch follows below)
• Performance impact from the inter-DC network
• Not suited for all the animals in the zoo (e.g. streaming)
* https://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_metal.pdf
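The "replica placement strategy" above relies on HDFS rack awareness: nodes are mapped to /datacenter/rack paths by a topology script, wired in via net.topology.script.file.name in core-site.xml, so the block placement policy can spread replicas across sites. A hypothetical script (the host inventory is an assumption):

```python
#!/usr/bin/env python3
# Hypothetical rack-topology script for a stretch cluster (not CA-CIB's actual
# script). HDFS invokes it with one or more host/IP arguments and expects one
# "/datacenter/rack" location per argument. Encoding the data center in the
# topology path lets the placement policy keep replicas in more than one DC;
# the exact guarantees depend on the block placement policy in use.
import sys

HOST_TO_LOCATION = {  # assumed inventory, for illustration only
    "dn1.dc1.example.com": "/dc1/rack1",
    "dn2.dc1.example.com": "/dc1/rack2",
    "dn1.dc2.example.com": "/dc2/rack1",
    "dn2.dc2.example.com": "/dc2/rack2",
}

print(" ".join(HOST_TO_LOCATION.get(h, "/default/rack") for h in sys.argv[1:]))
```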
24. Page 24
Stretch Cluster : Why !?
• SLA (Service-Level Agreement): defines particular aspects of the service (quality, availability, responsibilities):
• RTO (Recovery Time Objective): the targeted duration of time and service level within which a business process must be restored after a disaster
• RPO (Recovery Point Objective): the maximum targeted period in which data might be lost
• Goals: RTO -> 0 (24/7 service), RPO = 0 (zero data loss), reduce costs (€), guarantee consistency, maintain performance
25. Page 25
Stretch Cluster : Why !?
Financial Context
DR
• Be able to keep data, services and applications even if a disaster occurs causing the failure of a complete data center
• A separate site (or sites) used to recover from a disaster; can be <100 km away (dark fiber) or >100 km (WAN)
• Synchronous replication is desired (RPO is almost 0) but hard at large scale
Backup
• A consistent backup occurs when the database is in a consistent state, meaning you can restore the backup and open the database without performing media recovery
• When a database is restored from an inconsistent backup, the database must perform media recovery before it can be opened, applying any pending changes from the redo logs