Dev Lakhani, Data Scientist at Batch Insights, talks on "Real Time Big Data Applications for Investment Banks and Financial Institutions" at the first Big Data Frankfurt event, held at Die Zentrale and organised by Dataconomy Media.
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applications for Investment Banks & Financial Institutions"
1. Real Time Big Data Applications for
Investment Banks & Financial Institutions
2. Dev Lakhani
• 15 years Software Architecture & Development Experience
• 7 Years of Big Data Experience
• Big Data Architectures for Banks, Telecom, Retail, Media
• Deutsche Telekom
• ASOS
• Tier 1 Investment Banks in Canary Wharf
• Dentsu Aegis
• Contributor to Hadoop, Spark, Tachyon, HBase, Ignite
• uk.linkedin.com/in/devlakhani
3. • Overview of Big Data in financial
institutions
• Architectural constraints in investment
banking
• Implementation challenges
• Data model
• Future for financial applications
Introduction
4. • This talk has a technical focus
• This presentation is not representative of any client
• Real time re-definition for Big Data
• Vendor neutral talk
Disclaimers
5. Real Time Definition
[AS MODIFIER] Computing Relating to a system in which input data is
processed within milliseconds so that it is available virtually immediately as
feedback to the process from which it is coming, e.g. in a missile guidance
system: real-time signal processing; real-time software
http://www.oxforddictionaries.com/definition/english/real-time
6. Real Time Definition (Modified)
[AS MODIFIER] Computing Relating to a system in which input data is
processed within a guaranteed response time, using up-to-date
(latest version) information and available on demand as feedback to
the process from which it is coming.
8. Big Data Drivers for Investment Banking &
Financial Institutions
• Capturing billions of trades
• Quantifying risk and exposure
• Regulatory requirements
• Response to news and events
• Detect fraud, rogue trading and anomalies
• Performing simulations & algorithmic trading
• Business analysis - P&L
• Capital reserves and forecasting
Why Use Big Data?
10. • Disaster avoidance (not recovery) through
replication and redundancy
• High availability
• "Chinese Wall" policy and segmentation of
information
• Within the bank
• External to the bank
• Security & role based segmentation
• Responsiveness and throughput
• API or service based architecture, transparent to quants/end users
• Data completeness: 1 lost trade = $1 < x < $10 million of error in the VaR estimate
Constraints
• Distributed File System, ingest raw data
• Regulatory compliance & archiving
• Last option disaster recovery
• Direct access to "power-users" for modelling and
analysis
Big Data Solution Architecture Components
• Distributed Warehouse (write/read sketch after this slide)
• Not always highly transactional
• The trading exchange worries about the trade/transaction
• Eventual consistency is sufficient
• SQL vs NoSQL
• MPP (Massively Parallel Processing)
• In memory vs on disk tuning
Big Data Solution Architecture Components
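To make the warehouse bullet concrete, here is a minimal, hypothetical sketch of writing and reading a trade record in HBase (one of the stores named later in the deck) through the happybase client. The Thrift host, table name, column family and row-key layout are illustrative assumptions, not details from the talk.

```python
# Hypothetical warehouse write/read: storing a trade in HBase via happybase.
# The Thrift host, table name, column family and row key are assumptions.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # assumes a Thrift gateway
trades = connection.table("trades")                     # assumed table name

# Write one trade; HBase stores all cells as raw bytes.
trades.put(b"20150601#EQ#000123", {
    b"d:instrument": b"VOD.L",
    b"d:quantity":   b"10000",
    b"d:price":      b"2.31",
    b"d:trader":     b"desk-7",
})

# Read the latest version back on demand - the "up-to-date, on demand"
# behaviour from the modified real-time definition earlier in the deck.
row = trades.row(b"20150601#EQ#000123")
print(row[b"d:price"])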
• Analytics and Serving Layer
• Perform descriptive stats
• Trade summaries
• Risk Calculation
• Monte Carlo Simulation
• Machine learning
• Expose APIs (see the service sketch after this slide)
• Report/Aggregate/Present
Big Data Solution Architecture Components
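As a sketch of the "expose APIs" idea, the serving layer might hand back precomputed results over HTTP so quants and end users never touch the cluster directly. The endpoint name, book identifier and numbers below are hypothetical.

```python
# Hypothetical sketch of the serving layer's "expose APIs" bullet: a thin HTTP
# service returning precomputed analytics; names and numbers are illustrative.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for results produced by the analytics layer and read back from the
# warehouse; hard-coded here purely for illustration.
VAR_BY_BOOK = {"equities-desk-7": 1250000.0}

@app.route("/var/<book>")
def value_at_risk(book):
    var = VAR_BY_BOOK.get(book)
    if var is None:
        return jsonify({"error": "unknown book"}), 404
    return jsonify({"book": book, "var_95_1d": var})

if __name__ == "__main__":
    app.run(port=8080)
```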
14. Physical Processes and Daemons
• HDFS
• Datanodes - store the data
• Journalnodes - shared edits (HA)
• Active and standby NameNodes (HA)
• Zookeeper - coordinate between Namenodes
• YARN
• Resource manager x 2
• Node managers x (number of nodes)
• Job history servers
Lower Level Architecture Components
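A hedged illustration of checking that the daemons listed above are actually running on a node, using the JDK's jps tool to list local JVM processes; the expected set mirrors the slide and would differ per node role.

```python
# Illustrative check (not from the talk) that the daemons listed above are
# running on a node, using the JDK's `jps` tool to list local JVM processes.
import subprocess

EXPECTED = {
    "NameNode", "DataNode", "JournalNode",
    "QuorumPeerMain",        # ZooKeeper
    "ResourceManager", "NodeManager", "JobHistoryServer",
}

def running_daemons():
    out = subprocess.run(["jps"], capture_output=True, text=True).stdout
    # jps prints lines like "12345 NameNode"
    return {parts[1] for parts in (l.split() for l in out.splitlines()) if len(parts) > 1}

missing = EXPECTED - running_daemons()
print("missing daemons:", ", ".join(sorted(missing)) or "none")
```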
15. Physical Processes and Daemons
• HBase (1.0.0)
• N x HBase ZooKeepers
• 2 x HBase masters
• 2 x HBase master regionservers
• N x RegionServers
• Spark
• Master (No HA)
• N x slaves
• Monitoring
• JMX monitoring (polling sketch after this slide)
Lower Level Architecture Components
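Hadoop daemons publish their JMX MBeans as JSON over an HTTP /jmx endpoint, so a basic monitoring probe can be a few lines of Python. The host, port (50070 is the Hadoop 2.x NameNode web UI) and the chosen MBean below are assumptions for illustration.

```python
# Sketch of the JMX monitoring bullet: Hadoop daemons expose their JMX MBeans
# as JSON over HTTP at /jmx. Host, port (50070 = Hadoop 2.x NameNode web UI)
# and the chosen MBean are assumptions for illustration.
import json
import urllib.request

URL = ("http://namenode-host:50070/jmx"
       "?qry=Hadoop:service=NameNode,name=FSNamesystemState")

with urllib.request.urlopen(URL) as resp:
    beans = json.load(resp)["beans"]

state = beans[0]
print("live datanodes:", state["NumLiveDataNodes"])
print("capacity used :", state["CapacityUsed"])
```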
17. • Estimate Value at Risk
• Over a given timeframe, week, month,
year
• A confidence level 95%-99%
• A loss amount, e.g. £1m
What is the maximum potential loss (e.g. > £1m) over that timeframe at that confidence level?
• Using Spark, calculate the covariance matrix of past returns (see the sketch after this slide)
• Use RDDs and parallel data structures to
simulate various conditions
• Sum, aggregate and take bottom 5%
Analytics, Machine Learning & Simulation
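A minimal PySpark sketch of the approach on this slide: estimate the covariance of past returns, simulate correlated return scenarios in parallel on an RDD, and read the 95% VaR off the bottom 5% of simulated P&L. The returns matrix, weights, portfolio value and scenario count are made-up illustration data.

```python
# Minimal sketch of the VaR calculation described on the slide; the returns
# matrix, weights, portfolio value and scenario count are made-up toy data.
import numpy as np
from pyspark import SparkContext

# In a real deployment the master comes from the cluster / spark-submit.
sc = SparkContext("local[*]", "monte-carlo-var")

# Past daily returns: rows = days, columns = instruments (toy data; in
# practice these would be read from the warehouse layer).
past_returns = np.random.normal(0.0, 0.01, size=(250, 4))
weights = np.array([0.4, 0.3, 0.2, 0.1])   # portfolio weights
portfolio_value = 1000000.0                # GBP

mean = past_returns.mean(axis=0)
cov = np.cov(past_returns, rowvar=False)   # covariance matrix of past returns

def simulate(seed):
    """One Monte Carlo scenario: draw correlated instrument returns and
    return the resulting portfolio P&L."""
    rng = np.random.default_rng(seed)
    scenario = rng.multivariate_normal(mean, cov)
    return float(portfolio_value * weights.dot(scenario))

n_scenarios = 100000
pnl = sc.parallelize(range(n_scenarios), numSlices=100).map(simulate)

# 95% one-day VaR: the loss at the boundary of the bottom 5% of simulated P&L.
worst_5pct = pnl.takeOrdered(int(n_scenarios * 0.05))
var_95 = -worst_5pct[-1]
print("95%% 1-day VaR: %.0f GBP" % var_95)

sc.stop()
```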
• Keys have to be distributed evenly (row-key sketch after this slide)
• Encoding and compression choices have to be
made
• LZO, GZ, Snappy, Codecs
• Serialization choices and memory tuning
• Java objects/JSON objects/JSON to Java
• Replication has to be managed and tested
• Cross cluster replication
• Cross data center replication
• Availability and throughput during replication
• Rolling restarts and upgrades
Performance Challenges
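One common way to keep keys evenly distributed (an assumed technique, not one the talk names) is to salt a monotonically increasing row key with a hash-derived bucket prefix so writes spread across regions rather than hotspotting one. A plain-Python sketch:

```python
# Illustration (assumed technique, not from the talk): monotonically increasing
# keys such as timestamps or trade IDs hotspot a single region, so prefix
# ("salt") the key with a hash-derived bucket to spread writes evenly.
import hashlib

NUM_BUCKETS = 16   # typically matched to the number of pre-split regions

def salted_key(trade_id: str) -> bytes:
    digest = hashlib.md5(trade_id.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return ("%02d#%s" % (bucket, trade_id)).encode()

# Point lookups still work: recompute the salt from the trade ID.
print(salted_key("20150601-000123"))   # something like b'07#20150601-000123'
```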
20. • In memory tuning, off heap and on heap, region sizes
• Java tuning - heap, permgen, GC generations (for 20+ daemons!)
• HBase requires a functioning and performant HDFS cluster
• Cassandra requires tuning for compaction, replication
• Spark needs correct partitioning and persistence strategies (sketch after this slide)
• Allocation of resources to nodes, network, disk etc.
• Role and table based segmentation - maintaining the Chinese
Wall
Performance Challenges
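A hedged PySpark sketch of the partitioning and persistence bullet: repartition a skewed RDD and choose an explicit storage level rather than the default cache. The paths, partition counts and memory settings are illustrative only, not recommendations.

```python
# Sketch of the "correct partitioning and persistence strategies" bullet:
# spread a skewed RDD across executors and pick an explicit storage level.
# Paths, partition counts and memory settings are illustrative, not advice.
from pyspark import SparkConf, SparkContext, StorageLevel

conf = (SparkConf()
        .setMaster("local[*]")                     # from the cluster in practice
        .setAppName("partition-and-persist")
        .set("spark.executor.memory", "8g")        # resource allocation per node
        .set("spark.default.parallelism", "200"))
sc = SparkContext(conf=conf)

trades = sc.textFile("hdfs:///data/trades/*.csv")  # assumed path

# Repartition so work is spread evenly, then keep the parsed RDD around for
# repeated analytics, spilling to disk instead of recomputing.
parsed = (trades.map(lambda line: line.split(","))
                .repartition(200)
                .persist(StorageLevel.MEMORY_AND_DISK))

print(parsed.count())
sc.stop()
```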
21. Once you solve that...
• Distributed File System for ingested/archived data
• MPP warehouse for querying and analytics
• Quant layer for machine learning and prediction
• Service layer to expose APIs for VaR, stress tests
• Response guarantees for real time Big Data