HADOOP AT 
BLOOMBERG 
MEDIUM DATA NEEDS FOR THE FINANCIAL INDUSTRY 
SEPTEMBER // 3 // 2014 
// HADOOP AT BLOOMBERG 
MATTHEW HUNT @INSTANTMATTHEW
// HADOOP AT BLOOMBERG 
BLOOMBERG 
Leading Data and Analytics provider to the financial industry 
2
// HADOOP AT BLOOMBERG 
BLOOMBERG DATA DIVERSITY 
3
// HADOOP AT BLOOMBERG 
DATA MANAGEMENT TODAY 
• Data is our business 
• Bloomberg doesn’t have a “big data” 
problem. It has a “medium data” 
problem… 
• Speed and availability are paramount 
• Hundreds of thousands of users with 
expensive requests 
We’ve built many systems to address these needs 
4
// HADOOP AT BLOOMBERG 
DATA MANAGEMENT CHALLENGES 
• Origin: single-security analytics on proprietary Unix 
• Replication of Systems and Data 
• PriceHistory and PORT 
• Calcroute caches, 
“acquisition” 
• Mostly for performance reasons 
• Complexity kills 
5 
>96% Linux. 100% of top 40. 
Top 500 Supercomputer list, 2013
// HADOOP AT BLOOMBERG 
DATA MANAGEMENT TOMORROW 
• Fewer and simpler systems 
• More financial instruments and easy 
multi-security 
• New products 
• Drop in machines for capacity 
• Retain our independence 
• Benefit from external developments 
• … there’s just that little matter of 
details 
6
// HADOOP AT BLOOMBERG 
“BIG DATA” ORIGINS 
• “Big Data” == PETABYTES. 
• Economics problem: index the web, 
serve lots of users… and go bankrupt 
• Big machine$, data center $pace, $AN 
• So, cram as many cheap boxes with 
cheap local disks into as small a space 
as possible, cheaply 
• Many machines == lots of failures. 
• Writing distributed programs is hard. 
• Build frameworks to ease that pain. 
• $$$ PROFIT $$$ 
7
// HADOOP AT BLOOMBERG 
DATA MANAGEMENT AT GOOGLE 
“It takes a 
village – and a 
framework” 
8
// HADOOP AT BLOOMBERG 
HAD OOPS THE RECKONING 
• Most people don’t have hundreds of 
petabytes, including us. 
• Other systems are faster or more mature. 
• Relational databases are very good. 
• Few people. 
• Hadoop is written in Java. 
• ….wait, what? So why bother? 
9
// HADOOP AT BLOOMBERG 
HADOOP THE REASONING 
• We have a medium data problem and so does 
everyone else. 
• Chunks of the problem, especially time series, are a known fit today. 
• Potential to address issues listed earlier. 
• The pace of development is swift. 
10
// HADOOP AT BLOOMBERG 
HADOOP* THE ECOSYSTEM 
• Huge and growing 
ecosystem of services 
with tons of momentum 
• The amount of talent and money pouring in is astronomical 
• We’ve seen this story before 
11
// HADOOP AT BLOOMBERG 
HADOOP PACE OF DEVELOPMENT 
12
// HADOOP AT BLOOMBERG 
ALTERNATIVES 
• Alternatives tend to be pricey, 
would lock us into a single 
vendor, or only solve part of the 
equation 
• Comdb2 has saved us hundreds 
of millions of dollars at a 
minimum 
• Functionalities converging 
13
A DEEPER LOOK 
>>>>>>>>>>>>>>
// HADOOP AT BLOOMBERG 
HBASE OVERVIEW 
15
// HADOOP AT BLOOMBERG 
HBASE WHEN TO USE? 
• Not suited to every data storage problem 
• Make sure you can live without some features an 
RDBMS can provide… 
16
WHAT HAVE WE 
BEEN UP TO? 
>>>>>>>>>>>>>>
// HADOOP AT BLOOMBERG 
PRICE HISTORY / PORT 
• PriceHistory serves up end-of-day time series data and drives much of the terminal’s charting functionality 
• > 5 billion requests a day serving out 
TBs of data 
• 100K queries per second average and 
500K per second at peak 
• PORT represents a demanding & 
contrasting query pattern. 
18
// HADOOP AT BLOOMBERG 
HISTORY: THE HISTORY 
• Key: security, field, date (row-key sketch below) 
• ~10M securities. Hundreds of billions of data points. 
• Large PORT retrievals = PH challenge 
• 20k bonds x 40 fields x 1 year = 10M rows 
• Answer: comdb2 fork of PH: SCUBA 
• 2-3 orders of magnitude faster 
• Works, but… 
• …uncomfortably medium 
19 
SECURITY   FIELD      DATE       VALUE 
IBM        VOLUME     20140321   12,535,281 
           VOLUME     20140320    5,062,629 
           VOLUME     20140319    4,323,930 
GOOG       CLOSE PX   20140321    1,183.04 
           CLOSE PX   20140320    1,197.16
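The (security, field, date) key above maps directly onto an HBase row key. A minimal sketch of a point read against that layout, assuming a hypothetical table name, key delimiter, and column family/qualifier (the deck does not show the actual schema):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PriceHistoryGet {
    // Hypothetical row-key layout: <security>|<field>|<yyyymmdd>
    static byte[] rowKey(String security, String field, String yyyymmdd) {
        return Bytes.toBytes(security + "|" + field + "|" + yyyymmdd);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("price_history"))) {
            Get get = new Get(rowKey("IBM", "VOLUME", "20140321"));
            get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v")); // assumed family "d", qualifier "v"
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("v"));
            System.out.println(value == null ? "no value" : Bytes.toString(value));
        }
    }
}
```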
// HADOOP AT BLOOMBERG 
PH/PORT ON HBASE A DEEPER LOOK 
• Time series data fetches are embarrassingly 
parallel. 
• The application has simple data types and query patterns 
• No need for joins, key-based lookup only (range-scan sketch below) 
• Data sets are large enough to consider manual sharding, but that brings administrative overhead 
• We need a commodity framework to consolidate the disparate systems built over time 
• Frameworks bring benefit of additional 
analytical tools 
HBase is an excellent fit for this 
problem domain 
20
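Because everything is a key-based lookup, a time-series fetch for one security and field is just a range scan over consecutive date-suffixed keys, and independent securities can be scanned in parallel. A hedged sketch using the same hypothetical key layout as above:

```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Scan all dates in [fromDate, toDate) for one (security, field) pair, e.g. IBM VOLUME for 2014.
static void scanTimeSeries(Table table, String security, String field,
                           String fromDate, String toDate) throws java.io.IOException {
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(security + "|" + field + "|" + fromDate));
    scan.setStopRow(Bytes.toBytes(security + "|" + field + "|" + toDate));
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow())); // one dated row per iteration
        }
    }
}
```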
// HADOOP AT BLOOMBERG 
HBASE CHALLENGES ENCOUNTERED 
• HBase is a technology that we’ve decided to back and are 
making substantial changes to in the process. 
• Our requirements from HBase are the following: 
• Read performance - fast with low variability 
• High availability 
• Operational simplicity 
• Efficient use of our good hardware (128G, SSD, 16 cores) 
Bloomberg has been investing in all these aspects of HBase
// HADOOP AT BLOOMBERG 
RECOVERY IN HBASE TODAY 
22
// HADOOP AT BLOOMBERG 
HA SOLUTIONS CONSIDERED 
• Recovery time is now 1-2 mins – unacceptable to us. 
• Even if recovery time is optimized down to zero, have to wait 
to detect failure. 
• Where to read from in the interim? 
• Another cluster in another DC? 
• Another cluster in the same DC? 
• Two tables – primary and a shadow kept in the same HBase 
instance? 
23
// HADOOP AT BLOOMBERG 
OUR SOLUTION (HBASE-10070) 
24
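HBASE-10070 adds region replicas with timeline-consistent reads: while a primary region is recovering, a client willing to accept possibly-stale data can be served by a secondary replica instead of waiting. A minimal sketch of the client side, using the API as it eventually shipped upstream:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

// Read that may be answered by a secondary region replica if the primary is slow or down.
static Result timelineRead(Table table, byte[] row) throws IOException {
    Get get = new Get(row);
    get.setConsistency(Consistency.TIMELINE); // opt in to possibly-stale replica reads
    Result result = table.get(get);
    if (result.isStale()) {
        // Served by a secondary replica; the value may lag the primary slightly.
    }
    return result;
}
```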
// HADOOP AT BLOOMBERG 
PERFORMANCE, PERFORMANCE, 
PERFORMANCE 
We’ve been pushing the performance 
aspects of HBase: 
• Multiple RegionServers 
• Off-heap and Multi-level block 
cache 
• Co-processors for Waterfalling and 
Lookback of PORT Scuba requests 
• Synchronized Garbage Collection 
• Continuously-compacting Garbage 
Collectors 
25
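On the off-heap block cache item: HBase’s BucketCache can keep hot blocks outside the Java heap, which matters on 128G machines where a heap that large would make GC pauses worse. A rough illustration of the relevant settings; the property names are HBase’s, but the values are purely illustrative and the size units have varied across releases, so treat this as a sketch rather than a recipe (these would normally live in hbase-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public final class CacheTuningSketch {
    static Configuration offHeapCacheConfig() {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.bucketcache.ioengine", "offheap"); // enable the off-heap BucketCache
        conf.set("hbase.bucketcache.size", "16384");       // illustrative size only
        conf.setFloat("hfile.block.cache.size", 0.2f);     // on-heap LRU block cache, fraction of heap
        return conf;
    }
}
```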
DISTRIBUTION AND PARALLELISM 26 
// HADOOP AT BLOOMBERG 
MACHINE 1 MACHINE 2 MACHINE 3 
SALT THE KEY: PERFECT DISTRIBUTION OF DATA 
MORE REGION SERVERS = MORE PARALLELISM 
GOOD, UP TO A POINT… 
Region Servers per machine   Total Region Servers   Average time 
 1                            11                    260ms 
 3                            33                    185ms 
 5                            55                    160ms 
10                           110                    170ms 
260ms -> 160ms
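Salting means prefixing each row key with a small, deterministically derived bucket id so that otherwise adjacent keys spread evenly across RegionServers; a range query then fans out one scan per bucket, which is where the extra parallelism in the table above comes from. A minimal sketch with a hypothetical bucket count and key layout:

```java
import org.apache.hadoop.hbase.util.Bytes;

public final class SaltedKeys {
    static final int BUCKETS = 12; // hypothetical; roughly one or more buckets per RegionServer

    // Produce keys like "07|IBM|VOLUME|20140321"; the prefix decides which region hosts the row.
    static byte[] saltedKey(String security, String field, String yyyymmdd) {
        String logical = security + "|" + field + "|" + yyyymmdd;
        int bucket = (logical.hashCode() & Integer.MAX_VALUE) % BUCKETS;
        return Bytes.toBytes(String.format("%02d|%s", bucket, logical));
    }
}
```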
IN PLACE COMPUTATION 27 
// HADOOP AT BLOOMBERG 
MACHINE 1 MACHINE 2 MACHINE 3 
SCUBA REQUIRES AVG OF 5 DB ROUND TRIPS FOR WATERFALLING + LOOKBACK 
‘DO IT AT THE DATA’ IN ONE ROUND TRIP… 
Waterfall: 1) BarCap  2) Merrill  3) BVAL  4) PXNum: 5755  5) Portfolio Lid:33:7477039:23 
[Diagram: the scuba client makes a single round trip; on each RegionServer (RS), a coprocessor (CP) evaluates the waterfall for the securities in the regions it hosts.] 
160ms -> 85ms
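The coprocessor’s job is to walk the pricing waterfall next to the data: for each security, take the first source that has a value, in the priority order shown above, instead of making one client round trip per source. A minimal sketch of that selection logic with hypothetical column names; in the deck this runs server-side inside an HBase coprocessor (the CP box), so the client needs only one round trip:

```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public final class WaterfallSketch {
    // Hypothetical column family and per-source qualifiers, mirroring the slide's waterfall order.
    static final byte[] CF = Bytes.toBytes("px");
    static final byte[][] SOURCES = {
        Bytes.toBytes("BARCAP"), Bytes.toBytes("MERRILL"), Bytes.toBytes("BVAL"),
        Bytes.toBytes("PXNUM"), Bytes.toBytes("PORTFOLIO")
    };

    // Return the first available value for one security's row, in waterfall priority order.
    static byte[] resolve(Result securityRow) {
        for (byte[] source : SOURCES) {
            byte[] value = securityRow.getValue(CF, source);
            if (value != null) {
                return value;
            }
        }
        return null; // no source priced this security
    }
}
```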
SYNCHRONIZED DISRUPTION 28 
// HADOOP AT BLOOMBERG 
MACHINE 1 MACHINE 2 MACHINE 3 
WHY WERE 10 REGION SERVERS SLOWER? 
HIGH FANOUT AND THE LAGGARD PROBLEM 
SOLUTION: GC AT THE SAME TIME 
Region Servers per machine   Total Region Servers   Average time 
 1                            11                    260ms 
 3                            33                    185ms 
 5                            55                    160ms 
10                           110                    170ms 
85ms -> 60ms
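The laggard problem: a high-fanout request is as slow as whichever RegionServer happens to be in a GC pause, and with 110 servers somebody is almost always pausing. Synchronizing the pauses means every node collects at the same wall-clock moment, so a query waits for one shared pause instead of a random straggler. A minimal sketch of the scheduling idea only (not Bloomberg’s actual implementation):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class SynchronizedGcSketch {
    public static void main(String[] args) {
        long periodMs = 60_000; // hypothetical cadence shared by every RegionServer
        // Align the first collection to the next minute boundary so all nodes pause together.
        long initialDelayMs = periodMs - (System.currentTimeMillis() % periodMs);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(System::gc, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
    }
}
```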
// HADOOP AT BLOOMBERG 
PORT: NOT JUST PERFORMANCE 
“Custom REDACTED TOP SECRET FEATURE for Portfolio analytics” 
Requires inspection, extraction, and modification of the whole data set. 
Becomes one simple script 
29
// HADOOP AT BLOOMBERG 
HADOOP INFRASTRUCTURE 
• Chef recipes for cluster install and configuration with a high degree 
of repeatability 
• Current focus is on developing the Monitoring toolkit 
• These clusters will be administered by the Hadoop Infrastructure 
team 
• Along with the Hadoop stack, the team also aims to offer [ OTHER 
COOL TECHNOLOGIES ] as service offerings 
30
SUMMARY 
// HADOOP AT BLOOMBERG
QUESTIONS? 
// HADOOP AT BLOOMBERG
