Overview of the origins, strengths, and weaknesses of big data systems, and of their use at Bloomberg to solve "medium data" time-series problems. Medium data requires modest clusters with consistently low latency and high availability, and Bloomberg has driven changes to core Hadoop components to address these needs.
Hadoop at Bloomberg: Medium data for the financial industry
1. HADOOP AT BLOOMBERG
MEDIUM DATA NEEDS FOR THE FINANCIAL INDUSTRY
SEPTEMBER // 3 // 2014
// HADOOP AT BLOOMBERG
MATTHEW HUNT @INSTANTMATTHEW
2. BLOOMBERG
Leading Data and Analytics provider to the financial industry
3. BLOOMBERG DATA DIVERSITY
4. DATA MANAGEMENT TODAY
• Data is our business
• Bloomberg doesn’t have a “big data” problem. It has a “medium data” problem…
• Speed and availability are paramount
• Hundreds of thousands of users with expensive requests
We’ve built many systems to address these needs
5. DATA MANAGEMENT CHALLENGES
• Origin: single-security analytics on proprietary Unix
• Replication of systems and data
• PriceHistory and PORT
• Calcroute caches, “acquisition”
• Mostly for performance reasons
• Complexity kills
>96% Linux. 100% of top 40. (Top 500 Supercomputer list, 2013)
6. DATA MANAGEMENT TOMORROW
• Fewer and simpler systems
• More financial instruments and easy multi-security
• New products
• Drop in machines for capacity
• Retain our independence
• Benefit from external developments
• … there’s just that little matter of details
7. “BIG DATA” ORIGINS
• “Big Data” == PETABYTES.
• Economics problem: index the web, serve lots of users… and go bankrupt
• Big machine$, data center $pace, $AN
• So, cram as many cheap boxes with cheap local disks into as small a space as possible, cheaply
• Many machines == lots of failures.
• Writing distributed programs is hard.
• Build frameworks to ease that pain.
• $$$ PROFIT $$$
8. DATA MANAGEMENT AT GOOGLE
“It takes a village – and a framework”
9. HAD OOPS THE RECKONING
• Most people don’t have hundreds of petabytes, including us.
• Other systems are faster or more mature.
• Relational databases are very good.
• Few people.
• Hadoop is written in Java.
• … wait, what? So why bother?
10. HADOOP THE REASONING
• We have a medium data problem, and so does everyone else.
• Chunks of the problem, especially time series, are a known fit today.
• Potential to address the issues listed earlier.
• The pace of development is swift.
11. HADOOP* THE ECOSYSTEM
• Huge and growing ecosystem of services with tons of momentum
• The amount of talent and money pouring in is astronomical
• We’ve seen this story before
12. HADOOP PACE OF DEVELOPMENT
13. ALTERNATIVES
• Alternatives tend to be pricey, would lock us into a single vendor, or only solve part of the equation
• Comdb2 has saved us hundreds of millions of dollars at a minimum
• Functionality is converging
16. HBASE WHEN TO USE?
• Not suitable for every data storage problem
• Make sure you can live without some features an RDBMS can provide…
18. PRICE HISTORY / PORT
• PriceHistory serves up end-of-day time series data and drives much of the terminal’s charting functionality
• >5 billion requests a day, serving out TBs of data
• 100K queries per second on average and 500K per second at peak
• PORT represents a demanding and contrasting query pattern.
19. HISTORY: THE HISTORY
• Key: security, field, date
• ~10M securities. Hundreds of billions of datapoints.
• Large PORT retrievals = PriceHistory challenge
• 20K bonds x 40 fields x 1 year = 10M rows
• Answer: Comdb2 fork of PriceHistory: SCUBA
• 2-3 orders of magnitude faster
• Works, but…
• … uncomfortably medium
SECURITY | FIELD    | DATE     | VALUE
IBM      | VOLUME   | 20140321 | 12,535,281
         | VOLUME   | 20140320 | 5,062,629
         | VOLUME   | 20140319 | 4,323,930
GOOG     | CLOSE PX | 20140321 | 1,183.04
         | CLOSE PX | 20140320 | 1,197.16
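The (security, field, date) key above maps naturally onto a sorted row key. A minimal sketch of one plausible encoding (the delimiter and layout are illustrative assumptions, not the production schema): storing the date as YYYYMMDD means lexicographic key order equals chronological order, so a date range becomes one contiguous key-range scan.

```python
def make_row_key(security: str, field: str, date: str) -> bytes:
    # Hypothetical key layout: security|field|YYYYMMDD. Fixed-width
    # dates make lexicographic order match chronological order.
    return f"{security}|{field}|{date}".encode("utf-8")

def range_scan_bounds(security: str, field: str, start: str, end: str):
    # A fetch like "IBM VOLUME for March 2014" becomes a single
    # [start_key, stop_key) scan over the sorted key space.
    return (make_row_key(security, field, start),
            make_row_key(security, field, end))
```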
20. PH/PORT ON HBASE A DEEPER LOOK
• Time series data fetches are embarrassingly parallel.
• Application has simplistic data types and query patterns
• No need for joins; key-based lookup only
• Data sets are large enough to consider manual sharding… administrative overhead
• Require a commodity framework to consolidate various disparate systems built over time
• Frameworks bring the benefit of additional analytical tools
HBase is an excellent fit for this problem domain
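“Embarrassingly parallel” here means each (security, field) series can be fetched with no coordination against the others, so a client can simply fan out. A sketch of that fan-out with a stand-in fetch function (`fetch_series` is a placeholder for a key-range scan, not a real API):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_series(security: str, field: str):
    # Placeholder for a key-range scan against the cluster.
    return [(security, field, d) for d in ("20140319", "20140320", "20140321")]

def fetch_portfolio(securities, fields, max_workers=8):
    # Each (security, field) pair is independent, so the fetches
    # run concurrently; results come back in submission order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_series, s, f)
                   for s in securities for f in fields]
        return [f.result() for f in futures]
```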
21. HBASE CHALLENGES ENCOUNTERED
• HBase is a technology that we’ve decided to back, and we are making substantial changes to it in the process.
• Our requirements from HBase are the following:
• Read performance: fast with low variability
• High availability
• Operational simplicity
• Efficient use of our good hardware (128G RAM, SSD, 16 cores)
Bloomberg has been investing in all these aspects of HBase
22. RECOVERY IN HBASE TODAY
23. HA SOLUTIONS CONSIDERED
• Recovery time is now 1-2 minutes: unacceptable to us.
• Even if recovery time is optimized down to zero, we have to wait to detect the failure.
• Where to read from in the interim?
• Another cluster in another DC?
• Another cluster in the same DC?
• Two tables, a primary and a shadow, kept in the same HBase instance?
24. OUR SOLUTION (HBASE-10070)
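HBASE-10070 adds timeline-consistent region replicas: while a primary region is recovering, secondary replicas can serve possibly-stale reads, answering the “where to read from in the interim?” question above. The client-side idea is roughly a hedged read: try the primary, and after a short timeout race the replicas and take the first answer. A simplified sketch (the timeout value and callable-based interface are illustrative, not the real HBase client API):

```python
import concurrent.futures
import time

def hedged_read(key, primary, replicas, primary_timeout=0.01):
    """Try the primary first; if it does not answer within
    primary_timeout, race the replicas and accept the first
    (possibly stale) response, as in timeline consistency."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        fut = pool.submit(primary, key)
        try:
            return fut.result(timeout=primary_timeout)
        except concurrent.futures.TimeoutError:
            backups = [pool.submit(r, key) for r in replicas]
            done, _ = concurrent.futures.wait(
                [fut] + backups,
                return_when=concurrent.futures.FIRST_COMPLETED)
            return next(iter(done)).result()
```

The trade-off is exactly the one on the slide: availability and bounded latency in exchange for reads that may lag the primary.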
25. PERFORMANCE, PERFORMANCE, PERFORMANCE
We’ve been pushing the performance aspects of HBase:
• Multiple RegionServers
• Off-heap and multi-level block cache
• Co-processors for waterfalling and lookback of PORT SCUBA requests
• Synchronized garbage collection
• Continuously-compacting garbage collectors
26. DISTRIBUTION AND PARALLELISM
SALT THE KEY: PERFECT DISTRIBUTION OF DATA
MORE REGION SERVERS = MORE PARALLELISM
GOOD, UP TO A POINT…
Region Servers per machine | Total Region Servers | Average time
1  | 11  | 260ms
3  | 33  | 185ms
5  | 55  | 160ms
10 | 110 | 170ms
(260ms -> 160ms)
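Salting the key, as on the slide, prefixes each row key with a small hash-derived bucket so that lexicographically adjacent keys (successive dates for one security) spread across region servers instead of hotspotting one. A minimal sketch (the bucket count of 11 mirrors the 11-server row above; the key format is illustrative):

```python
import hashlib

NUM_BUCKETS = 11  # e.g. one bucket per region server

def salt_key(key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a stable hash bucket so adjacent keys
    land on different region servers, distributing the data."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
    return f"{bucket:02d}|{key}"
```

The cost is that a range scan now becomes one scan per bucket prefix, which is exactly what makes the read parallel.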
27. IN PLACE COMPUTATION
SCUBA REQUIRES AVG OF 5 DB ROUND TRIPS FOR WATERFALLING + LOOKBACK
‘DO IT AT THE DATA’ IN ONE ROUND TRIP…
Waterfall source priority: 1) BarCap 2) Merrill 3) BVAL 4) PXNum: 5755 5) Portfolio Lid:33:7477039:23
(diagram: a coprocessor on each region server resolves the waterfall per security at the data; different securities carry different subsets of sources)
160ms -> 85ms
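The waterfalling logic itself is simple: for each security, walk a prioritized list of pricing sources and take the first one that has a value. Pushing that loop into a server-side coprocessor is what collapses five client round trips into one. A sketch of the per-security waterfall (source names are from the slide; the dict-based data layout is illustrative):

```python
# Prioritized pricing sources, in the order listed on the slide.
WATERFALL = ["BarCap", "Merrill", "BVAL", "PXNum", "Portfolio"]

def resolve_waterfall(row: dict, priority=WATERFALL):
    """Return (source, value) for the first source present in the
    row, mimicking what a server-side coprocessor does once, at
    the data, instead of the client probing source by source."""
    for source in priority:
        if source in row:
            return source, row[source]
    return None, None
```

For example, a security with only Merrill and BVAL quotes resolves to the Merrill value: `resolve_waterfall({"Merrill": 99.5, "BVAL": 99.4})` returns `("Merrill", 99.5)`.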
28. SYNCHRONIZED DISRUPTION
WHY WERE 10 REGION SERVERS SLOWER?
HIGH FANOUT AND THE LAGGARD PROBLEM
SOLUTION: GC AT THE SAME TIME
Region Servers per machine | Total Region Servers | Average time
1  | 11  | 260ms
3  | 33  | 185ms
5  | 55  | 160ms
10 | 110 | 170ms
(85ms -> 60ms)
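The laggard problem is just order statistics: a fanned-out request is as slow as its slowest shard, so the chance that at least one of n servers is mid-GC when a request arrives is 1 - (1 - p)^n, which climbs quickly with fanout. Synchronizing GC pauses makes the stalls overlap instead of compounding. A quick illustration (the 2% per-server pause probability is an assumed figure, not a measurement from the deck):

```python
def p_any_laggard(p: float, n: int) -> float:
    """Probability that at least one of n independently-pausing
    servers is in a GC pause when a fanned-out request arrives."""
    return 1.0 - (1.0 - p) ** n

# With an assumed 2% chance that any one server is pausing:
#   11 servers  -> ~20% of requests hit a laggard
#   110 servers -> ~89% of requests hit a laggard
```

This is why 110 region servers averaged worse than 55 in the table above, and why pausing everyone at the same time helps.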
29. PORT: NOT JUST PERFORMANCE
“Custom REDACTED TOP SECRET FEATURE for Portfolio analytics”
Requires inspection, extraction, and modification of the whole data set.
Becomes one simple script
30. HADOOP INFRASTRUCTURE
• Chef recipes for cluster install and configuration with a high degree of repeatability
• Current focus is on developing the monitoring toolkit
• These clusters will be administered by the Hadoop Infrastructure team
• Along with the Hadoop stack, the team also aims to offer [ OTHER COOL TECHNOLOGIES ] as service offerings