The Bank of England is the central bank of the United Kingdom, established in 1694. Representatives from the Bank's Data Analytics & Modelling team will discuss the Bank of England's journey to delivering a Big Data capability, and how the Hortonworks HDP platform is helping us deliver on our mission to "promote the good of the people of the United Kingdom by maintaining monetary and financial stability". We will explore the challenges we've faced, how we have overcome some of them, and those that remain to be conquered. We will also present our strategy for the Bank's future Big Data platform as we look to scale up further in the coming years.
We will focus in particular on our first successful 'Big Data' production system. It exists in response to the financial crisis of 2008 and the subsequent push to make derivative markets safer by reducing systemic risk; in Europe this was delivered through the European Market Infrastructure Regulation (EMIR). We will explain the Bank of England's role in monitoring UK entities within this important market and describe the significant challenges our team faced in building a data analytics platform to facilitate this.
Speakers
Nick Vaughan, Domain SME - Data Analytics & Modelling
Bank of England
Adrian Waddy, Technical Lead
Bank of England
Promote the Good of the People of the United Kingdom by Maintaining Monetary and Financial Stability
1. DataWorks
Adrian Waddy, Nick Vaughan and Eloise Hindes
The Building Blocks of Big Data
The Bank of England's journey to delivering a big data capability
2. Agenda
1. The Bank
Who we are and what we do
2. Historic IT
Where are we starting from?
3. Data Warehouse
Initial progress
4. Hub 1 and 2
First adventures in Big Data
5. The Future
Where next? Scaling up
Any questions?
4. The Bank
"Arguably we are now the most powerful, unelected institution our country has ever seen. We need to respond to that by becoming more open, more accountable and more transparent."
Spencer Dale
5. The Bank
1694: 'Promote the publick Good and Benefit of our People…'
Current: 'Promoting the good of the people of the United Kingdom by maintaining monetary and financial stability'
15. Data Warehouse
Given that:
• A step change in capability was realised
• The progress made could only be described as a success
Why the need for a change of direction?
17. Data Warehouse
• Data is stored in databases, shared drives and a document management solution, making it difficult to search, retrieve, combine and analyse
• Many individuals rely on their experience and internal network to determine what data exists
• Analytical communities in the Bank would like to collaborate more, and to use new tools and techniques that are becoming standard in highly analytical data environments
• Not all individuals have access to the right tools or environment to be able to run analysis
18. Data Warehouse
• Economic publications gradually moved from the qualitative to the quantitative through the second half of the 20th century
• In the 21st century, and in particular in response to the Financial Crisis, there was a marked acceleration in this process
• The growing variety of mathematical and statistical operations appearing in Economics publications needs data on which to operate!
http://www.istl.org/12-fall/refereed4.html
19. European Market Infrastructure Regulation (EMIR)
• European Parliament & Council of the EU
• Implementation of a G20 commitment
• Risk management regulation
• Avoidance of systemic risk
• Reduce the likelihood and severity of future shocks
• Applies to…
  • Over-the-counter (OTC) derivatives *
  • Central counterparties (CCPs)
  • Trade Repositories (TRs)
• What this meant for the Bank of England
  • Oversight of OTC & exchange trades
  • For UK entities supervised by the PRA
  • 85 million transactions from 6 TRs
  • 80 files of varying schemas (up to 20 GB per file)
  • 200+ columns per file
  • A new data architecture to collect, store and process!
* $595 trillion market – Bank for International Settlements data, end of June 2018
20. Central Banks & Granular Data – 2013
• 'The Future of Regulatory Data and Analytics'
  • A new data strategy?
  • Micro-prudential data with macro-financial statistics?
  • Storing and making use of granular datasets?
  • Can heterogeneous data be harmonised?
  • Who pays the costs for larger, faster and more accurate data?
  • Individual privacy vs public transparency?
• Prudential Regulation Authority
  • A new legal subsidiary of the BoE
  • Supervisory & regulatory responsibilities
  • Promote the safety & soundness of regulated firms
  • Contribute to securing protection for policyholders
  • A requirement to collect, store and process more data
21. Centre for Central Banking Studies – July 2014
• 'Big Data and Central Banks'
  • Diversification of data sources
  • Legalities of enabling / constraining the scope of granular data collections
  • Development of inductive analytical approaches
  • Advancement of data analysis capabilities, ML & AI
  • Open Source tooling
  • Importance of 'Big Data' to Central Banks in the years ahead
• Could Big Data…
  • Change the way that central banks operate?
  • Transform how financial firms and other economic agents do business?
  • Change the economy in ways that impact monetary and financial stability?
  • Have implications for economic growth and employment?
22. Bank of England Strategic Review – 'One Mission, One Bank'
• 'One Bank Data Architecture'
  • Ability to share data across the Bank
  • Reduce data silos
  • Reduce the number of systems
  • Improve discoverability
  • Improve analytical capabilities via shared tooling
  • Support genuine Big Data use cases
• Strategic data themes
  • Management [Governance & Security]
  • Collaboration [Sharing of Data]
  • Standardisation [More robust processing]
  • Exploitation [Tooling for gaining data insight]
24. Low Level Design – Raw Zone
Landing Zone → Raw Zone (the Structured, Refined and Consume Zones sit downstream). Five Trade Repositories deliver zip files over FTP, which are unzipped to csv:
• DTCC: 20 zip files
• UnaVista: 12 zip files
• CME: 8 zip files
• ICE: 9 zip files
• RegisTR: 9 zip files
The Raw Zone stores the historical source files in HDFS in their raw, uncompressed format. Source file formats will change, although such changes will not affect the ingestion and unzip processes on the Raw Zone.
Description
1. FTP process to load zip files into the Data Hub cluster. Keep the existing process that moves the zip files, provided by the business, from the Landing Zone into the Raw Zone.
2. Unzip process to extract the raw data files. Keep the existing process that unzips each file to its raw format. The unzipped csv file is placed temporarily in an HDFS directory, and an external Hive table is created over that directory, allowing the csv file to be queried using Hive or SparkSQL (see the sketch below). At the end of the process, the file is removed.
Benefits
• Standard ETL process, in line with market best practice for loading and storing data in its raw format
Limitations
• N/A
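The external-table trick in step 2 is what lets analysts query a freshly unzipped csv without copying it again. A minimal PySpark sketch of the idea follows; the paths, table name and column names are hypothetical, since the deck does not show the real schemas.

```python
# A minimal sketch of step 2's external Hive table, with hypothetical
# paths and column names. An external table registers metadata only,
# so the csv files stay where the unzip process put them.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("raw-zone-sketch")
         .enableHiveSupport()
         .getOrCreate())

raw_dir = "/data/raw_zone/dtcc/2018-06-30"  # temporary HDFS directory

spark.sql(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_dtcc_tmp (
        trade_id STRING,
        counterparty STRING,
        notional DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '{raw_dir}'
""")

# The csv can now be queried with Hive or SparkSQL while it exists.
spark.sql("SELECT COUNT(*) AS n FROM raw_dtcc_tmp").show()

# Dropping an external table removes only the metadata; the slide's
# process separately deletes the temporary file at the end.
spark.sql("DROP TABLE IF EXISTS raw_dtcc_tmp")
```

Because the table is external, dropping it never touches the underlying data, which keeps the temporary directory's lifecycle under the pipeline's control rather than Hive's.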
25. Low Level Design – Structured Zone
Raw Zone → Structured Zone (csv files converted to ORC; the Refined and Consume Zones sit downstream). Converts the raw files into ORC and applies data type conversions and mapping rules to store the information in a single table.
Description
3. Spark jobs that insert each source file into an individual structured file table. Direct data ingestion from the source file into an ORC Hive table. Each TR file's data is ingested into a different structured ORC table, avoiding any mapping at this stage. Having one table per file also adds flexibility, both for change requests (changes are limited to the specific table, plus the mapping rules in the mapping table if a file is added or an existing one is altered) and for the reprocessing workflow (only the partition of the given file needs to be rerun up to the mapping stage, reducing the overall workload).
4. Spark jobs that map each source file schema to a normalised schema for state information. This simplifies the mapping process, in terms of both query complexity and performance, by having individual Spark mapping jobs target a normalised state TR schema, covering both table structure and data types. A sketch of both steps follows the table below.

Table name | Storage format | Partitions | Data sorted by | Description
**_**_****_****_****** | ORC | year, month, day | - | One ORC table per TR, file and version that stores data in Hive without column mapping
********_***** | ORC | year, month, day, filetype | - | One ORC table for state TR data that stores the mapped columns in a normalised schema

Benefits
• Allows easy access to the raw data, without any changes to its underlying structure or format, and with efficient compression for storage and query efficiency
• Having individual tables for each file simplifies the mapping process and diminishes the reprocessing workload
Limitations
• File sizes on the tables will be suboptimal, although this is mitigated by the simplicity of the mapping process and the flexibility to schema changes
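The one-table-per-file pattern in steps 3 and 4 might look like the following in PySpark. Everything here is hypothetical (the deck masks the real table and column names); it only illustrates the shape of the two jobs: a mapping-free ORC ingest per TR file, then a per-source mapping job onto the shared normalised state schema.

```python
# A minimal sketch, with hypothetical names, of steps 3 and 4: each TR
# csv file lands in its own ORC Hive table with no mapping applied, then
# a per-source Spark job maps that table onto a normalised state schema.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("structured-zone-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Step 3: one ORC table per TR/file/version, partitioned by reporting date.
raw = (spark.read.csv("/data/raw_zone/dtcc/2018-06-30", header=True)
            .withColumn("year", F.lit(2018))
            .withColumn("month", F.lit(6))
            .withColumn("day", F.lit(30)))
(raw.write.format("orc")
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .saveAsTable("st_dtcc_trade_v1"))          # no column mapping yet

# Step 4: map the source-specific schema onto the normalised state schema,
# renaming columns and casting types so every TR lands in the same shape.
normalised = (spark.table("st_dtcc_trade_v1")
    .select(F.col("trade_id").alias("transaction_id"),
            F.col("counterparty").alias("counterparty_1"),
            F.col("notional").cast("decimal(20,2)").alias("notional_amount"),
            "year", "month", "day")
    .withColumn("filetype", F.lit("dtcc_trade")))
(normalised.write.format("orc")
    .mode("append")
    .partitionBy("year", "month", "day", "filetype")
    .saveAsTable("state_tr_normalised"))
```

Because the mapping lives in a small per-source select, a schema change from one TR touches only that TR's job and table, which is the flexibility the slide describes.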
26. Low Level Design – Refined Zone
Structured Zone → Refined Zone. Extracts external data sources in order to enrich and validate the TR state data, maintaining historical data for reprocessing purposes. The external reference data arrives as zip files in the Landing Zone and passes through the Raw Zone into Hive.
Description
5. Load external data sources. A process that loads, unzips and inserts the external data into Hive tables for use in the data preparation step.
6. Spark job that applies business rules and enriches the source data with external table information. Calculates additional business columns and enriches the data with external reference data; applies the de-duplication rule set and the Contract Continuity specifics. A sketch follows the table below.

Table name | Storage format | Partitions | Data sorted by | Description
****_****_***** | ORC | year, month, day | assetclass, counterparty | Stores TR data enriched with external data sources and additional columns calculated based on business rules. These columns include the de-duplication rule set.

Benefits
• A centralised table that aggregates all TR state information into a single point of access
• Segregation of concerns, by calculating the business logic rules and enriching the source data with external sources in a separate layer
Limitations
• Late arrival of files requires reprocessing of the daily partition
• Changes in business transformation requirements require reprocessing of the full table
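A minimal PySpark sketch of step 6, with hypothetical table and column names. The "keep the latest report per transaction" window below is an illustrative stand-in for the Bank's actual de-duplication rule set and Contract Continuity logic, which the deck does not spell out.

```python
# A minimal sketch of step 6: enrich the normalised TR state data with
# external reference data, derive a business column, and de-duplicate.
# All names are hypothetical; the dedup rule is illustrative only.
from pyspark.sql import SparkSession, Window, functions as F

spark = (SparkSession.builder
         .appName("refined-zone-sketch")
         .enableHiveSupport()
         .getOrCreate())

state = spark.table("state_tr_normalised")
ref = spark.table("external_reference")      # e.g. LEI -> firm attributes

# Enrich with reference data and calculate an additional business column.
enriched = (state.join(ref, state.counterparty_1 == ref.lei, "left")
                 .withColumn("is_uk_entity", F.col("country_code") == "GB"))

# De-duplicate: keep one row per transaction, preferring the latest report.
w = Window.partitionBy("transaction_id").orderBy(
        F.col("year").desc(), F.col("month").desc(), F.col("day").desc())
deduped = (enriched.withColumn("rn", F.row_number().over(w))
                   .filter("rn = 1")
                   .drop("rn"))

(deduped.write.format("orc")
        .mode("overwrite")
        .partitionBy("year", "month", "day")
        .saveAsTable("refined_tr_state"))
```

Keeping the business rules in this separate layer is what allows the slide's limitation to bite only here: a rule change means rewriting this table, but the Structured Zone beneath it is untouched.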
27. Low Level Design – Consume Zone
Refined Zone → Consume Zone. Creates materialised views for business consumption, physically optimised for system performance.
Description
7. Spark job that creates materialised views physically optimised for the standard in-house entry points of analysis. Replicates the data in the Refined Zone into the Consume Zone, with optimised technical partitions, to allow fast querying and data exploration across different analytical use cases. The process can easily be replicated to accommodate further use cases by creating new partition keys; a sketch follows below.

Table name | Storage format | Partitions | Data sorted by | Use cases
**_*****_****_*****_***_**** | ORC | year, month, day, assetclass | otc_or_etd, c1, c2 | *****, *****, Contractual Continuity, *****
**_*****_****_*****_***_**** | ORC | year, month, day | c1, c2 | *****
**_*****_****_***** | ORC | year, month, assetclass | c1, c2 | Monthly time series

Benefits
• Captures the generic entry points of analysis
• Optimised to accommodate different analytical workloads based on requirements
• Improves query performance due to the physical partitioning of the data
Limitations
• Duplication of data, and the onus of choosing the correct materialised view falls on the user. This could be mitigated by including an OLAP cube, such as Apache Druid.
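Step 7 is essentially the same data rewritten under new physical keys. Below is a sketch of the monthly time series view from the table above, again with hypothetical names; c1 and c2 mirror the slide's masked sort columns.

```python
# A minimal sketch of step 7: rewrite the refined table under partition
# and sort keys chosen for one entry point of analysis (monthly time
# series by asset class). All names are hypothetical; c1/c2 stand in for
# the masked sort columns in the slide.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("consume-zone-sketch")
         .enableHiveSupport()
         .getOrCreate())

refined = spark.table("refined_tr_state")

# Same data, different physical layout: partitioning by (year, month,
# assetclass) lets monthly queries prune partitions; sorting within
# partitions keeps related counterparty rows contiguous in the ORC files.
(refined.sortWithinPartitions("c1", "c2")
        .write.format("orc")
        .mode("overwrite")
        .partitionBy("year", "month", "assetclass")
        .saveAsTable("cz_monthly_timeseries"))
```

Creating a view for a different use case is this same job with different partitionBy and sort keys, which is what the slide means by the process being easily replicated.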
28. EMIR Trade Repositories framework
[Diagram: the end-to-end flow. TR data (zip → csv) and external reference data move from the Landing Zone through the Raw, Structured and Refined Zones into the Consume Zone as ORC tables, with the mapping rules applied in between and Data Governance spanning all zones.]
30. EMIR Project benefits for the wider Data Programme
Designed to set the right path for the Data Programme in four key aspects, aligned with the One Bank value:
• Architecture: set the right technical architecture to serve as a standard for BoE Big Data projects
• Self-service TOM: provide the drivers for a more self-service Target Operating Model
• Data Science skills: pair programming sessions for on-the-job training and coaching
• Data Plausibility: deliver a Data Quality and Plausibility Management solution to be used across the Data Programme
31. How can Data Science skills be attained?
Demonstrate that Data Science knowledge can be upskilled: a senior Data Scientist will deliver on-the-job training and coaching to FMID in order to upskill the existing team. From this, we expect users to gain the autonomy to develop new data analyses and ad hoc data exploration on the existing datasets in the Data Hub.
• Training: provide core skills and an understanding of how to use Big Data tools
• On-the-job coaching: pair programming and advisory work to provide experience of using Big Data tools with R
Questions still open:
1. How will training be delivered to business areas?
2. What skills should be centralised, and what should stay in each business team?
3. Upskill the current team or expand resources?
34. Data Hub 2
• VMware VxRack HCI offering
• EMC's Isilon storage
• 392 cores per site, and circa 10 TB of RAM
• 320 TB of "usable" storage
• Storage: the equivalent of 7,500 standard iPhone Xs (1.32 tonnes of iPhones!)
• Processing: the equivalent cores of 84 standard iPhone Xs
• Memory: the equivalent RAM of 4,608 standard iPhone Xs (a pile of phones 35½ metres high)