This document discusses using Spark to run catastrophe (cat) models more efficiently than MapReduce. It describes how cat models start from moderately sized input datasets but generate far larger intermediate datasets that require complex analytics. Spark is better suited than MapReduce for this work because its executors run tasks as threads sharing a single in-memory cache, providing faster performance at lower cost. The document advocates designing cat model workflows in Spark to take advantage of its flexible architecture and support for high code quality.
3. What is Reinsurance?
• Reinsurance companies provide insurance to insurance companies.
• The most common source of risk covered by reinsurance is natural catastrophes.
• Who insures the reinsurance companies?
– In fact, reinsurance companies cover other reinsurance companies. This is called retrocession.
4. What is it really all about?
Take a primary insurer who covers a portfolio of residential homes:
7. A reinsurance contract
• The primary insurer may buy protection from a reinsurer.
• The reinsurer will cover up to 12 million of loss once the insurer has paid the first 10 million (see the sketch below).
• Not just one single reinsurance company takes on the "excess" risk; typically several share it among themselves.
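A minimal sketch of this excess-of-loss payout in Scala, using the 10 million attachment and 12 million limit from the example above (the function name and signature are illustrative, not from the original):

  // Excess-of-loss layer payout: the reinsurer pays the part of the loss
  // above the attachment point, capped at the layer limit.
  def layerLoss(groundUpLoss: Double,
                attachment: Double = 10e6,
                limit: Double = 12e6): Double =
    math.min(limit, math.max(0.0, groundUpLoss - attachment))

  // Example: a 15 million loss -> the insurer pays the first 10 million,
  // the reinsurer pays the remaining 5 million (under the 12 million limit).
  // layerLoss(15e6) == 5e6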
9. Cat Model data
Exposure data
• Single set: 1 – 500 MB
• Buildings with location
– Occupancy type
– Construction type and age
– More attributes
• Policy information:
– Deductibles, limits
– Other financial terms
Model data
• Single model: 300 GB – 2 TB
• Stochastic catalog: 50k hurricanes
• Footprint for each hurricane
• Vulnerability curves
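As a rough illustration, the exposure and model data described above could be modeled with record types like these; every field name here is an assumption for illustration, not the actual schema:

  // Illustrative record types for cat model inputs (all fields assumed).
  case class Building(id: Long, lat: Double, lon: Double,
                      occupancy: String, construction: String, yearBuilt: Int)
  case class Policy(id: Long, buildingIds: Seq[Long],
                    deductible: Double, limit: Double)
  case class Hurricane(eventId: Long, rate: Double)   // entry in the stochastic catalog
  case class FootprintCell(eventId: Long, lat: Double, lon: Double,
                           windSpeed: Double)         // hazard intensity per location
  case class VulnerabilityPoint(construction: String,
                                windSpeed: Double, damageRatio: Double)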
10. Cat Model analytics
• Evaluate (sketched below):
– deductibles and limits
– more complex financial policy terms
– reinsurance contracts
• Create various custom risk metrics.
• Allow interactive analytics on intermediate data sets.
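As an illustration of the first bullet, here is a minimal sketch of applying a deductible and a limit to ground-up losses with Spark SQL functions; the DataFrame layout, column names, and the 1M deductible / 20M limit are assumptions, not from the original:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder().appName("policy-terms").getOrCreate()
  import spark.implicits._

  // One row per (event, policy) pair with the ground-up loss (sample data).
  val groundUp = Seq((1L, 1L, 15.0e6), (2L, 1L, 3.0e6))
    .toDF("eventId", "policyId", "groundUpLoss")

  // Gross loss after a 1M deductible and a 20M limit (values assumed).
  val gross = groundUp.withColumn(
    "grossLoss",
    least(lit(20e6), greatest(lit(0.0), col("groundUpLoss") - lit(1e6))))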
11. Blue sky version
[Diagram: exposure data (Buildings 1–4 grouped under Policies 1 and 2) crossed with hazard data (Hurricanes 1–3), yielding a sparse cross-product (10 TB+) that feeds the analytics.]
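One way such a sparse cross-product could be formed in Spark is a join of hazard footprints against exposed locations on a shared geographic key; the paths, column names, and the geoKey join are assumptions for illustration:

  // Join hazard footprints to exposed buildings on a common location key,
  // yielding one row per (hurricane, building) pair that actually intersects:
  // sparse, but still far larger than either input.
  val exposure = spark.read.parquet("s3://bucket/exposure")     // buildingId, policyId, geoKey (path assumed)
  val footprints = spark.read.parquet("s3://bucket/footprints") // eventId, geoKey, windSpeed (path assumed)

  val eventExposure = footprints.join(exposure, Seq("geoKey"))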
13. Challenging metrics (across cores)
[Diagram: the policy × hurricane loss matrix from the previous slide, with its rows partitioned across Cores 1–3, so the losses belonging to a single policy are spread over several partitions.]
Metric: Loss of top 100 hurricanes for Policy 1
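A minimal sketch of that metric in Spark, assuming an eventLoss DataFrame with policyId and loss columns (names assumed); the window forces a shuffle that gathers each policy's rows together, which is exactly what makes such metrics challenging when the data is partitioned across cores:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // Rank hurricanes by loss within each policy, keep the top 100, and sum.
  val byPolicy = Window.partitionBy("policyId").orderBy(col("loss").desc)

  val top100Loss = eventLoss
    .withColumn("rank", row_number().over(byPolicy))
    .filter(col("rank") <= 100)
    .groupBy("policyId")
    .agg(sum("loss").as("top100Loss"))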
15. Requirement Overview
• Data for each Cat Model: 100 GB – 4 TB
• Data for each exposure set: 10 MB – 500 MB
• Intermediate results for analysis: 10 GB – 100 TB
• Daily analyses: 70 – 100 initially, growth expected
• Occasional burst batch analyses: 1500 – 2000 (short timelines)
• Generally, analyses are required to finish in less than 30 minutes
• Cost: on average, less than $5 per analysis
• Support for complex algorithms: 100 pages of specs provided by the research team
16. Work With MapReduce
• Overview
– MapReduce on an EMR cluster
– Cost per analysis: $0.5 – $15
– Time per analysis: 5 – 30 mins (cluster of 5 × r3.8xlarge)
• Issues
– Expensive
– Resource cap on EMR Hadoop (16 GB memory limit per mapper/reducer)
– Fragmented workflow: expensive input preparation and post-processing required
– Inefficient at caching and sharing resources
– Inconsistent on-cluster vs. off-cluster execution modes
17. Why Spark?
• Shared memory/cache for all executors on the same instance (see the broadcast sketch below)
– Multiple processes (MapReduce) vs. multiple threads (Spark)
– Multiple cache copies in memory (MapReduce) vs. a single cache shared between threads (Spark)
• On an r3.8xlarge: total memory 244 GB = 32 slots × 7.6 GB per mapper/reducer slot
• With roughly 4 GB of per-process overhead, each mapper can use 7.6 GB – 4 GB = 3.6 GB as working memory for business logic
• For Spark, the overhead is paid once per executor JVM, so each thread has (244 GB – 4 GB) / 32 = 7.5 GB
– Less memory required on hardware, i.e. lower cost
• 30% – 50% faster
• 10 – 30 times higher throughput at the same cost
– Shared memory allows executing multiple analyses at once
– Hardware with lower memory requirements (c3.8xlarge vs. r3.8xlarge)
– Saves 90% of cost
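A hedged sketch of the shared-cache point: a Spark broadcast variable is materialized once per executor JVM and read by all task threads on that instance, whereas each MapReduce mapper process would hold its own copy (loadVulnerabilityCurves and applyDamage are hypothetical helpers, not from the original):

  // One copy of the model data per executor JVM, shared by all task threads.
  val curves = spark.sparkContext.broadcast(loadVulnerabilityCurves()) // hypothetical loader

  val damaged = eventExposure.rdd.map { row =>
    // Every task thread on this instance reads the same in-memory copy.
    applyDamage(row, curves.value) // hypothetical damage function
  }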
18. Better Design
• High Code Quality
– Consistent codebase for on-cloud and off-cloud execution (see the sketch below)
– Unit tests are easier to integrate
– Easier to implement multi-step, complex business workflows
• More Options for Architecture Design
– Easier to integrate with downstream processes
– Flexible input formats
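A minimal sketch of the consistent on/off-cloud codebase point, assuming the pipeline is factored to accept any SparkSession (setup assumed, not the original codebase): a unit test can run the identical job code on a local master.

  import org.apache.spark.sql.SparkSession

  // In a unit test: run the same pipeline code on an in-process cluster.
  val spark = SparkSession.builder()
    .master("local[*]")        // all cores of the local machine
    .appName("cat-model-test")
    .getOrCreate()
  // On EMR, the same code is submitted without .master(), letting
  // spark-submit supply the cluster master.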
19. Conclusion
• Cat models don't start with Big Data, but create Big Data along the way.
• They require complex custom analytics on this data.
• Spark is well suited to address the challenges of businesses that run cat models on a large scale.