This document discusses using Spark to run catastrophe (cat) models more efficiently than MapReduce. It describes how cat models start from moderately sized input datasets but generate far larger intermediate datasets that require complex analytics. Spark is better suited than MapReduce for this work because its executors run tasks as threads sharing a single in-memory cache, providing faster performance at lower cost. The document advocates designing cat model workflows in Spark to take advantage of its flexible architecture and support for high code quality.
3. What is Reinsurance?
• Reinsurance companies provide insurance to insurance companies.
• The most common source of risk covered by reinsurance is natural catastrophes.
• Who insures the reinsurance companies?
– In fact, reinsurance companies cover other reinsurance companies. This is called retrocession.
4. What is it really all about?
Take a primary insurer who covers a portfolio of residential homes:
7. A reinsurance contract
• The primary insurer may buy protection from a reinsurer.
• The reinsurer will cover up to 12 million of loss once the insurer has paid the first 10 million (see the sketch below).
• Not just one single reinsurance company takes on the "excess" risk; typically several share it among themselves.
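A minimal sketch of this excess-of-loss payout in Scala, using the 10 million attachment and 12 million limit from the example above (the function name and signature are illustrative, not from the original):

  // Excess-of-loss layer payout: the reinsurer pays the part of the loss
  // above the attachment point, capped at the layer limit.
  def layerLoss(groundUpLoss: Double,
                attachment: Double = 10e6,
                limit: Double = 12e6): Double =
    math.min(limit, math.max(0.0, groundUpLoss - attachment))

  // Example: a 15 million loss -> the insurer pays the first 10 million,
  // the reinsurer pays the remaining 5 million (under the 12 million limit).
  // layerLoss(15e6) == 5e6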
9. Cat Model data
Exposure data
• Single set: 1 – 500 MB
• Buildings with location
– Occupancy type
– Construction type and age
– More attributes
• Policy information:
– Deductibles, limits
– Other financial terms
Model data
• Single model: 300 GB – 2 TB
• Stochastic catalog: 50k hurricanes
• Footprint for each hurricane
• Vulnerability curves
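As a rough illustration, the exposure and model data described above could be modeled with record types like these; every field name here is an assumption for illustration, not the actual schema:

  // Illustrative record types for cat model inputs (all fields assumed).
  case class Building(id: Long, lat: Double, lon: Double,
                      occupancy: String, construction: String, yearBuilt: Int)
  case class Policy(id: Long, buildingIds: Seq[Long],
                    deductible: Double, limit: Double)
  case class Hurricane(eventId: Long, rate: Double)   // entry in the stochastic catalog
  case class FootprintCell(eventId: Long, lat: Double, lon: Double,
                           windSpeed: Double)         // hazard intensity per location
  case class VulnerabilityPoint(construction: String,
                                windSpeed: Double, damageRatio: Double)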
10. Cat Model analytics
• Evaluate (sketched below):
– deductibles and limits
– more complex financial policy terms
– reinsurance contracts
• Create various custom risk metrics.
• Allow interactive analytics on intermediate data sets.
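As an illustration of the first bullet, here is a minimal sketch of applying a deductible and a limit to ground-up losses with Spark SQL functions; the DataFrame layout, column names, and the 1M deductible / 20M limit are assumptions, not from the original:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder().appName("policy-terms").getOrCreate()
  import spark.implicits._

  // One row per (event, policy) pair with the ground-up loss (sample data).
  val groundUp = Seq((1L, 1L, 15.0e6), (2L, 1L, 3.0e6))
    .toDF("eventId", "policyId", "groundUpLoss")

  // Gross loss after a 1M deductible and a 20M limit (values assumed).
  val gross = groundUp.withColumn(
    "grossLoss",
    least(lit(20e6), greatest(lit(0.0), col("groundUpLoss") - lit(1e6))))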
11. Blue sky version
[Diagram: exposure data (Buildings 1–4 grouped under Policies 1 and 2) crossed with hazard data (Hurricanes 1–3), yielding a sparse cross-product (10 TB+) that feeds the analytics.]
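One way such a sparse cross-product could be formed in Spark is a join of hazard footprints against exposed locations on a shared geographic key; the paths, column names, and the geoKey join are assumptions for illustration:

  // Join hazard footprints to exposed buildings on a common location key,
  // yielding one row per (hurricane, building) pair that actually intersects:
  // sparse, but still far larger than either input.
  val exposure = spark.read.parquet("s3://bucket/exposure")     // buildingId, policyId, geoKey (path assumed)
  val footprints = spark.read.parquet("s3://bucket/footprints") // eventId, geoKey, windSpeed (path assumed)

  val eventExposure = footprints.join(exposure, Seq("geoKey"))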
13. Challenging metrics (across cores)
[Diagram: the policy × hurricane loss matrix from the previous slide, with its rows partitioned across Cores 1–3, so the losses belonging to a single policy are spread over several partitions.]
Metric: Loss of top 100 hurricanes for Policy 1
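A minimal sketch of that metric in Spark, assuming an eventLoss DataFrame with policyId and loss columns (names assumed); the window forces a shuffle that gathers each policy's rows together, which is exactly what makes such metrics challenging when the data is partitioned across cores:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // Rank hurricanes by loss within each policy, keep the top 100, and sum.
  val byPolicy = Window.partitionBy("policyId").orderBy(col("loss").desc)

  val top100Loss = eventLoss
    .withColumn("rank", row_number().over(byPolicy))
    .filter(col("rank") <= 100)
    .groupBy("policyId")
    .agg(sum("loss").as("top100Loss"))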
15. Requirement Overview
• Data for each Cat Model: 100 GB – 4 TB
• Data for each exposure set: 10 MB – 500 MB
• Intermediate results for analysis: 10 GB – 100 TB
• Daily analyses: 70 – 100 initially, growth expected
• Occasional burst batch analyses: 1500 – 2000 (short timelines)
• Generally, analyses are required to finish in less than 30 minutes
• Cost: on average, less than $5 per analysis
• Support for complex algorithms: 100 pages of specs provided by the research team
16. Work With MapReduce
• Overview
– MapReduce on an EMR cluster
– Cost per analysis: $0.5 – $15
– Time per analysis: 5 – 30 mins (cluster of 5 × r3.8xlarge)
• Issues
– Expensive
– Resource cap on EMR Hadoop (16 GB memory limit per mapper/reducer)
– Fragmented workflow: expensive input preparation and post-processing required
– Inefficient at caching and sharing resources
– Inconsistent on-cluster vs. off-cluster execution modes
17. Why Spark?
• Shared memory/cache for all executors on the same instance (see the broadcast sketch below)
– Multiple processes (MapReduce) vs. multiple threads (Spark)
– Multiple cache copies in memory (MapReduce) vs. a single cache shared between threads (Spark)
• On an r3.8xlarge: total memory 244 GB = 32 slots × 7.6 GB per mapper/reducer slot
• With roughly 4 GB of per-process overhead, each mapper can use 7.6 GB – 4 GB = 3.6 GB as working memory for business logic
• For Spark, the overhead is paid once per executor JVM, so each thread has (244 GB – 4 GB) / 32 = 7.5 GB
– Less memory required on hardware, i.e. lower cost
• 30% – 50% faster
• 10 – 30 times higher throughput at the same cost
– Shared memory allows executing multiple analyses at once
– Hardware with lower memory requirements (c3.8xlarge vs. r3.8xlarge)
– Saves 90% of cost
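A hedged sketch of the shared-cache point: a Spark broadcast variable is materialized once per executor JVM and read by all task threads on that instance, whereas each MapReduce mapper process would hold its own copy (loadVulnerabilityCurves and applyDamage are hypothetical helpers, not from the original):

  // One copy of the model data per executor JVM, shared by all task threads.
  val curves = spark.sparkContext.broadcast(loadVulnerabilityCurves()) // hypothetical loader

  val damaged = eventExposure.rdd.map { row =>
    // Every task thread on this instance reads the same in-memory copy.
    applyDamage(row, curves.value) // hypothetical damage function
  }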
18. Better Design
• High Code Quality
– Consistent codebase for on-cloud and off-cloud execution (see the sketch below)
– Unit tests are easier to integrate
– Easier to implement multi-step, complex business workflows
• More Options for Architecture Design
– Easier to integrate with downstream processes
– Flexible input formats
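A minimal sketch of the consistent on/off-cloud codebase point, assuming the pipeline is factored to accept any SparkSession (setup assumed, not the original codebase): a unit test can run the identical job code on a local master.

  import org.apache.spark.sql.SparkSession

  // In a unit test: run the same pipeline code on an in-process cluster.
  val spark = SparkSession.builder()
    .master("local[*]")        // all cores of the local machine
    .appName("cat-model-test")
    .getOrCreate()
  // On EMR, the same code is submitted without .master(), letting
  // spark-submit supply the cluster master.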
19. Conclusion
• Cat models don't start with Big Data, but create Big Data along the way.
• They require complex custom analytics on this data.
• Spark is well suited to address the challenges of businesses that run cat models on a large scale.