2. The World's Leading Catastrophe Risk
Modeling Company
From earthquakes, hurricanes, and floods to terrorism and
infectious diseases, RMS helps financial institutions and
public agencies understand, quantify, and manage risk.
3. So what do we actually do?
● Models
○ We have complex models for various types of risk
■ Fire, floods, earthquakes, etc.
○ Our customers run our models against their portfolios of risk items (e.g.
properties) to understand financial impact
○ The models produce a lot of data
● Interactive Queries
○ Insurance analysts are similar to data scientists
○ Lots of result data to slice and dice and visualize
○ Low latency analytics on relatively large datasets
■ Too much for a conventional SQL database, but not petabyte scale
5. RMS Datastore Stack
Intelligent query parsing, rewriting
and routing.
Cost-based optimizations.
Ability to use different query
engines depending on use case or
size of data set.
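A rough illustration of size-based routing, for readers who want something concrete. This is only a sketch under stated assumptions: the `Engine` variants, the `QueryInfo` fields, and the 4 GiB cutoff are all hypothetical names and values chosen for the example, not a description of the actual RMS router.

```rust
/// Hypothetical engines a router could choose between (illustrative names).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Engine {
    /// In-process engine for small, latency-sensitive queries.
    Native,
    /// Distributed engine for large scans and joins.
    Distributed,
}

/// Assumed summary of a parsed query; a real planner would carry far more.
struct QueryInfo {
    /// Estimated number of bytes the query has to scan.
    estimated_scan_bytes: u64,
    /// Whether the query contains a JOIN.
    has_join: bool,
}

/// Assumed cutoff; a real system would derive this from measurements.
const NATIVE_MAX_BYTES: u64 = 4 * 1024 * 1024 * 1024; // 4 GiB

/// Picks an engine from the estimated cost of the query.
fn route(q: &QueryInfo) -> Engine {
    if q.has_join || q.estimated_scan_bytes > NATIVE_MAX_BYTES {
        Engine::Distributed
    } else {
        Engine::Native
    }
}
```

With these assumed rules, a 1 KiB scan without joins is routed to `Engine::Native`, while any query containing a join is routed to `Engine::Distributed`.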
6. Query Service 1.0
● Native Query Execution
○ Scala code, using Apache Arrow and Parquet libraries
○ Column-based file readers with projection push-down
○ Row-based query execution
○ Apache Arrow for the type system
● Performance
○ Order of magnitude improvements compared to Spark for some use cases
○ Slower than Spark for other use cases (larger data sets, JOINs, etc.)
● SQL Interface
○ Apache Hive for our internal SQL dialect
○ Apache Hive protocol for compatibility with ODBC/JDBC drivers
○ REST API for integration with microservices
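To make "column-based file readers with projection push-down" feeding "row-based query execution" more concrete, here is a toy sketch. The original service was written in Scala against the Arrow and Parquet libraries; this version uses plain Rust standard-library types instead, and `ColumnarFile`, `read_rows`, and `sample_file` are hypothetical names invented for the illustration. It assumes all columns have the same length.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a columnar file: each column is stored separately.
struct ColumnarFile {
    columns: BTreeMap<String, Vec<i64>>,
}

impl ColumnarFile {
    /// Reads only the projected columns and returns them as rows.
    /// Returns None if a requested column does not exist.
    fn read_rows(&self, projection: &[&str]) -> Option<Vec<Vec<i64>>> {
        // Projection push-down: only the requested columns are touched.
        let selected: Vec<&Vec<i64>> = projection
            .iter()
            .map(|name| self.columns.get(*name))
            .collect::<Option<Vec<_>>>()?;
        let num_rows = selected.first().map_or(0, |c| c.len());
        // Row-based hand-off: pivot the selected columns into rows.
        Some(
            (0..num_rows)
                .map(|i| selected.iter().map(|col| col[i]).collect())
                .collect(),
        )
    }
}

/// Builds a three-column example file (made-up values).
fn sample_file() -> ColumnarFile {
    let mut columns = BTreeMap::new();
    columns.insert("id".to_string(), vec![1, 2, 3]);
    columns.insert("tiv".to_string(), vec![500, 900, 300]);
    columns.insert("loss".to_string(), vec![100, 250, 75]);
    ColumnarFile { columns }
}
```

Calling `sample_file().read_rows(&["id", "loss"])` never reads the `tiv` column, which is the point of pushing the projection down into the reader, especially for wide tables.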
7. Query Service Conclusions & Next Steps
● The Query Service was successful
○ Reduced TCO (fewer Spark nodes required)
○ Improved performance for interactive queries
● In my spare time I had been working on an open source project called
DataFusion
○ DataFusion started out as a generic Rust query engine
○ I felt that Rust was much better suited to this than the JVM
○ I learned a lot more about Apache Arrow and the benefits of columnar
processing
● So how could we leverage this at RMS?
○ I donated the initial Rust implementation of Apache Arrow and later donated
DataFusion as well
11. Apache Arrow
● Standardized language-independent columnar memory format
○ for flat and hierarchical data
○ organized for efficient analytic operations on modern hardware
■ Vectorized processing, SIMD, GPU
● Implementations available for many programming languages
○ C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
● Zero-copy interprocess communication
○ IPC metadata defined in flatbuffer format
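A simplified picture of what a columnar memory format looks like for one nullable integer column: a contiguous values buffer plus a validity bitmap with one bit per slot, least-significant bit first. This is a standard-library sketch in the spirit of the Arrow layout, not the Arrow crate's API; `Int32Column` and its methods are hypothetical names, and real Arrow buffers also deal with alignment, padding, and offsets that are ignored here.

```rust
/// Simplified nullable 32-bit integer column: one contiguous values buffer
/// plus a validity bitmap (1 bit per slot, least-significant bit first).
struct Int32Column {
    values: Vec<i32>,
    validity: Vec<u8>,
}

impl Int32Column {
    /// Builds the two buffers from a slice of optional values.
    fn from_options(data: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(data.len());
        let mut validity = vec![0u8; (data.len() + 7) / 8];
        for (i, item) in data.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    validity[i / 8] |= 1 << (i % 8);
                }
                // Null slot: store a placeholder; the bitmap masks it out.
                None => values.push(0),
            }
        }
        Int32Column { values, validity }
    }

    /// Returns true if slot `i` holds a non-null value.
    fn is_valid(&self, i: usize) -> bool {
        (self.validity[i / 8] & (1 << (i % 8))) != 0
    }

    /// Sums the non-null values in one pass over the contiguous buffer.
    fn sum(&self) -> i64 {
        self.values
            .iter()
            .enumerate()
            .filter(|(i, _)| self.is_valid(*i))
            .map(|(_, v)| *v as i64)
            .sum()
    }
}
```

Because the values sit next to each other in memory, a loop like `sum` is the kind of operation that compilers and hardware can vectorize, which is harder when each row is a separate object.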
12. Apache Arrow
● Computational libraries
○ C++ libraries that leverage LLVM (donated by Dremio)
○ NVIDIA CUDA support
● Query Engines
○ Ursa Labs initiative
■ C++ query engine
○ DataFusion
■ Rust query engine
13. Apache Arrow
● 3 years as a top-level project
● Project Management Committee (PMC) members work for ...
○ Cloudera, Databricks, DataStax, Dremio, Hortonworks, Looker, MapR, RMS,
RStudio, Salesforce, Twitter, UC Berkeley RISELab, Ursa Labs, WeWork,
Workday
● Committers work for ...
○ Amazon, CERN, Google, IBM
● Also many individual contributors
● Companies providing financial support (via Ursa Labs)
○ NVIDIA, ODSC, RStudio, Two Sigma
17. Why Rust
● See https://www.rust-lang.org/ for detailed information
● My take
○ Speed of C++ with the safety of Java
○ Memory efficient (no GC)
○ Predictable performance
○ Lower TCO
○ Forces you to think about what you are doing
■ Thread safety has to be explicit
■ Memory management has to be explicit
○ The compiler acts as a peer reviewer … tough but fair
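As a small illustration of "thread safety has to be explicit", the sketch below sums several partitions on separate threads. It uses only the standard library; `parallel_sum` is a hypothetical helper written for this example. Shared mutable state has to be wrapped in `Arc` and `Mutex` and moved into each thread, and leaving either wrapper out is rejected at compile time rather than failing at run time.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Sums each partition on its own thread and accumulates a shared total.
fn parallel_sum(partitions: Vec<Vec<i64>>) -> i64 {
    // Shared mutable state must be wrapped explicitly: Arc for shared
    // ownership across threads, Mutex for synchronized mutation.
    let total = Arc::new(Mutex::new(0i64));
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|part| {
            let total = Arc::clone(&total);
            // `move` transfers ownership of `part` and the Arc clone
            // into the spawned thread.
            thread::spawn(move || {
                let local: i64 = part.iter().sum();
                *total.lock().unwrap() += local;
            })
        })
        .collect();
    // Wait for every worker before reading the result.
    for handle in handles {
        handle.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}
```

Each thread sums its own partition locally and takes the lock only once, so contention stays low even with many partitions.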
23. Benchmark Results
EC2 c5.18xlarge instance
72 vCPUs
144 GB RAM
SSD (100 IOPS / 3000 burst)
Data set:
5 million risk items
Wide table (~600 columns)
~16 GB on disk
[Benchmark chart: higher is better]
24. DataFusion Roadmap
Apache Arrow is a “do-ocracy” where the individual contributors get to decide the
roadmap, but here are some things that I am planning to work on:
● DataFrame-style API for building logical query plans, as an alternative to SQL
● Parallel query execution (threads, partitions)
● Support for more data sources (Parquet, JSON)
● More complete SQL support (joins, subqueries, columnar UDFs)
● Distributed execution
○ Distributed query planner & optimizer
○ Kubernetes & Docker deployment model
○ Apache Arrow Flight protocol for streaming data between nodes
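To give a feel for what a DataFrame-style API for building logical query plans could look like, here is a minimal sketch. It is not DataFusion's actual API: `LogicalPlan`, `DataFrame`, `scan`, `filter`, `select`, and `explain` are hypothetical names, and predicates are kept as plain strings to keep the example short.

```rust
/// Hypothetical logical plan nodes; not DataFusion's actual types.
#[derive(Debug, PartialEq)]
enum LogicalPlan {
    Scan { table: String },
    Filter { predicate: String, input: Box<LogicalPlan> },
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
}

/// Hypothetical fluent builder that wraps the plan under construction.
struct DataFrame {
    plan: LogicalPlan,
}

impl DataFrame {
    /// Starts a plan that reads from the named table.
    fn scan(table: &str) -> Self {
        DataFrame { plan: LogicalPlan::Scan { table: table.to_string() } }
    }

    /// Wraps the current plan in a filter node.
    fn filter(self, predicate: &str) -> Self {
        DataFrame {
            plan: LogicalPlan::Filter {
                predicate: predicate.to_string(),
                input: Box::new(self.plan),
            },
        }
    }

    /// Wraps the current plan in a projection node.
    fn select(self, columns: &[&str]) -> Self {
        DataFrame {
            plan: LogicalPlan::Projection {
                columns: columns.iter().map(|c| (*c).to_string()).collect(),
                input: Box::new(self.plan),
            },
        }
    }

    /// Renders the plan from the outermost node down to the scan.
    fn explain(&self) -> String {
        render(&self.plan)
    }
}

/// Recursively formats a plan as a single line of text.
fn render(plan: &LogicalPlan) -> String {
    match plan {
        LogicalPlan::Scan { table } => format!("Scan({})", table),
        LogicalPlan::Filter { predicate, input } => {
            format!("Filter({}) <- {}", predicate, render(input))
        }
        LogicalPlan::Projection { columns, input } => {
            format!("Projection({}) <- {}", columns.join(", "), render(input))
        }
    }
}
```

A chain such as `DataFrame::scan("risk_items").filter("state = 'CA'").select(&["id", "loss"])` builds the same kind of plan tree a SQL parser would produce, without going through SQL text. The table and column names here are made up for the example.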
25. Want to contribute?
● Great time to get involved!
○ The code base is still relatively small
■ Core Arrow library is 6k LOC
■ DataFusion is 4k LOC
○ Small number of regular contributors
○ Where to start?
■ https://cwiki.apache.org/confluence/display/ARROW/Rust+JIRA+Dashboard
○ Try adding DataFusion as a crate dependency