There has been growing interest in harnessing the parallelism of Graphics Processing Units (GPUs) to accelerate analytics workloads. GPUs have become the standard platform for many machine learning algorithms, particularly in the field of deep neural networks (DNNs), while making increasing inroads into more traditional domains such as analytics databases and visual analytics. However, there is a strong need to couple these new platforms with Apache Spark, which has emerged as the de facto analytics platform for data scientists. In this talk we discuss how we built a connector from Spark to the open source GPU-powered MapD Analytics Platform, and the use cases such a connector enables, such as pulling high-value data from Spark and caching it on the GPU for subsequent interactive visual analysis and machine learning. We will conclude with a brief demo of an end-to-end Spark-to-MapD pipeline.
2. A compute inflection point
Data growth: ~40% per year. CPU processing power: ~20% per year.
3. GPUs offer a way forward
GPU processing power grows ~50% per year, outpacing both data growth (~40% per year) and CPU processing power (~20% per year).
4. GPUs outperform CPUs in data-critical areas
[Charts, 2007-2016: ability to read data — memory bandwidth (GB/sec), GPU vs. CPU; compute power — floating-point operations/sec (teraflops), GPU vs. CPU]
5. MapD: software optimized for the fastest hardware
MapD Core: an in-memory, relational, column-store database powered by GPUs — 100x faster queries.
MapD Immerse: a visual analytics engine that leverages the speed + rendering capabilities of MapD Core — speed-of-thought visualization.
6. Where MapD sits
[Diagram: streaming data (Kafka) and JDBC/Hadoop sources feed GPU-accelerated MapD Core; MapD Immerse sits on top, with Tableau or 3rd-party viz and non-viz output connecting via JDBC, ODBC, and Thrift]
7. MapD Core
The world's fastest in-memory GPU database powers the world's most immersive data exploration experience.
8. Performance starts with memory management
Storage layer — cold data: data lake / data warehouse / SOR on SSD or NVRAM storage (L3); 250 GB to 20 TB at 1-2 GB/sec.
Compute layer — warm data: CPU RAM (L2); 32 GB to 3 TB at 70-120 GB/sec; speedup of 35x to 120x over cold data.
Compute layer — hot data: GPU RAM (L1); 24 GB to 384 GB at 3,000-5,000 GB/sec; speedup of 1,500x to 5,000x over cold data.
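The speedup ranges on this slide follow directly from the tier bandwidths. A back-of-the-envelope check (assuming speedup scales with the bandwidth ratio between tiers):

```python
# Bandwidths in GB/sec, as quoted on the slide.
def speedup_range(bw_lo, bw_hi, cold=(1, 2)):
    """Speedup vs. the cold tier: conservative = tier-low / cold-high,
    optimistic = tier-high / cold-low."""
    return bw_lo // cold[1], bw_hi // cold[0]

# Warm data (CPU RAM, 70-120 GB/sec) vs. cold data (SSD/NVRAM, 1-2 GB/sec):
assert speedup_range(70, 120) == (35, 120)
# Hot data (GPU RAM, 3,000-5,000 GB/sec) vs. cold data:
assert speedup_range(3000, 5000) == (1500, 5000)
```

Both asserted ranges match the 35x-120x (warm) and 1,500x-5,000x (hot) figures above.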
9. Query compilation with LLVM
Traditional DBs can be highly inefficient:
• Each operator in a SQL query is treated as a separate function
• This incurs tremendous overhead and prevents vectorization
MapD compiles queries with LLVM to create one custom function:
• Queries run at speeds approaching hand-written functions
• LLVM enables generic targeting of different architectures (GPUs, x86, ARM, etc.)
• Code can be generated to run a query on CPU and GPU simultaneously
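A sketch of what this buys, in plain Python rather than LLVM IR (the query and functions here are illustrative, not MapD's actual generated code). Operator-at-a-time execution materializes an intermediate result per operator; compilation fuses the whole query into one tight loop:

```python
# Illustrates the difference for a query like:
#   SELECT SUM(x * 2) FROM t WHERE x > 10
data = list(range(100))

def operator_at_a_time(rows):
    # Each SQL operator is a separate function/pass with its own
    # intermediate result — per-operator interpretation overhead.
    filtered = [x for x in rows if x > 10]   # WHERE
    projected = [x * 2 for x in filtered]    # projection
    return sum(projected)                    # aggregation

def fused(rows):
    # What query compilation produces: one specialized function fusing
    # filter, projection, and aggregation, with no intermediates.
    acc = 0
    for x in rows:
        if x > 10:
            acc += x * 2
    return acc

assert operator_at_a_time(data) == fused(data) == 9790
```

The fused form is what LLVM can then specialize per architecture (GPU, x86, ARM) and vectorize.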
10. These innovations drive exceptional speed + scale
Noted DB blogger Mark Litwintschik has benchmarked MapD against major CPU systems and found it to be between 74x and 3,500x faster than CPU-based databases.
11. The GPU Open Analytics Initiative (GOAI) and the GPU Data Frame (GDF)
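The point of the GDF is that GPU-resident results can move between processes (e.g. from MapD to an ML framework) by sharing device memory, typically via CUDA IPC handles, instead of copying through the CPU. A CPU-side sketch of the same zero-copy idea using Python's `memoryview` (a stand-in for the shared GPU buffer, not the GDF API itself):

```python
# Producer's column buffer (stand-in for a GPU-resident column).
buf = bytearray(range(8))

# "Handing off" a column is passing a view over the same memory — no copy.
col = memoryview(buf)[2:6]

col[0] = 99          # the consumer writes through the view...
assert buf[2] == 99  # ...and the producer's buffer reflects it immediately
```

Because nothing is serialized or copied, the handoff cost is independent of data size — the property that makes GDF interchange between GPU tools cheap.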
13. MapD Immerse: our hybrid approach
Basic charts are frontend-rendered using D3 and other related toolkits.
Scatterplots, pointmaps + polygons are backend-rendered using the Iris Rendering Engine on GPUs.
Geo-viz is composited over a frontend-rendered basemap.
14. The X-Factor: server-side rendering
Backend query-to-render: the frontend sends SQL + a Vega spec; data goes from the compute (CUDA) to the graphics (OpenGL) pipeline without a copy, and comes back as a compressed PNG (~100 KB) rather than raw data (>1 GB).
Vega spec (a visualization grammar):
• A declarative JSON format for creating visualization designs
• Used to describe backend visualizations
• Defines attributes of render primitives, which can be driven by data columns and mapped by scales
Shader compilation framework:
• Templatized: supports multiple types (ints, floats, colors, etc.) and multiple continuities (discrete, continuous)
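A minimal, illustrative Vega-style spec for a backend-rendered pointmap, showing render-primitive attributes driven by data columns and mapped by scales. This is a simplified sketch, not MapD's exact Vega dialect; the table, column, and scale names are made up:

```python
import json

spec = {
    "width": 800,
    "height": 600,
    # Data is declared as a SQL query rather than an inline array.
    "data": [{"name": "tweets", "sql": "SELECT lon, lat, lang FROM tweets"}],
    # Scales map column values into pixel positions and colors.
    "scales": [
        {"name": "x", "type": "linear", "domain": [-180, 180], "range": "width"},
        {"name": "y", "type": "linear", "domain": [-90, 90], "range": "height"},
        {"name": "color", "type": "ordinal",
         "domain": ["en", "es"], "range": ["blue", "orange"]},
    ],
    # Render primitives whose attributes are driven by data columns.
    "marks": [{
        "type": "points",
        "from": {"data": "tweets"},
        "properties": {
            "x": {"scale": "x", "field": "lon"},
            "y": {"scale": "y", "field": "lat"},
            "fillColor": {"scale": "color", "field": "lang"},
        },
    }],
}

print(json.dumps(spec, indent=2))
```

The backend evaluates the SQL, drives the point shader from the `lon`, `lat`, and `lang` columns through the declared scales, and returns only the rendered PNG.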
19. Delivering significant customer value across industries
• Polling smartphones on demand to assess network health; running complex queries to analyze wireless QoS data.
• EOG uses MapD to query and visualize well data. Previous queries were non-interactive and took hours on Oracle; other BI and GIS systems could not scale to their data sizes.
20. Closing thoughts
We are at an inflection point in compute, and GPUs are set to become mainstream in analytics.
21. Closing thoughts
GPUs allow users to scale up before needing to scale out, making GPU-accelerated analytics an excellent high-performance cache on a larger data lake or Spark deployment.
24. Cores and memory define the difference
Traditional CPU processing vs. GPU processing: a CPU server offers on the order of 24 cores, while a GPU system offers on the order of 40k cores (it would take a bit more than 24 slides of CPU diagrams to show 40k cores).