There has been growing interest in harnessing the parallelism of Graphics Processing Units (GPUs) to accelerate analytics workloads. GPUs have become the standard platform for many machine learning algorithms, particularly in the field of deep neural networks (DNNs), while making increasing inroads into more traditional domains such as analytics databases and visual analytics. However, there is a strong need to couple these new platforms with Apache Spark, which has emerged as the de facto analytics platform for data scientists. In this talk we discuss how we built a connector from Spark to the open source GPU-powered MapD Analytics Platform, and the use cases such a connector enables, such as pulling high-value data from Spark and caching it on the GPU for subsequent interactive visual analysis and machine learning. We will conclude with a brief demo of an end-to-end Spark-to-MapD pipeline.
2. A compute inflection point
Data growth: ~40% per year. CPU processing power: ~20% per year.
3. GPUs offer a way forward
GPU processing power grows ~50% per year, outpacing both data growth (~40% per year) and CPU processing power (~20% per year).
4. GPUs outperform CPUs in data-critical areas
[Charts, 2007-2016: ability to read data — memory bandwidth (GB/sec), GPU vs. CPU; compute power — floating-point operations/sec (teraflops), GPU vs. CPU]
5. MapD: software optimized for the fastest hardware
MapD Core: an in-memory, relational, column-store database powered by GPUs — 100x faster queries.
MapD Immerse: a visual analytics engine that leverages the speed + rendering capabilities of MapD Core — speed-of-thought visualization.
6. Where MapD sits
[Diagram: streaming data (Kafka) and JDBC/Hadoop sources feed GPU-accelerated MapD Core; MapD Immerse sits on top, with Tableau or 3rd-party viz and non-viz output connecting via JDBC, ODBC, and Thrift]
7. MapD Core
The world's fastest in-memory GPU database powers the world's most immersive data exploration experience.
8. Performance starts with memory management
Storage layer — cold data: data lake / data warehouse / SOR on SSD or NVRAM storage (L3); 250 GB to 20 TB at 1-2 GB/sec.
Compute layer — warm data: CPU RAM (L2); 32 GB to 3 TB at 70-120 GB/sec; speedup of 35x to 120x over cold data.
Compute layer — hot data: GPU RAM (L1); 24 GB to 384 GB at 3,000-5,000 GB/sec; speedup of 1,500x to 5,000x over cold data.
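The speedup ranges on this slide follow directly from the tier bandwidths. A back-of-the-envelope check (assuming speedup scales with the bandwidth ratio between tiers):

```python
# Bandwidths in GB/sec, as quoted on the slide.
def speedup_range(bw_lo, bw_hi, cold=(1, 2)):
    """Speedup vs. the cold tier: conservative = tier-low / cold-high,
    optimistic = tier-high / cold-low."""
    return bw_lo // cold[1], bw_hi // cold[0]

# Warm data (CPU RAM, 70-120 GB/sec) vs. cold data (SSD/NVRAM, 1-2 GB/sec):
assert speedup_range(70, 120) == (35, 120)
# Hot data (GPU RAM, 3,000-5,000 GB/sec) vs. cold data:
assert speedup_range(3000, 5000) == (1500, 5000)
```

Both asserted ranges match the 35x-120x (warm) and 1,500x-5,000x (hot) figures above.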
9. Query compilation with LLVM
Traditional DBs can be highly inefficient:
• Each operator in a SQL query is treated as a separate function
• This incurs tremendous overhead and prevents vectorization
MapD compiles queries with LLVM to create one custom function:
• Queries run at speeds approaching hand-written functions
• LLVM enables generic targeting of different architectures (GPUs, x86, ARM, etc.)
• Code can be generated to run a query on CPU and GPU simultaneously
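A sketch of what this buys, in plain Python rather than LLVM IR (the query and functions here are illustrative, not MapD's actual generated code). Operator-at-a-time execution materializes an intermediate result per operator; compilation fuses the whole query into one tight loop:

```python
# Illustrates the difference for a query like:
#   SELECT SUM(x * 2) FROM t WHERE x > 10
data = list(range(100))

def operator_at_a_time(rows):
    # Each SQL operator is a separate function/pass with its own
    # intermediate result — per-operator interpretation overhead.
    filtered = [x for x in rows if x > 10]   # WHERE
    projected = [x * 2 for x in filtered]    # projection
    return sum(projected)                    # aggregation

def fused(rows):
    # What query compilation produces: one specialized function fusing
    # filter, projection, and aggregation, with no intermediates.
    acc = 0
    for x in rows:
        if x > 10:
            acc += x * 2
    return acc

assert operator_at_a_time(data) == fused(data) == 9790
```

The fused form is what LLVM can then specialize per architecture (GPU, x86, ARM) and vectorize.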
10. These innovations drive exceptional speed + scale
Noted DB blogger Mark Litwintschik has benchmarked MapD against major CPU systems and found it to be between 74x and 3,500x faster than CPU-based databases.
11. The GPU Open Analytics Initiative (GOAI) and the GPU Data Frame (GDF)
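The point of the GDF is that GPU-resident results can move between processes (e.g. from MapD to an ML framework) by sharing device memory, typically via CUDA IPC handles, instead of copying through the CPU. A CPU-side sketch of the same zero-copy idea using Python's `memoryview` (a stand-in for the shared GPU buffer, not the GDF API itself):

```python
# Producer's column buffer (stand-in for a GPU-resident column).
buf = bytearray(range(8))

# "Handing off" a column is passing a view over the same memory — no copy.
col = memoryview(buf)[2:6]

col[0] = 99          # the consumer writes through the view...
assert buf[2] == 99  # ...and the producer's buffer reflects it immediately
```

Because nothing is serialized or copied, the handoff cost is independent of data size — the property that makes GDF interchange between GPU tools cheap.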
13. MapD Immerse: our hybrid approach
Basic charts are frontend-rendered using D3 and other related toolkits.
Scatterplots, pointmaps + polygons are backend-rendered using the Iris Rendering Engine on GPUs.
Geo-viz is composited over a frontend-rendered basemap.
14. The X-Factor: server-side rendering
Backend query-to-render: the frontend sends SQL + a Vega spec; data goes from the compute (CUDA) to the graphics (OpenGL) pipeline without a copy, and comes back as a compressed PNG (~100 KB) rather than raw data (>1 GB).
Vega spec (a visualization grammar):
• A declarative JSON format for creating visualization designs
• Used to describe backend visualizations
• Defines attributes of render primitives, which can be driven by data columns and mapped by scales
Shader compilation framework:
• Templatized: supports multiple types (ints, floats, colors, etc.) and multiple continuities (discrete, continuous)
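A minimal, illustrative Vega-style spec for a backend-rendered pointmap, showing render-primitive attributes driven by data columns and mapped by scales. This is a simplified sketch, not MapD's exact Vega dialect; the table, column, and scale names are made up:

```python
import json

spec = {
    "width": 800,
    "height": 600,
    # Data is declared as a SQL query rather than an inline array.
    "data": [{"name": "tweets", "sql": "SELECT lon, lat, lang FROM tweets"}],
    # Scales map column values into pixel positions and colors.
    "scales": [
        {"name": "x", "type": "linear", "domain": [-180, 180], "range": "width"},
        {"name": "y", "type": "linear", "domain": [-90, 90], "range": "height"},
        {"name": "color", "type": "ordinal",
         "domain": ["en", "es"], "range": ["blue", "orange"]},
    ],
    # Render primitives whose attributes are driven by data columns.
    "marks": [{
        "type": "points",
        "from": {"data": "tweets"},
        "properties": {
            "x": {"scale": "x", "field": "lon"},
            "y": {"scale": "y", "field": "lat"},
            "fillColor": {"scale": "color", "field": "lang"},
        },
    }],
}

print(json.dumps(spec, indent=2))
```

The backend evaluates the SQL, drives the point shader from the `lon`, `lat`, and `lang` columns through the declared scales, and returns only the rendered PNG.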
19. Delivering significant customer value across industries
• Polling smartphones on demand to assess network health; running complex queries to analyze wireless QoS data.
• EOG uses MapD to query and visualize well data. Previous queries were non-interactive and took hours on Oracle; other BI and GIS systems could not scale to their data sizes.
20. Closing thoughts
We are at an inflection point in compute, and GPUs are set to become mainstream in analytics.
21. Closing thoughts
GPUs allow users to scale up before needing to scale out, making GPU-accelerated analytics an excellent high-performance cache on a larger data lake or Spark deployment.
24. Cores and memory define the difference
Traditional CPU processing vs. GPU processing: a CPU server offers on the order of 24 cores, while a GPU system offers on the order of 40k cores (it would take a bit more than 24 slides of CPU diagrams to show 40k cores).