GPU-ACCELERATING UDFS IN
PYSPARK WITH NUMBA AND PYGDF
Joshua Patterson @datametrician
Keith Kraus @keithjkraus
2
THE DATA STRUGGLE IS REAL…
3
DATA DELUGE TO INSIGHT HUNGRY
INCREASING DATA VARIETY
[Figure: data variety increasing with data volume (gigabytes, terabytes, petabytes, exabytes, zettabytes) across stages labeled BUSINESS PROCESS, WEB, DIGITAL, and AI. Data types shown: Search Marketing, Behavioral Targeting, Dynamic Funnels, User Generated Content, Mobile Web, SMS/MMS, Sentiment, HD Video, Speech To Text, Product/Service Logs, Social Network, Business Data Feeds, User Click Stream, Sensors, Infotainment Systems, Wearable Devices, Cyber Security Logs, Connected Vehicles, Machine Data, IoT Data, Dynamic Pricing, Payment Record, Purchase Detail, Purchase Record, Support Contacts, Segmentation, Offer Details, Web Logs, Offer History, A/B Testing, Streaming Video, Natural Language Processing]
4
DATA FORMATS
Avro, XML, JSON, GML, ProtoBuf, HDFS, Pickle, CSV, Parquet, Pandas, NumPy, CSR, COO, CSC
Plain Text vs Binary
Compressed vs Uncompressed
* Not a complete list
5
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
    HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
6
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
    HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing:
    HDFS Read → Query → ETL → ML Train
    25-100x Improvement, Less code, Language flexible, Primarily In-Memory
7
APACHE SPARK
Cluster computing framework
• Spark has almost become synonymous with Hadoop and Big Data
• Integrates with nearly the entire Big Data ecosystem
• The processing layer for big data and leading ML framework
• Five main components: RDD API, SQL, Streaming, MLlib, and GraphX
8
SPARK IS NOT ENOUGH
Basic workloads are bottlenecked by the CPU
Source: Mark Litwintschik’s blog: 1.1 Billion Taxi Rides: EC2 versus EMR
• In a simple benchmark consisting of aggregating data, the CPU is the bottleneck
• This is after the data is parsed and cached into memory, which is another common bottleneck
• The CPU bottleneck is even worse in more complex workloads!
SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type;
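To make the shape of the benchmark query concrete, here is a minimal sketch that runs the same aggregation against a tiny in-memory table. SQLite from the Python standard library stands in for Spark SQL, the three rows are made up, and the `ORDER BY` is added only to get a stable result; none of this says anything about performance.

```python
import sqlite3

# Tiny in-memory stand-in for the trips_orc table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips_orc (cab_type TEXT)")
conn.executemany(
    "INSERT INTO trips_orc (cab_type) VALUES (?)",
    [("yellow",), ("green",), ("yellow",)],
)

# Same aggregation as the benchmark query, with ORDER BY added for a stable result.
counts = conn.execute(
    "SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type ORDER BY cab_type"
).fetchall()
conn.close()

print(counts)  # [('green', 1), ('yellow', 2)]
```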
9
SPARK ECOSYSTEM
Lacks Full GPU Integration
• 4 core parts: SQL, Streaming (Spark functions micro-batched), Machine Learning, and Graph
• Spark is currently optimizing its existing code base and adding more usability, not GPU support yet
10
SPARK ECOSYSTEM
GPU-Acceleration Possible But Not Ideal
• Using Numba, the Microsoft Azure team released a basic example showing a ~5x speedup using GPUs with Spark
• This example is extremely limited in that they’re not passing any real data to the Python process or the GPU
• Passing data from Spark to the GPU raises new issues and performance considerations
Source: https://github.com/Azure/aztk/blob/master/node_scripts/jupyter-samples/GPU%2Bvs%2BCPU%2Busing%2BNumba.ipynb
11
GPUS FTW!
12
GPUS ARE FAST
1.1 Billion Taxi Ride Benchmark
TIME (MS)          QUERY 1  QUERY 2  QUERY 3  QUERY 4
MapD DGX-1              21       80      150      372
MapD 4 x P100           30       99      269      696
Redshift 6-node       1560     1250     2250     2970
Spark 11-node        10190     8134    19624    85942
Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs: Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS @marklit82
13
GPUS ARE FAST
K-Means Benchmark
10 with latest solver
14
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
    HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing:
    HDFS Read → Query → ETL → ML Train
    25-100x Improvement, Less code, Language flexible, Primarily In-Memory
GPU/Spark In-Memory Processing:
    HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
    5-10x Improvement, More code, Language rigid, Substantially on GPU
15
GPU ACCELERATED TECHNOLOGIES
GRAPH
PROCESSING
ANALYTICS
GPU DATABASES
16
GPU-ACCELERATED ARCHITECTURE THEN
Too much data movement and too many different data formats
[Diagram: APP A and APP B each read data on the CPU and load it onto the GPU, with repeated Copy & Convert steps between CPU, GPU, and the two applications' GPU data. Projects shown: H2O.ai, Graphistry, BlazingDB, MapD, Simantex, Anaconda, Gunrock, nvGRAPH]
18
APACHE ARROW COMMON DATA LAYER
From Apache Arrow Home Page - https://arrow.apache.org/
19
GPU-ACCELERATED ARCHITECTURE NOW
Single data format and shared access to data on GPU
[Diagram: data is read on the CPU and loaded into GPU memory as a GPU Data Frame, powered by Apache Arrow, that the applications share. Projects shown: BlazingDB, MapD, Simantex, H2O.ai, Graphistry, Anaconda, Gunrock, nvGRAPH]
20
GPU DATA FRAME
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
    HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing:
    HDFS Read → Query → ETL → ML Train
    25-100x Improvement, Less code, Language flexible, Primarily In-Memory
GPU/Spark In-Memory Processing:
    HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
    5-10x Improvement, More code, Language rigid, Substantially on GPU
End to End GPU Processing (GOAI):
    Arrow Read → Query → ETL → ML Train
    25-100x Improvement, Same code, Language flexible, Primarily on GPU
21
GPU OPEN ANALYTICS INITIATIVE
First Project, the GPU Data Frame
No Copy & Converts - Full Interoperability
• GPU Data Frame is the first project of GOAI
• Apache Arrow for GPU
• libgdf: A C library of helper functions, including:
    • Copying the GDF metadata block to the host and parsing it to a host-side struct.
    • Importing/exporting a GDF using the CUDA IPC mechanism.
    • CUDA kernels to perform element-wise math operations on GDF columns.
    • CUDA sort, join, and reduction operations on GDFs.
• pygdf: A Python library for manipulating GDFs
    • Python interface to libgdf library with additional functionality
    • Creating GDFs from NumPy arrays and Pandas DataFrames
    • JIT compilation of group by and filter kernels using Numba
• dask_gdf: Extension for Dask to work with distributed GDFs.
    • Same operations as pygdf, but working on GDFs chunked onto different GPUs and different servers.
    • Will bring the same Kubernetes support that Dask already has.
github.com/gpuopenanalytics
[Diagram: H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, and Simantex around the GPU Data Frame, powered by Apache Arrow]
22
GOAI ECOSYSTEM
GRAPH
PROCESSING
ANALYTICS
GPU DATABASES
Apache Arrow
Powered by:
23
GPU ACCELERATION ACROSS THE ECOSYSTEM
[Diagram: H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, and Simantex around the GPU Data Frame, powered by Apache Arrow]
24
DATA PROCESSING EVOLUTION
Faster Data Access, Less Data Movement
Hadoop Processing, Reading from disk:
    HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
Spark In-Memory Processing:
    HDFS Read → Query → ETL → ML Train
    25-100x Improvement, Less code, Language flexible, Primarily In-Memory
GPU/Spark In-Memory Processing:
    HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
    5-10x Improvement, More code, Language rigid, Substantially on GPU
End to End GPU Processing (GOAI):
    Arrow Read → Query → ETL → ML Train
    25-100x Improvement, Same code, Language flexible, Primarily on GPU
25
PYTHON GPU DATAFRAME
26
PYGDF @gpuoai
Python GPU DataFrame library
27
PYGDF @gpuoai
Pandas ↔ PyGDF
28
PYGDF @gpuoai
Built-In Functions
29
APACHE SPARK
30
APACHE SPARK
Cluster computing framework
[Diagram: Local side with Local Code, Spark Context, and a JVM; Cluster side with JVMs]
31
PYSPARK
Python API for Spark
32
PYSPARK
No cluster execution in Python if using Spark built-ins
[Diagram: Local side with Local Code, Spark Context, and a JVM; Cluster side with JVMs]
33
PYSPARK UDFS
When Spark built-ins can’t get the job done alone
• User defined functions (UDFs) allow for creating column-based functions outside of the scope of Spark built-in functions
• UDFs can be defined in Scala/Java or Python and be called from PySpark
• Using Python lambdas in map functions is essentially the same as using a Python UDF
34
PYSPARK PYTHON UDFS
Python UDFs in PySpark need Python workers and data movement
[Diagram: Local side with Local Code, Spark Context, and a JVM; Cluster side with JVMs]
35
PYSPARK PYTHON UDFS
Moving data from the JVM to Python efficiently is hard
[Diagram: Local side with Local Code, Spark Context, and a JVM; Cluster side with JVMs]
36
PYSPARK PYTHON UDFS
How is the data movement implemented?
• Rows of data are pickled and sent from the executor JVM process to Python worker processes
• This bottlenecks the data pipeline, but how badly?
• Many people avoid this problem by defining their UDFs in Scala/Java and calling them from PySpark
[Diagram: JVM Executor and Python Workers exchanging Rows (Pickle) in both directions]
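The row-at-a-time shape of this data movement can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not PySpark's actual serializer, which is more involved: the rows are made up, each row is treated as its own pickled message, and the "worker" is just a list comprehension applying the `lambda x: x + 1` UDF used in the slides.

```python
import pickle

# Made-up rows of (id, value); a real executor would stream partition rows.
rows = [(i, float(i)) for i in range(1000)]

# Executor side: every row becomes its own pickled message.
messages = [pickle.dumps(row) for row in rows]

# Worker side: unpickle each row, apply the UDF to the value column,
# and pickle each result again to send it back.
udf = lambda x: x + 1
results = [pickle.dumps(udf(pickle.loads(m)[1])) for m in messages]

payload_bytes = sum(len(m) for m in messages)
raw_bytes = len(rows) * 16  # one 8-byte integer plus one 8-byte float per row

print(f"{len(messages)} messages, {payload_bytes} bytes sent for {raw_bytes} bytes of raw values")
```

Every row pays for a pickle call in each direction, and the pickled payload is larger than the raw values it carries, while the UDF itself does a single addition per row.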
37
PYSPARK PYTHON UDFS
Performance analysis of a basic UDF
Source: Julien Le Dem, Li Jin: Improving Python and Spark Performance and Interoperability with Apache Arrow
• Almost all of the time is spent serializing and deserializing data as opposed to the actual calculations!
• We can’t actually feed the GPU fast enough to take advantage of the performance benefits!
lambda x: x + 1
38
PYSPARK 2.3
First release with Apache Arrow compatibility!
spark.sql.execution.arrow.enabled → true
39
PYSPARK 2.3 PANDAS
Optimized Spark Data Frame ↔ Pandas Data Frame
df.toPandas()
createDataFrame(pdf)
40
PYSPARK 2.3 PANDAS UDFS
Vectorized user defined functions using Pandas
Scalar Pandas UDFs
@pandas_udf('double', PandasUDFType.SCALAR)
• Pandas.Series in, Pandas.Series out
• Input and output Series must be the same length
• Output Series must be of the type defined in the decorator
Grouped Map Pandas UDFs
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
• Pandas.DataFrame in, Pandas.DataFrame out
• Output DataFrame can be any length
• Output DataFrame schema defined via a Spark SQL DataFrame schema
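A minimal sketch of the two UDF types, written against the Spark 2.3-style API named on the slide. The function names, the `id` and `v` column names, and the schema string are illustrative assumptions. The UDF bodies are plain pandas functions, and pyspark is imported inside the wrapper only so the bodies can be tried without a Spark installation.

```python
import pandas as pd


def plus_one(v):
    """Scalar body: pandas.Series in, pandas.Series of the same length out."""
    return v + 1.0


def subtract_mean(pdf):
    """Grouped map body: one group's pandas.DataFrame in, a DataFrame out."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def make_pandas_udfs():
    """Wrap the bodies above as Pandas UDFs (requires pyspark with Arrow support)."""
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    scalar_udf = pandas_udf("double", PandasUDFType.SCALAR)(plus_one)
    grouped_udf = pandas_udf(
        "id long, v double", PandasUDFType.GROUPED_MAP
    )(subtract_mean)
    return scalar_udf, grouped_udf
```

Assuming a Spark DataFrame `df` with columns `id` and `v`, the scalar UDF would be applied with `df.withColumn("v_plus_one", scalar_udf(df["v"]))` and the grouped map UDF with `df.groupby("id").apply(grouped_udf)`. Later Spark releases deprecate `PandasUDFType` in favor of type hints, so this form matches the 2.3 API described here.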
41
PYSPARK 2.3 PANDAS UDFS
PySpark data movement performance issues resolved
• Data is converted from rows to Apache Arrow columnar record batches within the executor JVM processes
• Data does not have to be serialized or deserialized!
[Diagram: JVM Executor and Python Workers exchanging Columnar Record Batches in both directions, via Apache Arrow]
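The rows-to-columnar-batches step can be sketched as follows. This is only an illustration of the idea: typed `array` buffers from the standard library stand in for Arrow record batches, the rows are made up, and `BATCH_SIZE` is a hypothetical value chosen to keep the output small.

```python
from array import array

# Made-up rows of (id, value), as they might sit in a row-oriented partition.
rows = [(i, float(i)) for i in range(10)]
BATCH_SIZE = 4  # hypothetical batch size


def to_record_batches(rows, batch_size):
    """Yield one dict of typed column buffers per batch of rows."""
    for start in range(0, len(rows), batch_size):
        ids, values = zip(*rows[start:start + batch_size])
        yield {"id": array("q", ids), "v": array("d", values)}


batches = list(to_record_batches(rows, BATCH_SIZE))

# The UDF body now runs once per column batch instead of once per row.
plus_one = [array("d", (x + 1 for x in batch["v"])) for batch in batches]

print([len(batch["id"]) for batch in batches])  # [4, 4, 2]
```

Each batch holds contiguous, typed columns, which is the layout that lets a vectorized UDF (or a GPU kernel) process many values per call.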
42
PYSPARK 2.3 PANDAS UDFS
No more serialization and deserialization overhead!
Source: Julien Le Dem, Li Jin: Improving Python and Spark Performance and Interoperability with Apache Arrow
• With the data movement performance issues resolved, the bottleneck for many UDFs gets pushed back to the compute
• We can utilize GPUs to help in this respect!
lambda x: x + 1
43
APACHE SPARK WITH PYGDF
44
PANDAS UDFS WITH GPUS
Pandas ↔ PyGDF makes this easy!
Scalar Pandas UDFs
@pandas_udf('double', PandasUDFType.SCALAR)
Pandas.Series ↔ PyGDF.Series
Grouped Map Pandas UDFs
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
Pandas.DataFrame ↔ PyGDF.DataFrame
45
PANDAS UDFS WITH GPUS
What about for more advanced operations?
• Many UDFs are created because the function can’t be easily created using Spark primitives
• Probably can’t be created with PyGDF primitives either
• Writing low level code and tying it into your UDF is a non-starter
46
PANDAS UDFS WITH GPUS
Numba to the rescue!
• Luckily, PyGDF has convenience functions for Numba to JIT compile CUDA kernels for optimized execution on the GPU
• DataFrame.apply_rows()
• Series.applymap()
• UDFs within UDFs!
47
PANDAS UDFS WITH GPUS
Numba GPU-Accelerated PyGDF UDFs in Pandas UDFs
Scalar Pandas UDFs
@pandas_udf('double', PandasUDFType.SCALAR)
Pandas.Series ↔ PyGDF.Series
Grouped Map Pandas UDFs
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
Pandas.DataFrame ↔ PyGDF.DataFrame
48
LESSONS LEARNED
GPU-Accelerated UDFs are hard to do right
• Data needs to be large enough to utilize the GPU effectively, but not so large that it exhausts GPU memory (1e6 – 9e9)
• The work done on the GPU needs to be substantial enough to prevent data transfer from dominating execution time
    • E.g. group by a timestamp and run a Grouped Map Pandas UDF of GPU-accelerated PageRank per group
• PyGDF depends on Arrow 0.7.1 for now while PySpark uses Arrow 0.8+; updating the dependency is a work in progress
    • See https://github.com/kkraus14/libgdf/tree/temp_remove_ipc_arrow for a temporary workaround
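One way to act on the first lesson is a size check that picks a CPU or GPU code path per batch. The sketch below is hypothetical throughout: the bounds are copied from the "(1e6 – 9e9)" range above and assumed to count elements per batch, and `gpu_plus_one` is a placeholder for a PyGDF/Numba code path rather than a real implementation.

```python
# Hypothetical bounds taken from the "(1e6 – 9e9)" range; the unit is assumed
# to be elements per batch and would need tuning for a given GPU.
MIN_ELEMENTS = 1_000_000
MAX_ELEMENTS = 9_000_000_000


def should_offload(n_elements, min_elements=MIN_ELEMENTS, max_elements=MAX_ELEMENTS):
    """Return True when a batch falls inside the range worth sending to the GPU."""
    return min_elements <= n_elements <= max_elements


def run_udf(values, cpu_fn, gpu_fn):
    """Dispatch to the GPU path only when the batch size justifies the transfer."""
    fn = gpu_fn if should_offload(len(values)) else cpu_fn
    return fn(values)


def cpu_plus_one(values):
    return [x + 1 for x in values]


def gpu_plus_one(values):
    # Placeholder only: a real version would hand the batch to the GPU.
    raise NotImplementedError("stand-in for a PyGDF/Numba code path")


# A three-element batch is far too small, so the CPU path is used.
result = run_udf([1.0, 2.0, 3.0], cpu_fn=cpu_plus_one, gpu_fn=gpu_plus_one)
print(result)  # [2.0, 3.0, 4.0]
```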
49
FUTURE
50
PYGDF AND LIBGDF
• Optimized join performance
• GDF Graph Analytics Library
• Support for multiple interconnected GPUs in LibGDF and PyGDF (same PCIe root or NVLink)
• General performance improvements across the board
TPCH Query 21 – End to End Results Using 32-bit Keys*
TIME (MS)               SF1    SF10   SF100
CPU (single-threaded)  1329   31731  465064
V100 (PCIe3)             22     164    1521
V100 (3xNVLINK2)         12      45     466
Chart annotations: 3.2x, 300x
TPCH Query 4 – End to End Results Using 32-bit Keys*
TIME (MS)               SF1    SF10   SF100
CPU (single-threaded)   150    2041   24960
V100 (PCIe3)             13     105     946
V100 (3xNVLINK2)          7      23     308
Chart annotations: 3.1x, 26x
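The speedup annotations (3.2x and 300x for Query 21, 3.1x and 26x for Query 4) are not labeled in the extracted text. Assuming they compare the SF100 times, with the larger figure being CPU versus V100 over PCIe3 and the smaller being PCIe3 versus NVLink, the ratios can be checked directly from the numbers in the two tables above:

```python
# End-to-end times in milliseconds at SF100, copied from the two tables above.
sf100_ms = {
    "TPCH Query 21": {"cpu": 465064, "pcie": 1521, "nvlink": 466},
    "TPCH Query 4": {"cpu": 24960, "pcie": 946, "nvlink": 308},
}

speedups = {
    query: {
        "pcie_vs_cpu": t["cpu"] / t["pcie"],
        "nvlink_vs_pcie": t["pcie"] / t["nvlink"],
    }
    for query, t in sf100_ms.items()
}

for query, s in speedups.items():
    print(f"{query}: {s['pcie_vs_cpu']:.1f}x over CPU, "
          f"{s['nvlink_vs_pcie']:.2f}x NVLink over PCIe")
```

Under that assumption the computed ratios are roughly 306x and 3.26x for Query 21 and roughly 26.4x and 3.07x for Query 4, which is close to the figures shown on the slide.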
51
NUMBA AND CUPY
Standard Python GPU N-Dimensional Array
• Numba and CuPy are unifying their GPU backends to share an n-dimensional array implementation
• Hoping to get additional Python libraries like PyCUDA, PyTorch, etc. to unify as well in the future
52
DASK.GDF AND DASK.CUPY
Scale out in addition to scaling up
• Use Dask as the scale out method for distributed GPU data structures
• Extend Dask’s Kubernetes integration as needed to support the full extent of GPU integration
• Dask.GDF is in the very early stages of development: https://github.com/gpuopenanalytics/dask_gdf
• Dask.CuPy has not started yet, but if interested we’re hiring!
53
SPARK 2.3+ WISHES
More Arrow-based Pandas UDF types
Partition Pandas UDFs
@pandas_udf(schema, PandasUDFType.PARTITION)
• Pandas.DataFrame in, Pandas.DataFrame out
• Output DataFrame can be any length
• Output DataFrame schema defined via a Spark SQL DataFrame schema
54
SPARK 2.3+ WISHES
Arrow as the primary data format for Spark DataFrame
• Currently Spark can take advantage of columnar file formats and columnar data connections by loading the necessary columns and pushing down predicates
• Most typical operations benefit from columnar data structure
• Using Arrow will allow for optimized compute kernels and reduce the JVM dependency in the future
• Eventually native GPU acceleration
Executor
55
JOIN THE REVOLUTION
Everyone Can Help!
Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed!
APACHE ARROW
https://arrow.apache.org/
@ApacheArrow
GPU Open Analytics Initiative
http://gpuopenanalytics.com/
@Gpuoai
Joshua Patterson @datametrician
Keith Kraus @keithjkraus
QUESTIONS?

Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 

GPU-Accelerating UDFs in PySpark with Numba and PyGDF

  • 1. GPU-ACCELERATING UDFS IN PYSPARK WITH NUMBA AND PYGDF Joshua Patterson @datametrician Keith Kraus @keithjkraus
  • 3. DATA DELUGE TO INSIGHT HUNGRY: Increasing data variety. [Figure: data sources grouped under BUSINESS PROCESS, WEB, DIGITAL, and AI, with volumes growing from gigabytes through terabytes, petabytes, and exabytes to zettabytes. Sources shown: Search Marketing, Behavioral Targeting, Dynamic Funnels, User Generated Content, Mobile Web, SMS/MMS, Sentiment, HD Video, Speech To Text, Product/Service Logs, Social Network, Business Data Feeds, User Click Stream, Sensors, Infotainment Systems, Wearable Devices, Cyber Security Logs, Connected Vehicles, Machine Data, IoT Data, Dynamic Pricing, Payment Record, Purchase Detail, Purchase Record, Support Contacts, Segmentation, Offer Details, Web Logs, Offer History, A/B Testing, Streaming Video, Natural Language Processing.]
  • 4. DATA FORMATS: Avro, XML, JSON, GML, ProtoBuf, HDFS, Pickle, CSV, Parquet, Pandas, NumPy, CSR, COO, CSC (not a complete list). Plain text vs. binary; compressed vs. uncompressed.
  • 5. DATA PROCESSING EVOLUTION: Faster data access, less data movement. Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train.
  • 6. DATA PROCESSING EVOLUTION: Faster data access, less data movement. Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train. Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory).
  • 7. APACHE SPARK: Cluster computing framework. Spark has almost become synonymous with Hadoop and Big Data. Integrates with nearly the entire Big Data ecosystem. The processing layer for big data and leading ML framework. Five main components: RDD API, SQL, Streaming, MLlib, and GraphX.
  • 8. SPARK IS NOT ENOUGH: Basic workloads are bottlenecked by the CPU. In a simple benchmark consisting of aggregating data, the CPU is the bottleneck. This is after the data is parsed and cached into memory, which is another common bottleneck. The CPU bottleneck is even worse in more complex workloads! Benchmark query: SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type; Source: Mark Litwintschik’s blog: 1.1 Billion Taxi Rides: EC2 versus EMR.
  • 9. SPARK ECOSYSTEM: Lacks full GPU integration. 4 core parts: SQL, Streaming (Spark functions micro batched), Machine Learning, & Graph. Spark is currently optimizing its existing code base and adding more usability, not GPU support yet.
  • 10. SPARK ECOSYSTEM: GPU acceleration possible but not ideal. Using Numba, the Microsoft Azure team released a basic example showing a ~5x speedup using GPUs with Spark. This example is extremely limited in that they’re not passing any real data to the Python process or the GPU. When wanting to pass data from Spark to the GPU, there are new issues and performance considerations. Source: https://github.com/Azure/aztk/blob/master/node_scripts/jupyter-samples/GPU%2Bvs%2BCPU%2Busing%2BNumba.ipynb
  • 12. GPUS ARE FAST: 1.1 Billion Taxi Ride Benchmark. Time in milliseconds (MapD DGX-1 / MapD 4 x P100 / Redshift 6-node / Spark 11-node): Query 1: 21 / 30 / 1560 / 10190; Query 2: 80 / 99 / 1250 / 8134; Query 3: 150 / 269 / 2250 / 19624; Query 4: 372 / 696 / 2970 / 85942. Source: MapD Benchmarks on DGX from internal NVIDIA testing following guidelines of Mark Litwintschik’s blogs (@marklit82): Redshift, 6-node ds2.8xlarge cluster & Spark 2.1, 11 x m3.xlarge cluster w/ HDFS.
  • 13. GPUS ARE FAST: K-Means Benchmark 10 with latest solver
  • 14. DATA PROCESSING EVOLUTION: Faster data access, less data movement. Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train. Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory). GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU).
  • 16. GPU-ACCELERATED ARCHITECTURE THEN: Too much data movement and too many different data formats. [Diagram labels: CPU, GPU; APP A and APP B each Read Data and Load Data, then Copy & Convert between APP A GPU Data and APP B GPU Data. Logos: H2O.ai, Graphistry, BlazingDB, MapD, Simantex, Anaconda, Gunrock, nvGRAPH.]
  • 17. GPU-ACCELERATED ARCHITECTURE THEN: Too much data movement and too many different data formats. [Same diagram as slide 16 with the logos rearranged.]
  • 18. APACHE ARROW COMMON DATA LAYER: From Apache Arrow Home Page - https://arrow.apache.org/
  • 19. GPU-ACCELERATED ARCHITECTURE NOW: Single data format and shared access to data on GPU. [Diagram labels: CPU, GPU, GPU MEM; Read Data, Load Data; GPU Data Frame, powered by Apache Arrow. Logos: BlazingDB, MapD, Simantex, H2O.ai, Graphistry, Anaconda, Gunrock, nvGRAPH.]
  • 20. GPU DATA FRAME: Faster data access, less data movement. Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train. Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory). GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU). End to end GPU processing (GOAI): Arrow Read → Query → ETL → ML Train (25-100x improvement, same code, language flexible, primarily on GPU).
  • 21. GPU OPEN ANALYTICS INITIATIVE: First project, the GPU Data Frame. No copy & converts, full interoperability. GPU Data Frame (GDF) is the first project of GOAI: Apache Arrow for GPU. libgdf: a C library of helper functions, including: copying the GDF metadata block to the host and parsing it to a host-side struct; importing/exporting a GDF using the CUDA IPC mechanism; CUDA kernels to perform element-wise math operations on GDF columns; CUDA sort, join, and reduction operations on GDFs. pygdf: a Python library for manipulating GDFs: Python interface to the libgdf library with additional functionality; creating GDFs from NumPy arrays and Pandas DataFrames; JIT compilation of group by and filter kernels using Numba. dask_gdf: extension for Dask to work with distributed GDFs: same operations as pygdf, but working on GDFs chunked onto different GPUs and different servers; will bring the same Kubernetes support that Dask already has. github.com/gpuopenanalytics. Logos: H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, Simantex; GPU Data Frame powered by Apache Arrow.
  • 23. GPU ACCELERATION ACROSS THE ECOSYSTEM: Apache Arrow, H2O.ai, Numba, Gunrock, Graphistry, BlazingDB, MapD, nvGRAPH, Simantex; GPU Data Frame powered by Apache Arrow.
  • 24. DATA PROCESSING EVOLUTION: Faster data access, less data movement. Hadoop processing, reading from disk: HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train. Spark in-memory processing: HDFS Read → Query → ETL → ML Train (25-100x improvement, less code, language flexible, primarily in-memory). GPU/Spark in-memory processing: HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train (5-10x improvement, more code, language rigid, substantially on GPU). End to end GPU processing (GOAI): Arrow Read → Query → ETL → ML Train (25-100x improvement, same code, language flexible, primarily on GPU).
  • 26. PYGDF: Python GPU DataFrame library. @gpuoai
  • 30. APACHE SPARK: Cluster computing framework. [Diagram labels: Local: Local Code, Spark Context, JVM; Cluster: JVM, JVM.]
  • 32. PYSPARK: No cluster execution in Python if using Spark built-ins. [Diagram labels: Local: Local Code, Spark Context, JVM; Cluster: JVM, JVM.]
  • 33. PYSPARK UDFS: When Spark built-ins can’t get the job done alone. User defined functions (UDFs) allow for creating column-based functions outside of the scope of Spark built-in functions. UDFs can be defined in Scala/Java or Python and be called from PySpark. Using Python lambdas in map functions is essentially the same as using a Python UDF.
  • 34. PYSPARK PYTHON UDFS: Python UDFs in PySpark need Python workers and data movement. [Diagram labels: Local: Local Code, Spark Context, JVM; Cluster: JVM, JVM.]
  • 35. PYSPARK PYTHON UDFS: Moving data from the JVM to Python efficiently is hard. [Diagram labels: Local: Local Code, Spark Context, JVM; Cluster: JVM, JVM.]
  • 36. PYSPARK PYTHON UDFS: How is the data movement implemented? Rows of data are pickled and sent from the executor JVM process to Python worker processes. This bottlenecks the data pipeline, but how badly? Many people avoid this problem by defining their UDFs in Scala/Java and calling them from PySpark. [Diagram: JVM Executor ↔ Python Workers, exchanging Rows (Pickle).]
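To make the cost of row-wise pickling concrete, the following standard-library sketch compares pickling every row separately with shipping one contiguous column buffer. It is a simplified illustration and not Spark's actual serializer; the sample data and all variable names are hypothetical.

```python
import pickle
from array import array

# Hypothetical sample data: one float column with 100,000 rows.
values = [float(i) for i in range(100_000)]

# Row-oriented transfer: every row is pickled as its own small message.
row_payloads = [pickle.dumps((v,)) for v in values]
rows_back = [pickle.loads(p)[0] for p in row_payloads]

# Column-oriented transfer: the whole column is one contiguous buffer of doubles.
column_payload = array("d", values).tobytes()
column_back = array("d")
column_back.frombytes(column_payload)

row_bytes = sum(len(p) for p in row_payloads)
print(f"row-wise pickle: {len(row_payloads)} messages, {row_bytes} bytes")
print(f"columnar buffer: 1 message, {len(column_payload)} bytes")
```

The columnar buffer carries only the raw values, while each pickled row repeats protocol and framing bytes and needs its own serialize and deserialize call.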
  • 37. PYSPARK PYTHON UDFS: Performance analysis of a basic UDF, lambda x: x + 1. Almost all of the time is spent serializing and deserializing data as opposed to the actual calculations! We can’t actually feed the GPU fast enough to take advantage of the performance benefits! Source: Julien Le Dem, Li Jin: Improving Python and Spark Performance and Interoperability with Apache Arrow.
  • 38. PYSPARK 2.3: First release with Apache Arrow compatibility! spark.sql.execution.arrow.enabled → true
  • 39. PYSPARK 2.3 PANDAS: Optimized Spark Data Frame ↔ Pandas Data Frame conversion: df.toPandas() and createDataFrame(pdf).
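A minimal sketch of the Arrow setting and the two conversion calls named on slides 38 and 39, written against the PySpark 2.3 configuration key from the slide. The helper names `enable_arrow` and `pandas_roundtrip` are hypothetical; `spark` is assumed to be an existing SparkSession and `pdf` a Pandas DataFrame.

```python
# Configuration key shown on the slide (PySpark 2.3).
ARROW_ENABLED_KEY = "spark.sql.execution.arrow.enabled"


def enable_arrow(spark):
    """Turn on Arrow-based conversion for an existing SparkSession."""
    spark.conf.set(ARROW_ENABLED_KEY, "true")


def pandas_roundtrip(spark, pdf):
    """Pandas DataFrame -> Spark DataFrame -> Pandas DataFrame."""
    enable_arrow(spark)
    sdf = spark.createDataFrame(pdf)  # Pandas to Spark
    return sdf.toPandas()             # Spark back to Pandas
```

With the setting enabled, both conversions can move data as Arrow record batches rather than row by row.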
  • 40. PYSPARK 2.3 PANDAS UDFS: Vectorized user defined functions using Pandas. Scalar Pandas UDFs, @pandas_udf('double', PandasUDFType.SCALAR): Pandas.Series in, Pandas.Series out; input and output Series must be the same length; output Series must be of the type defined in the decorator. Grouped Map Pandas UDFs, @pandas_udf(schema, PandasUDFType.GROUPED_MAP): Pandas.DataFrame in, Pandas.DataFrame out; output DataFrame can be any length; output DataFrame schema defined via a Spark SQL DataFrame schema.
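The two UDF types could be sketched as follows, assuming the PySpark 2.3 API and a Spark DataFrame `df` with hypothetical columns `id` (long) and `v` (double). The function names are illustrative, and `pyspark` is imported inside `apply_udfs` only so that the plain Pandas functions can be read and tried without a Spark installation.

```python
def plus_one(v):
    """Scalar contract: pandas.Series in, pandas.Series of the same length out."""
    return v + 1


def subtract_mean(pdf):
    """Grouped map contract: pandas.DataFrame in, pandas.DataFrame out."""
    return pdf.assign(v=pdf.v - pdf.v.mean())


def apply_udfs(df):
    """Wrap both functions as Pandas UDFs and apply them to a Spark DataFrame."""
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Scalar UDF: the return type is the Spark type of the output column.
    plus_one_udf = pandas_udf(plus_one, "double", PandasUDFType.SCALAR)
    # Grouped map UDF: the return type is the schema of the output DataFrame.
    subtract_mean_udf = pandas_udf(
        subtract_mean, "id long, v double", PandasUDFType.GROUPED_MAP
    )

    scalar_result = df.withColumn("v_plus_one", plus_one_udf(df.v))
    grouped_result = df.groupby("id").apply(subtract_mean_udf)
    return scalar_result, grouped_result
```

The scalar UDF is called once per Arrow batch of a column, while the grouped map UDF is called once per group with that group's rows as a Pandas DataFrame.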
  • 41. PYSPARK 2.3 PANDAS UDFS: PySpark data movement performance issues resolved. Data is converted from rows to Apache Arrow columnar record batches within the executor JVM processes. Data does not have to be serialized or deserialized! [Diagram: JVM Executor ↔ Python Workers, exchanging Columnar Record Batches via Apache Arrow.]
  • 42. PYSPARK 2.3 PANDAS UDFS: No more serialization and deserialization overhead (same UDF, lambda x: x + 1)! With the data movement performance issues resolved, the bottleneck for many UDFs gets pushed back to the compute. We can utilize GPUs to help in this respect! Source: Julien Le Dem, Li Jin: Improving Python and Spark Performance and Interoperability with Apache Arrow.
  • 44. PANDAS UDFS WITH GPUS: Pandas ↔ PyGDF makes this easy! Scalar Pandas UDFs, @pandas_udf('double', PandasUDFType.SCALAR): Pandas.Series ↔ PyGDF.Series. Grouped Map Pandas UDFs, @pandas_udf(schema, PandasUDFType.GROUPED_MAP): Pandas.DataFrame ↔ PyGDF.DataFrame.
  • 45. PANDAS UDFS WITH GPUS: What about for more advanced operations? Many UDFs are created because the function can’t be easily created using Spark primitives. Probably can’t be created with PyGDF primitives either. Writing low level code and tying it into your UDF is a non-starter.
  • 46. PANDAS UDFS WITH GPUS: Numba to the rescue! Luckily, PyGDF has convenience functions for Numba to JIT compile CUDA kernels for optimized execution on the GPU: DataFrame.apply_rows() and Series.applymap(). UDFs within UDFs!
  • 47. PANDAS UDFS WITH GPUS: Numba GPU-accelerated PyGDF UDFs in Pandas UDFs. Scalar Pandas UDFs, @pandas_udf('double', PandasUDFType.SCALAR): Pandas.Series ↔ PyGDF.Series. Grouped Map Pandas UDFs, @pandas_udf(schema, PandasUDFType.GROUPED_MAP): Pandas.DataFrame ↔ PyGDF.DataFrame.
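A hedged sketch of the "UDFs within UDFs" idea: a row-wise kernel that PyGDF hands to Numba through `DataFrame.apply_rows()`, wrapped in a function shaped like the body of a Grouped Map Pandas UDF. The PyGDF calls (`DataFrame.from_pandas`, `apply_rows` with `incols`/`outcols`/`kwargs`, `to_pandas`) are written from memory of the library at that time and should be treated as assumptions; the column names `x`, `y`, `out` and the `weight` argument are hypothetical.

```python
def weighted_sum_kernel(x, y, out, weight):
    """Row-wise kernel: argument names match the column names passed below."""
    for i, (a, b) in enumerate(zip(x, y)):
        out[i] = weight * a + b


def gpu_weighted_sum(pdf):
    """Pandas DataFrame in, Pandas DataFrame out, with the math done on the GPU."""
    # Imported lazily: both packages are only needed on GPU-equipped workers.
    import numpy as np
    import pygdf  # assumption: PyGDF as presented in the talk

    gdf = pygdf.DataFrame.from_pandas(pdf)  # host to GPU
    result = gdf.apply_rows(
        weighted_sum_kernel,
        incols=["x", "y"],
        outcols=dict(out=np.float64),
        kwargs=dict(weight=2.0),
    )
    return result.to_pandas()  # GPU back to host
```

Because the kernel is ordinary Python, it can be checked on plain lists before it is compiled for the GPU, which keeps the GPU-specific part limited to the wrapper.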
  • 48. LESSONS LEARNED: GPU-accelerated UDFs are hard to do right. Data needs to be large enough to utilize the GPU effectively, but not so large that it exhausts GPU memory (1e6 – 9e9). The work done on the GPU needs to be substantial enough to prevent data transfer from dominating execution time, e.g., group by a timestamp and run a Grouped Map Pandas UDF of GPU-accelerated pagerank per group. PyGDF depends on Arrow 0.7.1 for now while PySpark uses Arrow 0.8+; updating the dependency is a work in progress. See https://github.com/kkraus14/libgdf/tree/temp_remove_ipc_arrow for a temporary workaround.
  • 50. PYGDF AND LIBGDF: Optimized join performance. GDF Graph Analytics Library. Support for multiple interconnected GPUs in LibGDF and PyGDF (same PCIe root or NVLink). General performance improvements across the board. TPCH Query 21, end to end results using 32-bit keys*, time (ms) at SF1 / SF10 / SF100: CPU (single-threaded) 1329 / 31731 / 465064; V100 (PCIe3) 22 / 164 / 1521; V100 (3xNVLINK2) 12 / 45 / 466; chart callouts: 3.2x, 300x. TPCH Query 4, end to end results using 32-bit keys*, time (ms) at SF1 / SF10 / SF100: CPU (single-threaded) 150 / 2041 / 24960; V100 (PCIe3) 13 / 105 / 946; V100 (3xNVLINK2) 7 / 23 / 308; chart callouts: 3.1x, 26x.
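The speedup callouts on slide 50 can be checked against the timings with a few lines of arithmetic. In the sketch below the numbers are copied from the slide, while the dictionary layout and the `speedups` helper are hypothetical. Reading the callouts as SF100 ratios (CPU vs. PCIe3 and PCIe3 vs. NVLink2) is an inference from the numbers, not something the slide states.

```python
# Times in milliseconds, copied from the two TPC-H tables on slide 50.
SCALE_FACTORS = ("SF1", "SF10", "SF100")
TIMES_MS = {
    "Query 21": {
        "cpu": (1329, 31731, 465064),
        "pcie3": (22, 164, 1521),
        "nvlink2": (12, 45, 466),
    },
    "Query 4": {
        "cpu": (150, 2041, 24960),
        "pcie3": (13, 105, 946),
        "nvlink2": (7, 23, 308),
    },
}


def speedups(query, baseline, target):
    """Return baseline_time / target_time for each scale factor."""
    base = TIMES_MS[query][baseline]
    tgt = TIMES_MS[query][target]
    return {sf: b / t for sf, b, t in zip(SCALE_FACTORS, base, tgt)}


for query in TIMES_MS:
    cpu_vs_pcie = speedups(query, "cpu", "pcie3")
    pcie_vs_nvlink = speedups(query, "pcie3", "nvlink2")
    for sf in SCALE_FACTORS:
        print(
            f"{query} {sf}: CPU/PCIe3 = {cpu_vs_pcie[sf]:.1f}x, "
            f"PCIe3/NVLink2 = {pcie_vs_nvlink[sf]:.1f}x"
        )
```

At SF100 the ratios come out close to the callouts (roughly 300x and 3.2x for Query 21, 26x and 3.1x for Query 4), and the GPU advantage over the single-threaded CPU grows with the scale factor.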
  • 51. NUMBA AND CUPY: Standard Python GPU n-dimensional array. Numba and CuPy are unifying their GPU backends to share an n-dimensional array implementation. Hoping to get additional Python libraries like PyCUDA, PyTorch, etc. to unify as well in the future.
  • 52. DASK.GDF AND DASK.CUPY: Scale out in addition to scaling up. Use Dask as the scale out method for distributed GPU data structures. Extend Dask’s Kubernetes integration as needed to support the full extent of GPU integration. Dask.GDF is in the very early stages of development: https://github.com/gpuopenanalytics/dask_gdf. Dask.CuPy has not started yet, but if interested we’re hiring!
  • 53. SPARK 2.3+ WISHES: More Arrow-based Pandas UDF types. Partition Pandas UDFs, @pandas_udf(schema, PandasUDFType.PARTITION): Pandas.DataFrame in, Pandas.DataFrame out; output DataFrame can be any length; output DataFrame schema defined via a Spark SQL DataFrame schema.
  • 54. SPARK 2.3+ WISHES: Arrow as the primary data format for Spark DataFrame. Currently Spark can take advantage of columnar file formats and columnar data connections by loading the necessary columns and pushing down predicates. Most typical operations benefit from columnar data structure. Using Arrow will allow for optimized compute kernels and reduce the JVM dependency in the future. Eventually native GPU acceleration.
  • 55. JOIN THE REVOLUTION: Everyone can help! Integrations, feedback, documentation support, pull requests, new issues, or donations welcomed! Apache Arrow: https://arrow.apache.org/ @ApacheArrow. GPU Open Analytics Initiative: http://gpuopenanalytics.com/ @Gpuoai.
  • 56. Joshua Patterson @datametrician Keith Kraus @keithjkraus QUESTIONS?