SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Downloaden Sie, um offline zu lesen
Scalable Acceleration of XGBoost Training
on Spark GPU Clusters
Rong Ou, Bobby Wang
NVIDIA
Agenda
Rong Ou
Introduction to XGBoost, gradient-based
sampling, learning to rank
Bobby Wang
XGBoost training with GPUs on Spark
2.x/3.0
XGBoost
XGBoost
▪ Open source gradient boosting library
▪ Supports regression, classification, ranking and user
defined objectives
▪ Wins many data science and machine learning
challenges
▪ Used in production by multiple companies
Distributed XGBoost
▪ Supports distributed training on multiple machines,
including AWS, GCE, Azure, and Yarn clusters
▪ Can be integrated with Flink, Spark and other cloud
dataflow systems
XGBoost GPU Support
▪ Tree construction (training) and prediction can be
accelerated with CUDA-capable GPUs
▪ Use gpu_hist as the tree method
Gradient-based Sampling
Out-of-core Boosting
▪ GPU memory is typically smaller than main memory
▪ Large datasets may not fit in GPU memory, even on a
production cluster
▪ Naively streaming data over the PCIe bus is too slow
Sampling
▪ At the beginning of each iteration, sample the data,
then use the sample to build the tree
▪ Uniform sampling requires at least 50% of the data to
be sampled
Gradient-based Sampling
▪ Sample based on probability proportional to the
gradients
▪ Gradient-based One-Side Sampling (GOSS)
▪ Minimal Variance Sampling (MVS)
▪ Sample ratio as low as 0.1 without loss of accuracy
Maximum Data Size
# Rows
In-core GPU 9 million
Out-of-core GPU 12 million
Out-of-core GPU, f = 0.1 85 million
Synthetic dataset with 500 columns, NVIDIA Tesla V100 GPU (16 GB)
Training Time
Time (seconds) AUC
CPU In-core 1309.64 0.8393
CPU Out-of-core 1228.53 0.8393
GPU In-core 241.52 0.8398
GPU Out-of-core, f = 1.0 211.91 0.8396
GPU Out-of-core, f = 0.5 427.41 0.8395
GPU Out-of-core, f = 0.3 421.59 0.8399
Higgs dataset, NVIDIA Titan V
Model Accuracy
Learning to Rank
Learning to Rank (LTR) in a Nutshell
▪ Used in Information Retrieval (IR) class of problems
▪ A search engine indexes billions of documents
▪ A search user query should return most relevant documents
▪ Hence, pages are grouped first based on user query relevance,
domains, sub domains etc.
▪ Within each group, the pages are ranked
▪ Initial ranking is based on editorial judgement of user queries
▪ The ranking is iteratively refined based on the performance of the
previous model
LTR in XGBoost
▪ XGBoost incrementally builds a better model by
combining multiple weak models
▪ Models are built by gradient descent using an objective
function such as LTR
▪ XGBoost uses LambdaMart ranking algorithm which
uses pairwise ranking approach
▪ This minimizes pairwise loss by repeatedly sampling
pairs of instances
LTR Algorithms
▪ 3 Algorithms are supported
▪ Pairwise (default)
▪ mAP - mean Average Precision
▪ nDCG - normalized Discounted Cumulative Gain
▪ mAP and nDCG further minimizes Pairwise loss by
adjusting it with the weight of instance pair chosen
Enable and Measure Model Performance
▪ Train on GPU (tree_method = gpu_hist)
▪ Choose the appropriate objective function (objective = rank:map)
▪ Measure performance of the model after each training round by enabling one of the following
ranking metric (eval_metric = map)
▪ Ranking and metric evaluation are both accelerated on the GPU
▪ mAP - mean Average Precision (default)
▪ pre[@n] - precision [for top n documents]
▪ nDCG[@n] - normalized Discounted Cumulative Gain [for top n documents]
▪ auc - area under the ROC curve
▪ aucpr - area under the precision recall curve
▪ For more information and paper references, please refer to this blog
Performance - Environment and Configuration
▪ Used Microsoft benchmark ranking dataset
▪ Consists of ~11.3 million training instances, scattered across ~95K groups and
consuming ~13 GB of disk space
▪ System info
▪ Intel Xeon 2.3 GHZ, 1 socket, 6 cores / socket, 2 threads / core, 80 GB system
memory, 1 NVIDIA V100 16GB GPU; does not use hyper threads (uses only 6 cores
for training)
▪ Training configuration
▪ Used default training configuration on GPU; built 100 trees; used pairwise, ndcg
and map ranking algorithms and map to measure the model performance
Performance - Numbers
Algorithm pairwise ndcg map
GPU 1.72 2.54 2.73
CPU 42.37 59.33 46.38
Speedup 24.63x 23.36x 16.99x
Ranking + metric computation times (in seconds) - using XGBoost HEAD from 5/18/20
XGBoost + Spark 2.x
XGBoost
▪ How to use XGBoost to train on existing data?
▪ Convert the existing data to the numeric data
▪ Do ETL on existing data
XGBoost4j - Spark
▪ Integrate XGBoost with Apache Spark
▪ Use the high-performance algorithm implementation of XGBoost
▪ Leverage the powerful data processing engine of Spark
XGBoost + Spark 2.x + Rapids
▪ Rapids cuDF (libCudf + language bindings)
XGBoost + Spark 2.x + Rapids
▪ Read CSV/Parquet/Orc directly to GPU memory
▪ Chunks loading
▪ Convert column-major cuDF to sparse, row-major
DMatrix
Training on GPUs with Spark 2.x
val df = spark.read.parquet(path)
val featureNames = Seq("f1", "f2", "f3")
val vectorAssembler = new VectorAssembler()
.setInputCols(featureNames.toArray)
.setOutputCol("features")
val xgbInput = vectorAssembler
.transform(df).select("features", labelColName)
val xgbClassifier = new XGBoostClassifier(params)
.setLabelCol(labelColName)
.setTreeMethod("hist")
.setFeaturesCol("features")
val model = xgbClassifier.fit(xgbInput)
val gpuDf = new GpuDataReader(spark).parquet(path)
val featureNames = Seq("f1", "f2", "f3")
val xgbClassifier = new XGBoostClassifier(params)
.setLabelCol(labelColName)
.setTreeMethod("gpu_hist")
.setFeaturesCols(featureNames)
val model = xgbClassifier.fit(gpuDf)
CPU GPU
XGBoost + Spark 2.x + Rapids
▪ Training classification model for 17 year mortgage data (190GB)
XGBoost + Spark 3.0
XGBoost + Spark 3.0 + Rapids
▪ Rapids-plugin-4-spark
▪ Apache Spark plugin that leverages GPUs to accelerate processing
via Rapids libraries
Seamless Integration with Spark 3.0
▪ Features
▪ Use existing (unmodified)
customer code
▪ Spark features that are not
GPU enabled run transparently
on the CPU
▪ Initial Release - GPU Acceleration
of:
▪ Spark Data Frames
▪ Spark SQL
▪ ML/DL training frameworks
Rapids Plugin
UCX LibrariesRapids C++ Libraries
CUDA
JNI bindings
Mapping From Java/Scala to C++
RAPIDS Accelerator
for Spark
DISTRIBUTED SCALE-OUT SPARK APPLICATIONS
Spark SQL API Spark ShuffleDataFrame API
if gpu_enabled(operation, data_type)
call-out to RAPIDS
else
execute standard Spark operation
JNI bindings
Mapping From Java/Scala to C++
● Custom Implementation of Spark
Shuffle
● Optimized to use RDMA and GPU-
to-GPU direct communication
APACHE SPARK CORE
XGBoost + Spark 3.0 + Rapids
▪ GPU-scheduling
▪ GPU-accelerated data reader
▪ Chunks loading
▪ Operators run on GPU, e.g. filter, sort, join, groupby,
etc.
Training on GPUs with Spark 3.0
val df = spark.read.parquet(path)
val featureNames = Seq("f1", "f2", "f3")
val vectorAssembler = new VectorAssembler()
.setInputCols(featureNames.toArray)
.setOutputCol("features")
val xgbInput = vectorAssembler
.transform(df).select("features", labelColName)
val xgbClassifier = new XGBoostClassifier(params)
.setLabelCol(labelColName)
.setTreeMethod("hist")
.setFeaturesCol("features")
val model = xgbClassifier.fit(xgbInput)
val df = spark.read.parquet(path)
val featureNames = Seq("f1", "f2", "f3")
val xgbClassifier = new XGBoostClassifier(params)
.setLabelCol(labelColName)
.setTreeMethod("gpu_hist")
.setFeaturesCols(featureNames)
val model = xgbClassifier.fit(df)
CPU GPU
XGBoost + Spark 3 + Rapids
▪ Training classification model for 23 days Criteo data (1TB)
New eBook: Accelerating Spark 3
Download at: nvidia.com/Spark-book
In this ebook you'll learn about:
● The data processing evolution, from Hadoop to
GPUs and the NVIDIA RAPIDS™ library
● Spark, what it is, what it does, and why it
matters
● GPU-acceleration in Spark
● DataFrames and Spark SQL
● A Spark regression example with a random
forest classifier
● An example of an end-to-end machine learning
workflow GPU-accelerated with XGBoost
Reference
▪ XGBoost for Spark 2.x
▪ https://github.com/rapidsai/xgboost/tree/rapids-spark
▪ XGBoost for Spark 3
▪ https://github.com/rapidsai/xgboost/tree/rapids-spark3.0
▪ XGBoost example for Spark 2.x
▪ https://github.com/rapidsai/spark-examples/tree/master
▪ XGBoost example for Spark 3
▪ https://github.com/rapidsai/spark-examples/tree/support-spark3.0
▪ Blog: Machine learning with XGBoost gets faster with Dataproc on GPUs
▪ Blog: GPU-Accelerated Spark XGBoost – A Major Milestone on the Road to Large-Scale AI
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Databricks
 

Was ist angesagt? (20)

PostgreSQL Deep Internal
PostgreSQL Deep InternalPostgreSQL Deep Internal
PostgreSQL Deep Internal
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of Tradeoffs
 
Data Federation with Apache Spark
Data Federation with Apache SparkData Federation with Apache Spark
Data Federation with Apache Spark
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Vectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at FacebookVectorized Query Execution in Apache Spark at Facebook
Vectorized Query Execution in Apache Spark at Facebook
 
Query Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQLQuery Optimizer – MySQL vs. PostgreSQL
Query Optimizer – MySQL vs. PostgreSQL
 
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time ActionApache Spark and MongoDB - Turning Analytics into Real-Time Action
Apache Spark and MongoDB - Turning Analytics into Real-Time Action
 
In-memory Databases
In-memory DatabasesIn-memory Databases
In-memory Databases
 
개발자를 위한 Amazon Lightsail Deep-Dive - 정창훈(당근마켓)
개발자를 위한 Amazon Lightsail Deep-Dive - 정창훈(당근마켓)개발자를 위한 Amazon Lightsail Deep-Dive - 정창훈(당근마켓)
개발자를 위한 Amazon Lightsail Deep-Dive - 정창훈(당근마켓)
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 

Ähnlich wie Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters

Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Databricks
 
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
SQUADEX
 
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Databricks
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
PeterAndreasEntschev
 

Ähnlich wie Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters (20)

Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
Comparing pregel related systems
Comparing pregel related systemsComparing pregel related systems
Comparing pregel related systems
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUsChoose Your Weapon: Comparing Spark on FPGAs vs GPUs
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs
 
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
Transparent GPU Exploitation on Apache Spark with Kazuaki Ishizaki and Madhus...
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various cloudsPGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
Profiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systemsProfiling deep learning network using NVIDIA nsight systems
Profiling deep learning network using NVIDIA nsight systems
 

Mehr von Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Kürzlich hochgeladen

Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 

Kürzlich hochgeladen (20)

Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters

  • 1.
  • 2. Scalable Acceleration of XGBoost Training on Spark GPU Clusters Rong Ou, Bobby Wang NVIDIA
  • 3. Agenda Rong Ou Introduction to XGBoost, gradient-based sampling, learning to rank Bobby Wang XGBoost training with GPUs on Spark 2.x/3.0
  • 5. XGBoost ▪ Open source gradient boosting library ▪ Supports regression, classification, ranking and user defined objectives ▪ Wins many data science and machine learning challenges ▪ Used in production by multiple companies
  • 6. Distributed XGBoost ▪ Supports distributed training on multiple machines, including AWS, GCE, Azure, and Yarn clusters ▪ Can be integrated with Flink, Spark and other cloud dataflow systems
  • 7. XGBoost GPU Support ▪ Tree construction (training) and prediction can be accelerated with CUDA-capable GPUs ▪ Use gpu_hist as the tree method
  • 9. Out-of-core Boosting ▪ GPU memory is typically smaller than main memory ▪ Large datasets may not fit in GPU memory, even on a production cluster ▪ Naively streaming data over the PCIe bus is too slow
  • 10. Sampling ▪ At the beginning of each iteration, sample the data, then use the sample to build the tree ▪ Uniform sampling requires at least 50% of the data to be sampled
  • 11. Gradient-based Sampling ▪ Sample based on probability proportional to the gradients ▪ Gradient-based One-Side Sampling (GOSS) ▪ Minimal Variance Sampling (MVS) ▪ Sample ratio as low as 0.1 without loss of accuracy
  • 12. Maximum Data Size # Rows In-core GPU 9 million Out-of-core GPU 12 million Out-of-core GPU, f = 0.1 85 million Synthetic dataset with 500 columns, NVIDIA Tesla V100 GPU (16 GB)
  • 13. Training Time Time (seconds) AUC CPU In-core 1309.64 0.8393 CPU Out-of-core 1228.53 0.8393 GPU In-core 241.52 0.8398 GPU Out-of-core, f = 1.0 211.91 0.8396 GPU Out-of-core, f = 0.5 427.41 0.8395 GPU Out-of-core, f = 0.3 421.59 0.8399 Higgs dataset, NVIDIA Titan V
  • 16. Learning to Rank (LTR) in a Nutshell ▪ Used in Information Retrieval (IR) class of problems ▪ A search engine indexes billions of documents ▪ A search user query should return most relevant documents ▪ Hence, pages are grouped first based on user query relevance, domains, sub domains etc. ▪ Within each group, the pages are ranked ▪ Initial ranking is based on editorial judgement of user queries ▪ The ranking is iteratively refined based on the performance of the previous model
  • 17. LTR in XGBoost ▪ XGBoost incrementally builds a better model by combining multiple weak models ▪ Models are built by gradient descent using an objective function such as LTR ▪ XGBoost uses LambdaMart ranking algorithm which uses pairwise ranking approach ▪ This minimizes pairwise loss by repeatedly sampling pairs of instances
  • 18. LTR Algorithms ▪ 3 Algorithms are supported ▪ Pairwise (default) ▪ mAP - mean Average Precision ▪ nDCG - normalized Discounted Cumulative Gain ▪ mAP and nDCG further minimizes Pairwise loss by adjusting it with the weight of instance pair chosen
  • 19. Enable and Measure Model Performance ▪ Train on GPU (tree_method = gpu_hist) ▪ Choose the appropriate objective function (objective = rank:map) ▪ Measure performance of the model after each training round by enabling one of the following ranking metric (eval_metric = map) ▪ Ranking and metric evaluation are both accelerated on the GPU ▪ mAP - mean Average Precision (default) ▪ pre[@n] - precision [for top n documents] ▪ nDCG[@n] - normalized Discounted Cumulative Gain [for top n documents] ▪ auc - area under the ROC curve ▪ aucpr - area under the precision recall curve ▪ For more information and paper references, please refer to this blog
  • 20. Performance - Environment and Configuration ▪ Used Microsoft benchmark ranking dataset ▪ Consists of ~11.3 million training instances, scattered across ~95K groups and consuming ~13 GB of disk space ▪ System info ▪ Intel Xeon 2.3 GHZ, 1 socket, 6 cores / socket, 2 threads / core, 80 GB system memory, 1 NVIDIA V100 16GB GPU; does not use hyper threads (uses only 6 cores for training) ▪ Training configuration ▪ Used default training configuration on GPU; built 100 trees; used pairwise, ndcg and map ranking algorithms and map to measure the model performance
  • 21. Performance - Numbers Algorithm pairwise ndcg map GPU 1.72 2.54 2.73 CPU 42.37 59.33 46.38 Speedup 24.63x 23.36x 16.99x Ranking + metric computation times (in seconds) - using XGBoost HEAD from 5/18/20
  • 23. XGBoost ▪ How to use XGBoost to train on existing data? ▪ Convert the existing data to the numeric data ▪ Do ETL on existing data
  • 24. XGBoost4j - Spark ▪ Integrate XGBoost with Apache Spark ▪ Use the high-performance algorithm implementation of XGBoost ▪ Leverage the powerful data processing engine of Spark
  • 25. XGBoost + Spark 2.x + Rapids ▪ Rapids cuDF (libCudf + language bindings)
  • 26. XGBoost + Spark 2.x + Rapids ▪ Read CSV/Parquet/Orc directly to GPU memory ▪ Chunks loading ▪ Convert column-major cuDF to sparse, row-major DMatrix
  • 27. Training on GPUs with Spark 2.x val df = spark.read.parquet(path) val featureNames = Seq("f1", "f2", "f3") val vectorAssembler = new VectorAssembler() .setInputCols(featureNames.toArray) .setOutputCol("features") val xgbInput = vectorAssembler .transform(df).select("features", labelColName) val xgbClassifier = new XGBoostClassifier(params) .setLabelCol(labelColName) .setTreeMethod("hist") .setFeaturesCol("features") val model = xgbClassifier.fit(xgbInput) val gpuDf = new GpuDataReader(spark).parquet(path) val featureNames = Seq("f1", "f2", "f3") val xgbClassifier = new XGBoostClassifier(params) .setLabelCol(labelColName) .setTreeMethod("gpu_hist") .setFeaturesCols(featureNames) val model = xgbClassifier.fit(gpuDf) CPU GPU
  • 28. XGBoost + Spark 2.x + Rapids ▪ Training classification model for 17 year mortgage data (190GB)
  • 30. XGBoost + Spark 3.0 + Rapids ▪ Rapids-plugin-4-spark ▪ Apache Spark plugin that leverages GPUs to accelerate processing via Rapids libraries
  • 31. Seamless Integration with Spark 3.0 ▪ Features ▪ Use existing (unmodified) customer code ▪ Spark features that are not GPU enabled run transparently on the CPU ▪ Initial Release - GPU Acceleration of: ▪ Spark Data Frames ▪ Spark SQL ▪ ML/DL training frameworks
  • 32. Rapids Plugin UCX LibrariesRapids C++ Libraries CUDA JNI bindings Mapping From Java/Scala to C++ RAPIDS Accelerator for Spark DISTRIBUTED SCALE-OUT SPARK APPLICATIONS Spark SQL API Spark ShuffleDataFrame API if gpu_enabled(operation, data_type) call-out to RAPIDS else execute standard Spark operation JNI bindings Mapping From Java/Scala to C++ ● Custom Implementation of Spark Shuffle ● Optimized to use RDMA and GPU- to-GPU direct communication APACHE SPARK CORE
  • 33. XGBoost + Spark 3.0 + Rapids ▪ GPU-scheduling ▪ GPU-accelerated data reader ▪ Chunks loading ▪ Operators run on GPU, e.g. filter, sort, join, groupby, etc.
  • 34. Training on GPUs with Spark 3.0 val df = spark.read.parquet(path) val featureNames = Seq("f1", "f2", "f3") val vectorAssembler = new VectorAssembler() .setInputCols(featureNames.toArray) .setOutputCol("features") val xgbInput = vectorAssembler .transform(df).select("features", labelColName) val xgbClassifier = new XGBoostClassifier(params) .setLabelCol(labelColName) .setTreeMethod("hist") .setFeaturesCol("features") val model = xgbClassifier.fit(xgbInput) val df = spark.read.parquet(path) val featureNames = Seq("f1", "f2", "f3") val xgbClassifier = new XGBoostClassifier(params) .setLabelCol(labelColName) .setTreeMethod("gpu_hist") .setFeaturesCols(featureNames) val model = xgbClassifier.fit(df) CPU GPU
  • 35. XGBoost + Spark 3 + Rapids ▪ Training classification model for 23 days Criteo data (1TB)
  • 36. New eBook: Accelerating Spark 3 Download at: nvidia.com/Spark-book In this ebook you'll learn about: ● The data processing evolution, from Hadoop to GPUs and the NVIDIA RAPIDS™ library ● Spark, what it is, what it does, and why it matters ● GPU-acceleration in Spark ● DataFrames and Spark SQL ● A Spark regression example with a random forest classifier ● An example of an end-to-end machine learning workflow GPU-accelerated with XGBoost
  • 37. Reference ▪ XGBoost for Spark 2.x ▪ https://github.com/rapidsai/xgboost/tree/rapids-spark ▪ XGBoost for Spark 3 ▪ https://github.com/rapidsai/xgboost/tree/rapids-spark3.0 ▪ XGBoost example for Spark 2.x ▪ https://github.com/rapidsai/spark-examples/tree/master ▪ XGBoost example for Spark 3 ▪ https://github.com/rapidsai/spark-examples/tree/support-spark3.0 ▪ Blog: Machine learning with XGBoost gets faster with Dataproc on GPUs ▪ Blog: GPU-Accelerated Spark XGBoost – A Major Milestone on the Road to Large-Scale AI
  • 38. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.