Simplify Data Conversion from Spark to TensorFlow and PyTorch

In this talk, I introduce an open-source tool built by our team that simplifies data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GB, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you clean and preprocess it with Spark, so the dataset now lives in a Spark DataFrame. When it comes to training, the question is: how do I convert my Spark DataFrame to a format my TensorFlow model can read?

The existing conversion process can be tedious. To get from an Apache Spark DataFrame to a TensorFlow Dataset, you either save the DataFrame to a distributed filesystem in Parquet format and load the converted data with a third-party tool such as Petastorm, or save it directly as TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention to vector columns in the Spark DataFrame. In short, this engineering friction greatly reduces data scientists' productivity.

The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious conversion steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I use an example to show how to train a TensorFlow model with the Spark Dataset Converter and how simple it is to go from single-node training to distributed training on Databricks.

Simplify Data Conversion from Spark to Deep Learning
Liang Zhang
Software Engineer @ Databricks

About Me
Liang Zhang
▪ Machine Learning Team @ Databricks
▪ Master's degree from Carnegie Mellon University
▪ linkedin.com/in/liangz1/

Agenda
▪ Why should we care about data conversion between Spark and deep learning frameworks?
▪ Pain points
▪ Overview of the Spark Dataset Converter
▪ Demo
▪ Best Practices

Motivation: Data Conversion from Spark to DL
Spark DataFrame -> ? -> TensorFlow / PyTorch
• Images from driving camera: detect traffic lights
• Large amount of data (TBs)
• New images arriving every day
• Data cleaning and labeling
• Train the model with all available data and periodically re-train with new data
• Predict the label of new images

Pain points: Data Conversion from Spark to Deep Learning frameworks
• Single-node training:
  • Collect a sample of data to the driver in a pandas DataFrame
• Distributed training:
  • Save the Spark DataFrame to TFRecord files and parse the serialized data in the TFRecords using TensorFlow
  • Save the Spark DataFrame to Parquet files and write your own custom PyTorch DataLoader to load the partitions
• Hard to migrate from single-node to distributed training
• Many lines of extra code to save, load and parse intermediate files

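Spark Dataset Converter API Overview
Spark DataFrame -> Spark Dataset Converter -> TensorFlow Dataset / PyTorch DataLoader

    from petastorm.spark import make_spark_converter

    # df is a Spark DataFrame that has already been cleaned and preprocessed
    converter = make_spark_converter(df)

    # TensorFlow: the context manager yields a tf.data.Dataset
    with converter.make_tf_dataset() as dataset:
        tf_model.fit(dataset)

    # PyTorch: the context manager yields a DataLoader
    with converter.make_torch_dataloader() as dataloader:
        train(torch_model, dataloader)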
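A slightly fuller sketch of the TensorFlow path may help, since the slide omits the setup around the converter. It assumes Petastorm's `SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF` setting is used to tell the converter where to write its intermediate Parquet files and that this setting expects a URL with a scheme. The helper names `as_cache_dir_url` and `train_tf_from_spark`, and the column names `features` and `label`, are hypothetical and not part of the talk.

```python
def as_cache_dir_url(path: str) -> str:
    """Return a cache directory URL, adding file:// to a bare absolute path.

    Assumption: the converter's parent cache directory must be given as a URL
    with a scheme, such as file:///dbfs/tmp/cache or hdfs://namenode/cache.
    """
    if "://" in path:
        return path
    if not path.startswith("/"):
        raise ValueError("expected an absolute path or a URL, got: " + path)
    return "file://" + path


def train_tf_from_spark(spark, df, tf_model, cache_dir,
                        feature_col="features", label_col="label",
                        batch_size=32):
    """Train tf_model for one pass over df (hypothetical helper)."""
    # Imported lazily so as_cache_dir_url works without Petastorm installed.
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
                   as_cache_dir_url(cache_dir))
    converter = make_spark_converter(df)
    try:
        # num_epochs=1 gives a finite dataset, so fit() stops after one pass.
        with converter.make_tf_dataset(batch_size=batch_size,
                                       num_epochs=1) as dataset:
            # Each element is a namedtuple of columns; Keras wants (x, y).
            pairs = dataset.map(
                lambda row: (getattr(row, feature_col), getattr(row, label_col)))
            tf_model.fit(pairs, epochs=1)
    finally:
        # Remove the cached Parquet files once training is done.
        converter.delete()
```

Calling `as_cache_dir_url("/dbfs/tmp/petastorm/cache")` returns `file:///dbfs/tmp/petastorm/cache`; a value that already carries a scheme is passed through unchanged.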

Spark Dataset Converter API
ETL: Spark DataFrame -> Found cached Parquet file?
• No: cache the DataFrame in a Parquet file (data.parquet on HDFS/DBFS)
• Yes: load the cached Parquet file with Petastorm
Training: tf.data.Dataset / PyTorch DataLoader
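The cache decision in this flow can be illustrated with a small, pure-Python sketch. This is not Petastorm's implementation: the class `ParquetCacheSketch`, the use of a SHA-256 fingerprint of a plan string, and the directory layout are all assumptions made for illustration. The only idea taken from the slides is that a DataFrame with the same query plan reuses the Parquet file that was already written.

```python
import hashlib
from typing import Callable, Dict


class ParquetCacheSketch:
    """Toy model of 'found cached parquet file?' (illustrative only)."""

    def __init__(self, parent_dir: str) -> None:
        self.parent_dir = parent_dir.rstrip("/")
        # Maps a fingerprint of the query plan to the cached file path.
        self._entries: Dict[str, str] = {}

    def get_or_create(self, query_plan: str,
                      materialize: Callable[[str], None]) -> str:
        """Return the cached path for query_plan, writing it only if missing."""
        key = hashlib.sha256(query_plan.encode("utf-8")).hexdigest()[:16]
        if key in self._entries:
            # Yes branch: reuse the existing Parquet file.
            return self._entries[key]
        # No branch: write the DataFrame once, then remember where it went.
        path = f"{self.parent_dir}/{key}/data.parquet"
        materialize(path)
        self._entries[key] = path
        return path


cache = ParquetCacheSketch("file:///tmp/cache/")
written = []  # stands in for df.write.parquet(path)
first = cache.get_or_create("Project [a] <- Scan t", written.append)
again = cache.get_or_create("Project [a] <- Scan t", written.append)
other = cache.get_or_create("Project [b] <- Scan t", written.append)
```

Here `first` and `again` are the same path and only two writes happen in total, because the second call with an identical plan takes the "Yes" branch.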

Spark Dataset Converter Features
• Cache intermediate files
  ▪ Recognize cached Spark DataFrame by checking the analyzed query plan
  ▪ Automatic cache cleaning at program exit
• Easy migration to distributed
  ▪ Change two arguments to migrate your data loading code from single-node to distributed setting
• MLlib vector handling
  ▪ Convert MLlib vectors to 1D arrays automatically
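As a sketch of the "change two arguments" point: the slide does not name the arguments, so this assumes they are the `cur_shard` and `shard_count` keyword arguments that Petastorm's reader accepts and that `make_tf_dataset` and `make_torch_dataloader` forward to it. The helper `shard_kwargs` is hypothetical.

```python
from typing import Dict, Optional


def shard_kwargs(rank: Optional[int] = None,
                 world_size: Optional[int] = None) -> Dict[str, Optional[int]]:
    """Build the sharding arguments for single-node or distributed loading.

    With no arguments, both values are None and the whole dataset is read
    (single-node). With a rank and world size, each worker reads one shard.
    """
    if rank is None and world_size is None:
        return {"cur_shard": None, "shard_count": None}
    if rank is None or world_size is None:
        raise ValueError("rank and world_size must be given together")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return {"cur_shard": rank, "shard_count": world_size}


# Single-node:  converter.make_tf_dataset(**shard_kwargs())
# Distributed (for example with Horovod, where hvd is horovod.tensorflow):
#   converter.make_tf_dataset(**shard_kwargs(hvd.rank(), hvd.size()))
```

The rest of the data loading code stays the same in both settings; only these two values change.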

How to use the Spark Dataset Converter API?
(demo)

Demo notebooks
• Image Classification
  • Spark to TensorFlow Dataset
    https://docs.databricks.com/_static/notebooks/deep-learning/petastorm-spark-converter-tensorflow.html
  • Spark to PyTorch DataLoader
    https://docs.databricks.com/_static/notebooks/deep-learning/petastorm-spark-converter-pytorch.html

Best Practices with Spark Dataset Converter
• Image data decoding and preprocessing
  • Decode image bytes and preprocess in TransformSpec, not in Spark
  • Preprocessing can happen at four stages: Spark -> TransformSpec -> Dataset.map -> in the model (GPU)
• Generate infinite batches using num_epochs=None
  • In distributed training, this guarantees that every worker gets exactly the same amount of data.
• Manage the lifecycle of cache data
  • On a local laptop or in a scheduled job on Databricks, the cache files are automatically deleted when the Python process exits.
  • In a Databricks notebook, we recommend configuring lifecycle rules for the underlying S3 buckets that store the cache files.
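The first two practices can be sketched as follows. The column names `content` and `label`, the image shape, and the helper names are assumptions for illustration. `TransformSpec` is Petastorm's hook for transforming each pandas batch on the reader side, and with `num_epochs=None` the dataset never ends, so the number of steps per epoch has to be computed by the caller.

```python
import io
from typing import Tuple

IMG_SHAPE: Tuple[int, int, int] = (224, 224, 3)  # height, width, channels (assumed)


def steps_per_epoch(num_rows: int, batch_size: int, num_workers: int = 1) -> int:
    """Steps each worker runs so that all workers consume the same amount."""
    return max(1, num_rows // (batch_size * num_workers))


def decode_image(content: bytes):
    """Decode raw image bytes into a float32 array scaled to [0, 1]."""
    import numpy as np
    from PIL import Image

    image = Image.open(io.BytesIO(content)).convert("RGB")
    image = image.resize((IMG_SHAPE[1], IMG_SHAPE[0]))  # PIL takes (width, height)
    return np.asarray(image, dtype=np.float32) / 255.0


def transform_batch(pd_batch):
    """Runs in the reader workers on each pandas batch, not in Spark."""
    pd_batch["features"] = pd_batch["content"].map(decode_image)
    return pd_batch.drop(labels=["content"], axis=1)


def make_transform_spec():
    import numpy as np
    from petastorm.transform import TransformSpec

    return TransformSpec(
        transform_batch,
        # Declare the new column: (name, dtype, shape, nullable).
        edit_fields=[("features", np.float32, IMG_SHAPE, False)],
        selected_fields=["features", "label"],
    )


# Usage sketch (converter comes from make_spark_converter(df)):
#   with converter.make_tf_dataset(transform_spec_fn=make_transform_spec(),
#                                  batch_size=32, num_epochs=None) as dataset:
#       model.fit(dataset, steps_per_epoch=steps_per_epoch(len(converter), 32))
```

For a dataset of 1,000 rows, a batch size of 32 and 4 workers, `steps_per_epoch` gives 7 steps per worker, so no worker waits on another that has run out of data.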

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
