SlideShare ist ein Scribd-Unternehmen logo
1 von 38
Downloaden Sie, um offline zu lesen
WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Luca Canali, CERN
Deep Learning Pipelines
for High Energy Physics
using Apache Spark with
Distributed Keras and
Analytics Zoo
#UnifiedDataAnalytics #SparkAISummit
About Luca
3#UnifiedDataAnalytics #SparkAISummit
• Data Engineer at CERN
– Hadoop and Spark service, database services
– 19+ years of experience with data engineering
• Sharing and community
– Blog, notes, tools, contributions to Apache Spark
@LucaCanaliDB – http://cern.ch/canali
CMS
ALICE
ATLAS
LHCb
CERN:
Particle Accelerators (LHC)
High Energy Physics Experiments
Experimental High Energy Physics
is Data Intensive
5
Particle Collisions Physics Discoveries
Large Scale
Computing
https://twiki.cern.ch/twiki/pub/CMSPublic/Hig13002TWiki/HZZ4l_animated.gif
And https://iopscience.iop.org/article/10.1088/1742-6596/455/1/012027
Key Data Processing Challenge
• Proton-proton collisions at LHC experiments happen at 40MHz.
• Hundreds of TB/s of electrical signals that allow physicists to investigate
particle collision events.
• Storage, limited by bandwidth
• Currently, only 1 every ~40K events stored to disk (~10 GB/s).
2018: 5 collisions/beam cross
Current LHC
2026: 400 collisions/beam cross
Future: High-Luminosity LHC upgrade
This can generate up to a petabyte of raw data per second
Reduced to GB/s by filtering in real time
Key is how to select potentially interesting events (trigger systems).
PB/s
40 million
collisions
per
second
(Raw)
100,000
selections
per
second
(L1)
TB/s
1,000
selections
per
second
(L2)
GB/s
Data Flow at LHC Experiments
R&D – Data Pipelines
• Improve the quality of filtering systems
• Reduce false positive rate
• From rule-based algorithms to classifiers based on
Deep Learning
• Advanced analytics at the edge
• Avoid wasting resources in offline computing
• Reduction of operational costs
Particle Classifiers Using Neural Networks
1
63%
2
36%
3
1%
Particle
Classifier
W + j
QCD
t-t̅
• R&D to improve the quality of filtering systems
• Develop a “Deep Learning classifier” to be used by the filtering system
• Goal: Identify events of interest for physics and reduce false positives
• False positives have a cost, as wasted storage bandwidth and computing
• “Topology classification with deep learning to improve real-time event selection at the
LHC”, Nguyen et al. Comput.Softw.Big Sci. 3 (2019) no.1, 12
Deep Learning Pipeline for Physics Data
Data
Ingestion
Feature
Preparation
Model
Development
Training
Read physics
data and
feature
engineering
Prepare
input for
Deep
Learning
network
1. Specify model
topology
2. Tune model
topology on
small dataset
Train the best
model
Technology: the pipeline uses Apache Spark + Analytics Zoo and
TensorFlow/Keras. Code on Python Notebooks.
Analytics Platform at CERN
HEP software
Experiments storage
HDFS
Personal storage
Integrating new “Big Data”
components with existing
infrastructure:
• Software distribution
• Data platforms
Text
Code
Monitoring
Visualizations
Hadoop and Spark Clusters at CERN
• Clusters:
• YARN/Hadoop
• Spark on Kubernetes
• Hardware: Intel based servers, continuous refresh and capacity expansion
Accelerator logging
(part of LHC
infrastructure)
Hadoop - YARN - 30 nodes
(Cores - 1200, Mem - 13 TB, Storage – 7.5 PB)
General Purpose Hadoop - YARN, 65 nodes
(Cores – 2.2k, Mem – 20 TB, Storage – 12.5 PB)
Cloud containers Kubernetes on Openstack VMs, Cores - 250, Mem – 2 TB
Storage: remote HDFS or EOS (for physics data)
Extending Spark to Read Physics Data
• Physics data
• Currently: >300 PBs of Physics data, increasing ~90 PB/year
• Stored in the CERN EOS storage system in ROOT Format and
accessible via XRootD protocol
• Integration with Spark ecosystem
• Hadoop-XRootD connector, HDFS compatible filesystem
• Spark Datasource for ROOT format
JNI
Hadoop
HDFS
APIHadoop-
XRootD
Connector
EOS
Storage
Service XRootD
Client
C++ Java
https://github.com/cerndb/hadoop-xrootd
https://github.com/diana-hep/spark-root
Labeled Data for Training and Test
● Simulated events
● Software simulators are used to generate events
and calculate the detector response
● Raw data contains arrays of simulated particles
and their properties, stored in ROOT format
● 54 million events
Step 1: Data Ingestion
• Read input files: 4.5 TB from custom (ROOT) format
• Feature engineering
• Python and PySpark code, using Jupyter notebooks
• Write output in Parquet format
Output:
• 25 M events
• 950 GB in Parquet format
• Target storage (HDFS)
Input:
• 54 M events
~4.5 TB
• Physics data
storage (EOS)
• Physics data
format (ROOT)
● Filtering
● Multiple filters, keep only events of interest
● Example: “events with one electrons or muon with Pt > 23 Gev”
• Prepare “Low Level Features”
• Every event is associated to a matrix of particles and features (801x19)
• High Level Features (HLF)
• Additional 14 features are computed from low level particle features
• Calculated based on domain-specific knowledge
Feature Engineering
Step 2: Feature Preparation
Features are converted to formats
suitable for training
• One Hot Encoding of categories
• MinMax scaler for High Level Features
• Sorting Low Level Features: prepare input
for the sequence classifier, using a metric
based on physics. This use a Python UDF.
• Undersampling: use the same number of
events for each of the three categories
Result
• 3.6 Million events, 317 GB
• Shuffled and split into training and test
datasets
• Code: in a Jupyter notebook using
PySpark with Spark SQL and ML
Performance and Lessons Learned
• Data preparation is CPU bound
• Heavy serialization-deserialization due to Python UDF
• Ran using 400 cores: data ingestion took ~3 hours,
• It can be optimized, but is it worth it ?
• Use Spark SQL, Scala instead of Python UDF
• Optimization: replacing parts of Python UDF code with Spark SQL
and higher order functions: run time from 3 hours to 2 hours
Neural Network Models and
1. Fully connected feed-forward deep neural
network
• Trained using High Level Features (~1 GB of data)
2. Neural network based on Gated Recurrent
Unit (GRU)
• Trained using Low Level Features (~ 300 GB of
data)
3. Inclusive classifier model
• Combination of (1) + (2)
Complexity +
Classifier
Performance
Hyper-Parameter Tuning– DNN
• Hyper-parameter tuning of the DNN model
• Trained with a subset of the data (cached in memory)
• Parallelized with Spark, using spark_sklearn.grid_search
• And scikit-learn + keras: tensorflow.keras.wrappers.scikit_learn
Deep Learning at Scale with Spark
• Investigations and constraints for our exercise
• How to run deep learning in a Spark data pipeline?
• Neural network models written using Keras API
• Deploy on Hadoop and/or Kubernetes clusters (CPU clusters)
• Distributed deep learning
• GRU-based model is complex
• Slow to train on a single commodity (CPU) server
Spark, Analytics Zoo and BigDL
• Apache Spark
• Leading tool and API for data processing at scale
• Analytics Zoo is a platform for unified analytics
and AI
• Runs on Apache Spark leveraging BigDL / Tensorflow
• For service developers: integration with infrastructure
(hardware, data access, operations)
• For users: Keras APIs to run user models, integration
with Spark data structures and pipelines
• BigDL is an open source distributed deep learning
framework for Apache Spark
BigDL Run as Standard Spark Programs
Spark
Program
DL App on Driver
Spark
Executor
(JVM)
Spark
Task
BigDL lib
Worker
Intel MKL
Standard
Spark jobs
Worker
Worker Worker
Worker
Spark
Executor
(JVM)
Spark
Task
BigDL lib
Worker
Intel MKL
BigDL
library
Spark
jobs
BigDL Program
Standard Spark jobs
• No changes to the Spark or Hadoop clusters needed
Iterative
• Each iteration of the training runs as a Spark job
Data parallel
• Each Spark task runs the same model on a subset of the data (batch)
Source: Intel BigDL Team
BigDL Parameter Synchronization
Source: https://github.com/intel-analytics/BigDL/blob/master/docs/docs/whitepaper.md
Model Development – DNN for HLF
• Model is instantiated using the Keras-
compatible API provided by Analytics Zoo
Model Development – GRU + HLF
A more complex network topology, combining a GRU of Low Level Feature + a
DNN of High Level Features
Distributed Training
Instantiate the estimator using Analytics Zoo / BigDL
The actual training is distributed to Spark executors
Storing the model for later use
Analytics Zoo/BigDL on Spark scales up in the ranges tested
Performance and Scalability of Analytics Zoo/BigDL
Inclusive classifier model DNN model, HLF features
Workload Characterization
• Training with Analytics zoo
• GRU-based model: Distributed training on YARN cluster
• Measure with Spark Dashboard: it is CPU bound
Results – Model Performance
• Trained models with
Analytics Zoo and BigDL
• Met the expected results
for model performance:
ROC curve and AUC
Training with TensorFlow 2.0
• Training and test data
• Converted from Parquet to TFRecord format using Spark
• TensorFlow: data ingestion using tf.data and tf.io
• Distributed training with tf.distribute + tool for K8S: https://github.com/cerndb/tf-spawner
Distributed training with TensorFlow
2.0 on Kubernetes (CERN cloud)
Distributed training of the Keras
model with:
tf.distribute.experimental.
MultiWorkerMirroredStrategy
Performance and Lessons Learned
• Measured distributed training elapsed time
• From a few hours to 11 hours, depending on model, number of epochs and batch
size. Hard to compare different methods and solutions (many parameters)
• Distributed training with BigDL and Analytics Zoo
• Integrates very well with Spark
• Need to cache data in memory
• Noisy clusters with stragglers can add latency to parameter synchronization
• TensorFlow 2.0
• It is straightforward to distribute training on CPUs and GPUs with tf.distribute
• Data flow: Use TFRecord format, read with TensorFlow’s tf.data and tf.io
• GRU training performance on GPU: 10x speedup in TF 2.0
• Training of the Inclusive Classifier on a single P100 in 5 hours
Data and
models from
Research
Input:
labeled
data and
DL models
Feature
engineering
at scale
Distributed
model training
Output: particle
selector model
Hyperparameter
optimization
(Random/Grid
search)
Recap: our Deep Learning Pipeline with Spark
Model Serving and Future Work
• Using Apache Kafka
and Spark?
• FPGA serving DNN models
MODEL
Streaming
platform
MODEL
RTL
translation
FPGA Output
pipeline:
to storage
/ further
online
analysis
Output
pipeline
Summary
• The use case developed addresses the needs for higher
efficiency in event filtering at LHC experiments
• Spark, Python notebooks
• Provide well-known APIs and productive environment for data preparation
• Data preparation performance, lessons learned:
• Use Spark SQL/DataFrame API, avoid Python UDF when possible
• Successfully scaled Deep Learning on Spark clusters
• Using Analytics Zoo and BigDL
• Deployed on existing Intel Xeon-based servers: Hadoop clusters and cloud
• Good results also with Tensorflow 2.0, running on Kubernetes
• Continuous evolution and improvements of DL at scale
• Data preparation and scalable distributed training are key
Acknowledgments
• Matteo Migliorini, Marco Zanetti, Riccardo Castellotti, Michał Bień, Viktor
Khristenko, CERN Spark and Hadoop service, CERN openlab
• Authors of “Topology classification with deep learning to improve real-time
event selection at the LHC”, notably Thong Nguyen, Maurizio Pierini
• Intel team for BigDL and Analytics Zoo: Jiao (Jennie) Wang, Sajan Govindan
– Analytics Zoo: https://github.com/intel-analytics/analytics-zoo
– BigDL: https://software.intel.com/bigdl
References:
– Data and code: https://github.com/cerndb/SparkDLTrigger
– Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
http://arxiv.org/abs/1909.10389
#UnifiedDataAnalytics #SparkAISummit 37
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Weitere ähnliche Inhalte

Was ist angesagt?

Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Dropbox Talk at Netflix ML Platform Meetup Spe 2019
Dropbox Talk at Netflix ML Platform Meetup Spe 2019Dropbox Talk at Netflix ML Platform Meetup Spe 2019
Dropbox Talk at Netflix ML Platform Meetup Spe 2019Faisal Siddiqi
 
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica...
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica...Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica...
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica...Edureka!
 
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdfAuto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdfKundjanasith Thonglek
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Databricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021
Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021
Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021StreamNative
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Optimizing Apache Spark UDFs
Optimizing Apache Spark UDFsOptimizing Apache Spark UDFs
Optimizing Apache Spark UDFsDatabricks
 
Python time series analysis and visualization for self-driving cars
Python time series analysis and visualization for self-driving carsPython time series analysis and visualization for self-driving cars
Python time series analysis and visualization for self-driving carsAndreas Pawlik
 
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with SparkMohammed Guller
 
Python Anaconda Tutorial | Edureka
Python Anaconda Tutorial | EdurekaPython Anaconda Tutorial | Edureka
Python Anaconda Tutorial | EdurekaEdureka!
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 

Was ist angesagt? (20)

Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Dropbox Talk at Netflix ML Platform Meetup Spe 2019
Dropbox Talk at Netflix ML Platform Meetup Spe 2019Dropbox Talk at Netflix ML Platform Meetup Spe 2019
Dropbox Talk at Netflix ML Platform Meetup Spe 2019
 
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica...
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica...Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica...
Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certifica...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdfAuto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021
Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021
Apache Pulsar with MQTT for Edge Computing - Pulsar Summit Asia 2021
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Optimizing Apache Spark UDFs
Optimizing Apache Spark UDFsOptimizing Apache Spark UDFs
Optimizing Apache Spark UDFs
 
CPU vs GPU Comparison
CPU  vs GPU ComparisonCPU  vs GPU Comparison
CPU vs GPU Comparison
 
Python time series analysis and visualization for self-driving cars
Python time series analysis and visualization for self-driving carsPython time series analysis and visualization for self-driving cars
Python time series analysis and visualization for self-driving cars
 
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark
 
Python Anaconda Tutorial | Edureka
Python Anaconda Tutorial | EdurekaPython Anaconda Tutorial | Edureka
Python Anaconda Tutorial | Edureka
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
 

Ähnlich wie Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo

Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Databricks
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Databricks
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningDataWorks Summit
 
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Globus
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache SparkQuantUniversity
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkDatabricks
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light SourcesIan Foster
 
PEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCPEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCHimanshu Bedi
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
 
Scientific
Scientific Scientific
Scientific marpierc
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
 
CaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterCaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterJen Aman
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Databricks
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 

Ähnlich wie Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo (20)

Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
 
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
PEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCPEARC 17: Spark On the ARC
PEARC 17: Spark On the ARC
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Scientific
Scientific Scientific
Scientific
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
CaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark ClusterCaffeOnSpark: Deep Learning On Spark Cluster
CaffeOnSpark: Deep Learning On Spark Cluster
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Kürzlich hochgeladen

办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 

Kürzlich hochgeladen (20)

办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 

Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Luca Canali, CERN Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras and Analytics Zoo #UnifiedDataAnalytics #SparkAISummit
  • 3. About Luca 3#UnifiedDataAnalytics #SparkAISummit • Data Engineer at CERN – Hadoop and Spark service, database services – 19+ years of experience with data engineering • Sharing and community – Blog, notes, tools, contributions to Apache Spark @LucaCanaliDB – http://cern.ch/canali
  • 5. Experimental High Energy Physics is Data Intensive 5 Particle Collisions Physics Discoveries Large Scale Computing https://twiki.cern.ch/twiki/pub/CMSPublic/Hig13002TWiki/HZZ4l_animated.gif And https://iopscience.iop.org/article/10.1088/1742-6596/455/1/012027
  • 6. Key Data Processing Challenge • Proton-proton collisions at LHC experiments happen at 40MHz. • Hundreds of TB/s of electrical signals that allow physicists to investigate particle collision events. • Storage, limited by bandwidth • Currently, only 1 every ~40K events stored to disk (~10 GB/s). 2018: 5 collisions/beam cross Current LHC 2026: 400 collisions/beam cross Future: High-Luminosity LHC upgrade
  • 7. This can generate up to a petabyte of raw data per second Reduced to GB/s by filtering in real time Key is how to select potentially interesting events (trigger systems). PB/s 40 million collisions per second (Raw) 100,000 selections per second (L1) TB/s 1,000 selections per second (L2) GB/s Data Flow at LHC Experiments
  • 8. R&D – Data Pipelines • Improve the quality of filtering systems • Reduce false positive rate • From rule-based algorithms to classifiers based on Deep Learning • Advanced analytics at the edge • Avoid wasting resources in offline computing • Reduction of operational costs
  • 9. Particle Classifiers Using Neural Networks 1 63% 2 36% 3 1% Particle Classifier W + j QCD t-t̅ • R&D to improve the quality of filtering systems • Develop a “Deep Learning classifier” to be used by the filtering system • Goal: Identify events of interest for physics and reduce false positives • False positives have a cost, as wasted storage bandwidth and computing • “Topology classification with deep learning to improve real-time event selection at the LHC”, Nguyen et al. Comput.Softw.Big Sci. 3 (2019) no.1, 12
  • 10. Deep Learning Pipeline for Physics Data Data Ingestion Feature Preparation Model Development Training Read physics data and feature engineering Prepare input for Deep Learning network 1. Specify model topology 2. Tune model topology on small dataset Train the best model Technology: the pipeline uses Apache Spark + Analytics Zoo and TensorFlow/Keras. Code on Python Notebooks.
  • 11. Analytics Platform at CERN HEP software Experiments storage HDFS Personal storage Integrating new “Big Data” components with existing infrastructure: • Software distribution • Data platforms
  • 13. Hadoop and Spark Clusters at CERN • Clusters: • YARN/Hadoop • Spark on Kubernetes • Hardware: Intel based servers, continuous refresh and capacity expansion Accelerator logging (part of LHC infrastructure) Hadoop - YARN - 30 nodes (Cores - 1200, Mem - 13 TB, Storage – 7.5 PB) General Purpose Hadoop - YARN, 65 nodes (Cores – 2.2k, Mem – 20 TB, Storage – 12.5 PB) Cloud containers Kubernetes on Openstack VMs, Cores - 250, Mem – 2 TB Storage: remote HDFS or EOS (for physics data)
  • 14. Extending Spark to Read Physics Data • Physics data • Currently: >300 PBs of Physics data, increasing ~90 PB/year • Stored in the CERN EOS storage system in ROOT Format and accessible via XRootD protocol • Integration with Spark ecosystem • Hadoop-XRootD connector, HDFS compatible filesystem • Spark Datasource for ROOT format JNI Hadoop HDFS APIHadoop- XRootD Connector EOS Storage Service XRootD Client C++ Java https://github.com/cerndb/hadoop-xrootd https://github.com/diana-hep/spark-root
  • 15. Labeled Data for Training and Test ● Simulated events ● Software simulators are used to generate events and calculate the detector response ● Raw data contains arrays of simulated particles and their properties, stored in ROOT format ● 54 million events
  • 16. Step 1: Data Ingestion • Read input files: 4.5 TB from custom (ROOT) format • Feature engineering • Python and PySpark code, using Jupyter notebooks • Write output in Parquet format Output: • 25 M events • 950 GB in Parquet format • Target storage (HDFS) Input: • 54 M events ~4.5 TB • Physics data storage (EOS) • Physics data format (ROOT)
  • 17. ● Filtering ● Multiple filters, keep only events of interest ● Example: “events with one electrons or muon with Pt > 23 Gev” • Prepare “Low Level Features” • Every event is associated to a matrix of particles and features (801x19) • High Level Features (HLF) • Additional 14 features are computed from low level particle features • Calculated based on domain-specific knowledge Feature Engineering
  • 18. Step 2: Feature Preparation Features are converted to formats suitable for training • One Hot Encoding of categories • MinMax scaler for High Level Features • Sorting Low Level Features: prepare input for the sequence classifier, using a metric based on physics. This use a Python UDF. • Undersampling: use the same number of events for each of the three categories Result • 3.6 Million events, 317 GB • Shuffled and split into training and test datasets • Code: in a Jupyter notebook using PySpark with Spark SQL and ML
  • 19. Performance and Lessons Learned • Data preparation is CPU bound • Heavy serialization-deserialization due to Python UDF • Ran using 400 cores: data ingestion took ~3 hours, • It can be optimized, but is it worth it ? • Use Spark SQL, Scala instead of Python UDF • Optimization: replacing parts of Python UDF code with Spark SQL and higher order functions: run time from 3 hours to 2 hours
  • 20. Neural Network Models and 1. Fully connected feed-forward deep neural network • Trained using High Level Features (~1 GB of data) 2. Neural network based on Gated Recurrent Unit (GRU) • Trained using Low Level Features (~ 300 GB of data) 3. Inclusive classifier model • Combination of (1) + (2) Complexity + Classifier Performance
  • 21. Hyper-Parameter Tuning– DNN • Hyper-parameter tuning of the DNN model • Trained with a subset of the data (cached in memory) • Parallelized with Spark, using spark_sklearn.grid_search • And scikit-learn + keras: tensorflow.keras.wrappers.scikit_learn
  • 22. Deep Learning at Scale with Spark • Investigations and constraints for our exercise • How to run deep learning in a Spark data pipeline? • Neural network models written using Keras API • Deploy on Hadoop and/or Kubernetes clusters (CPU clusters) • Distributed deep learning • GRU-based model is complex • Slow to train on a single commodity (CPU) server
  • 23. Spark, Analytics Zoo and BigDL • Apache Spark • Leading tool and API for data processing at scale • Analytics Zoo is a platform for unified analytics and AI • Runs on Apache Spark leveraging BigDL / Tensorflow • For service developers: integration with infrastructure (hardware, data access, operations) • For users: Keras APIs to run user models, integration with Spark data structures and pipelines • BigDL is an open source distributed deep learning framework for Apache Spark
  • 24. BigDL Run as Standard Spark Programs Spark Program DL App on Driver Spark Executor (JVM) Spark Task BigDL lib Worker Intel MKL Standard Spark jobs Worker Worker Worker Worker Spark Executor (JVM) Spark Task BigDL lib Worker Intel MKL BigDL library Spark jobs BigDL Program Standard Spark jobs • No changes to the Spark or Hadoop clusters needed Iterative • Each iteration of the training runs as a Spark job Data parallel • Each Spark task runs the same model on a subset of the data (batch) Source: Intel BigDL Team
  • 25. BigDL Parameter Synchronization Source: https://github.com/intel-analytics/BigDL/blob/master/docs/docs/whitepaper.md
  • 26. Model Development – DNN for HLF • Model is instantiated using the Keras- compatible API provided by Analytics Zoo
  • 27. Model Development – GRU + HLF A more complex network topology, combining a GRU of Low Level Feature + a DNN of High Level Features
  • 28. Distributed Training Instantiate the estimator using Analytics Zoo / BigDL The actual training is distributed to Spark executors Storing the model for later use
  • 29. Analytics Zoo/BigDL on Spark scales up in the ranges tested Performance and Scalability of Analytics Zoo/BigDL Inclusive classifier model DNN model, HLF features
  • 30. Workload Characterization • Training with Analytics zoo • GRU-based model: Distributed training on YARN cluster • Measure with Spark Dashboard: it is CPU bound
  • 31. Results – Model Performance • Trained models with Analytics Zoo and BigDL • Met the expected results for model performance: ROC curve and AUC
  • 32. Training with TensorFlow 2.0 • Training and test data • Converted from Parquet to TFRecord format using Spark • TensorFlow: data ingestion using tf.data and tf.io • Distributed training with tf.distribute + tool for K8S: https://github.com/cerndb/tf-spawner Distributed training with TensorFlow 2.0 on Kubernetes (CERN cloud) Distributed training of the Keras model with: tf.distribute.experimental. MultiWorkerMirroredStrategy
  • 33. Performance and Lessons Learned • Measured distributed training elapsed time • From a few hours to 11 hours, depending on model, number of epochs and batch size. Hard to compare different methods and solutions (many parameters) • Distributed training with BigDL and Analytics Zoo • Integrates very well with Spark • Need to cache data in memory • Noisy clusters with stragglers can add latency to parameter synchronization • TensorFlow 2.0 • It is straightforward to distribute training on CPUs and GPUs with tf.distribute • Data flow: Use TFRecord format, read with TensorFlow’s tf.data and tf.io • GRU training performance on GPU: 10x speedup in TF 2.0 • Training of the Inclusive Classifier on a single P100 in 5 hours
  • 34. Data and models from Research Input: labeled data and DL models Feature engineering at scale Distributed model training Output: particle selector model Hyperparameter optimization (Random/Grid search) Recap: our Deep Learning Pipeline with Spark
  • 35. Model Serving and Future Work • Using Apache Kafka and Spark? • FPGA serving DNN models MODEL Streaming platform MODEL RTL translation FPGA Output pipeline: to storage / further online analysis Output pipeline
  • 36. Summary • The use case developed addresses the needs for higher efficiency in event filtering at LHC experiments • Spark, Python notebooks • Provide well-known APIs and productive environment for data preparation • Data preparation performance, lessons learned: • Use Spark SQL/DataFrame API, avoid Python UDF when possible • Successfully scaled Deep Learning on Spark clusters • Using Analytics Zoo and BigDL • Deployed on existing Intel Xeon-based servers: Hadoop clusters and cloud • Good results also with Tensorflow 2.0, running on Kubernetes • Continuous evolution and improvements of DL at scale • Data preparation and scalable distributed training are key
  • 37. Acknowledgments • Matteo Migliorini, Marco Zanetti, Riccardo Castellotti, Michał Bień, Viktor Khristenko, CERN Spark and Hadoop service, CERN openlab • Authors of “Topology classification with deep learning to improve real-time event selection at the LHC”, notably Thong Nguyen, Maurizio Pierini • Intel team for BigDL and Analytics Zoo: Jiao (Jennie) Wang, Sajan Govindan – Analytics Zoo: https://github.com/intel-analytics/analytics-zoo – BigDL: https://software.intel.com/bigdl References: – Data and code: https://github.com/cerndb/SparkDLTrigger – Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics http://arxiv.org/abs/1909.10389 #UnifiedDataAnalytics #SparkAISummit 37
  • 38. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT