1 © Hortonworks Inc. 2011–2018. All rights reserved
Hadoop {Submarine} Project:
Running deep learning workloads on YARN
Wangda Tan (wangda@apache.org)
About me
• Wangda Tan
• Engineering Manager of YARN team @ Hortonworks.
• Apache Hadoop PMC member and committer, working on Hadoop since 2011.
• Main areas of work: scheduler, deep learning on YARN, GPUs on YARN, etc.
Agenda
• Machine Learning in production.
• With data scientist hat – requirements.
• {Submarine} project introduction with demo.
• How other YARN features help.
• Status, plan and case study.
Machine Learning in Production
Image courtesy of the NOAA Office of Ocean Exploration and Research, Gulf of Mexico 2018.
Machine Learning in a tutorial
$ nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
Then open http://localhost:8888/ in your browser.
Machine Learning in a Unified Platform
“Hidden Technical Debt in Machine Learning Systems”, Google
Data pipelines for Machine Learning (Big Data)
ETL
Data Exploration
Join / Sampling / Feature Extraction
Split train / test data sets, etc.
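The last pipeline stage, splitting a data set into train and test parts, can be sketched in a few lines. This is a minimal single-machine illustration using only the Python standard library; the function name `split_train_test`, the 20% test fraction and the seed are assumptions made for the example, not something the slides prescribe. On a real cluster this step would typically run as a Spark or MapReduce job.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(10))
train, test = split_train_test(records)
print(len(train), len(test))  # prints: 8 2
```

Fixing the seed matters when several team members share the same data set: everyone evaluates against the same held-out records.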
Training Hierarchical Models
Word Embedding Model
Food picture classifier Model
Ensemble Model
"Burger is great. However, onion rings were overcooked."
(Image/Photo from Yelp)
With Data Scientist Hat – Requirements
Who are they?
• Based on conversations with many machine learning engineers and data scientists:
• What are they familiar with?
• Linear algebra, statistics, machine learning algorithms and models, deep neural networks (DNN/CNN/RNN), basic programming skills, etc.
• What are they not familiar with?
• System environments and programming
• Resource management and scheduling
• Networking and storage, etc.
What they use
• Liblinear
• LibFM
• Scikit-learn
• XGBoost/LightGBM
• Spark MLlib
• TensorFlow/PyTorch/MXNet
How do they work?
• Where are the training and test datasets?
• HDFS / S3
• Shared between team members
• Distributed preprocessing with MapReduce/Spark
• How do they run experiments?
• Sample from the full dataset
• Choose state-of-the-art models and tune hyper-parameters with cross-validation
• Single node with CPUs
• Single node with GPUs
• Train with the best parameters on the full dataset
• Multi-node with CPUs and GPUs
• Push the model into serving
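The experiment loop above (sample, tune hyper-parameters with cross-validation, then train with the best setting) can be sketched as follows. This is a plain-Python illustration, not any library's API: `k_fold_indices`, `cross_validate`, `pick_best` and the toy `shrink` hyper-parameter are all hypothetical names, and the "model" simply predicts a scaled mean so that the example stays self-contained.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("need 2 <= k <= n_samples")
    start = 0
    for fold in range(k):
        # Spread the remainder over the first n_samples % k folds.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(params, data, k, train_and_score):
    """Average the score of one hyper-parameter setting over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        train = [data[i] for i in train_idx]
        val = [data[i] for i in val_idx]
        scores.append(train_and_score(params, train, val))
    return sum(scores) / len(scores)


def pick_best(candidates, data, k, train_and_score):
    """Return the candidate with the highest cross-validated score."""
    return max(candidates, key=lambda p: cross_validate(p, data, k, train_and_score))


def toy_train_and_score(params, train, val):
    """Toy stand-in for training: predict shrink * mean(train); score = -MSE."""
    prediction = params["shrink"] * sum(train) / len(train)
    return -sum((y - prediction) ** 2 for y in val) / len(val)


sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
candidates = [{"shrink": 0.5}, {"shrink": 1.0}, {"shrink": 1.5}]
best = pick_best(candidates, sample, k=3, train_and_score=toy_train_and_score)
print(best)
```

In practice `train_and_score` would wrap a real training run (for example a scikit-learn or TensorFlow model), and the winning setting would then be used for the full-dataset, multi-node training step.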
{Submarine}
Hadoop {Submarine} Project Introduction
The only machine that can take humans to the deep
Things to do to support an easy-to-use machine learning platform
What machine learning engineers see
What infra engineers see
{Submarine}
• So ... what can Submarine do?
{Submarine} - “Launch distributed TF job like hello world”
• (Only prerequisite) Set up a YARN cluster (3.1.0+).
• Run distributed TF training with one command:
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-job-001 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=2 \
  --worker_launch_cmd "cmd for worker ..." \
  --num_ps 2 \
  --ps_resources memory=4G,vcores=2,gpu=0 \
  --ps_launch_cmd "cmd for ps"
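When the same job is submitted repeatedly with different settings, it can help to assemble the command line programmatically. The sketch below only builds and prints the argument list; it uses the flags shown on the slide and nothing else. The helper name `build_submarine_job_command` is hypothetical, and the jar name and Docker image are left as the slide's placeholders.

```python
import shlex


def build_submarine_job_command(jar, name, docker_image, options):
    """Return the argv list for `yarn jar <jar> job run ...`."""
    argv = ["yarn", "jar", jar, "job", "run",
            "--name", name, "--docker_image", docker_image]
    for flag, value in options.items():
        argv += ["--" + flag, str(value)]
    return argv


argv = build_submarine_job_command(
    jar="hadoop-yarn-applications-submarine-<version>.jar",  # placeholder from the slide
    name="tf-job-001",
    docker_image="<your docker image>",  # placeholder from the slide
    options={
        "input_path": "hdfs://default/dataset/cifar-10-data",
        "checkpoint_path": "hdfs://default/tmp/cifar-10-jobdir",
        "num_workers": 2,
        "worker_resources": "memory=8G,vcores=2,gpu=2",
        "worker_launch_cmd": "cmd for worker ...",
        "num_ps": 2,
        "ps_resources": "memory=4G,vcores=2,gpu=0",
        "ps_launch_cmd": "cmd for ps",
    },
)
# shlex.join quotes arguments containing spaces or shell metacharacters.
print(shlex.join(argv))
```

Keeping the command as a list until the last moment avoids quoting mistakes in the launch commands, which usually contain spaces.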
{Submarine} - “View your job history like a king/queen”
• Run a service that monitors the training progress of all TF jobs in one TensorBoard dashboard, with one command:
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tensorboard-service-001 --docker_image <your docker image> \
  --tensorboard
{Submarine} - “Cloud Notebook for Data Scientists”
• Run a notebook (like Zeppelin) that leverages GPUs with one command:
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name zeppelin-notebook-001 --docker_image <your docker image> \
  --num_workers 1 \
  --worker_resources memory=8G,vcores=2,gpu=4 \
  --worker_launch_cmd "/zeppelin/bin/zeppelin.sh" \
  --quicklink Zeppelin_Notebook=http://master-0:8080
{Submarine} - “Same hello world examples for MXNet/PyTorch”
• Run MXNet/PyTorch training with one command:
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name xyz-job-001 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --num_workers 1 \
  --worker_resources memory=8G,vcores=2,gpu=2 \
  --worker_launch_cmd "cmd for MXNet/PyTorch"
{Submarine} Project Requirements
• Run deep learning workloads on the same cluster as analytics, stream processing, etc.
• Allow jobs to easily access data/models in HDFS and other storage systems.
• Support running distributed TensorFlow (and similar) jobs with simple configs.
• Support running user-specified Docker images.
• Support specifying GPUs and other resources.
• Support launching TensorBoard for training jobs if the user requests it.
Demo
Targeted features
Job Management:
- Start/Stop standalone TF/MXNet/PyTorch
- Start/Stop distributed TF / MXNet (WIP) / PyTorch (WIP)
- Stop
- Monitoring (Tensorboard / history)
Model Management (WIP)
- Checkpoint / Saved model
- Model serving.
Library dependency management
- BYOD (bring your own docker image)
- Python library dependencies (WIP)
Handled by YARN:
- Log
- Job monitoring
- Best job scheduler: SLA, Quota, etc
Submarine
Architecture
How other YARN features help
GPU support on YARN (Apache Hadoop 3.1.0)
• Why is isolation needed?
• Multiple processes sharing a single GPU will:
• Be serialized.
• Easily cause OOM errors.
• GPU isolation on YARN:
• Granularity is per GPU device.
• Cgroups / Docker are used to enforce the isolation.
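Per-device granularity means a container is given whole GPUs and never shares one with another container. The bookkeeping behind that idea can be sketched as below. This is an illustration only, not YARN's implementation: the class `GpuDeviceAllocator` and its methods are hypothetical, and the actual enforcement (cgroups or Docker device options) is outside the scope of the sketch.

```python
class GpuDeviceAllocator:
    """Hands out whole GPU devices; a device belongs to at most one container."""

    def __init__(self, device_ids):
        self._free = list(device_ids)
        self._assigned = {}  # container id -> list of device ids

    def allocate(self, container_id, num_gpus):
        """Reserve num_gpus whole devices for a container, or raise."""
        if container_id in self._assigned:
            raise ValueError(f"{container_id} already holds GPUs")
        if num_gpus > len(self._free):
            raise RuntimeError("not enough free GPU devices")
        devices = self._free[:num_gpus]
        self._free = self._free[num_gpus:]
        self._assigned[container_id] = devices
        return devices

    def release(self, container_id):
        """Return a container's devices to the free pool."""
        self._free.extend(self._assigned.pop(container_id))

    def free_devices(self):
        return list(self._free)


allocator = GpuDeviceAllocator([0, 1, 2, 3])
first = allocator.allocate("container_1", 2)
second = allocator.allocate("container_2", 2)
print(first, second, allocator.free_devices())  # prints: [0, 1] [2, 3] []
```

A third request at this point fails instead of oversubscribing a device, which is what avoids the serialization and OOM problems listed above.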
Docker + GPU support on YARN (Apache Hadoop 3.1.0)
• Most machine learning platforms have Python/R/cuDNN/CUDA dependencies.
• Docker solves the messy dependency issues.
• But it may introduce problems for GPU base libraries.
• nvidia-docker-plugin mounts the Nvidia driver, etc. when the container is launched.
• YARN supports Docker as well as nvidia-docker-plugin.
[Figure: two container stacks (Tensorflow 1.2, CUDA Library 5.0, Ubuntu 14.04) on a host OS with GPU Base Lib v1. In one, the host's GPU Base Lib v1 is volume-mounted into the container. In the other, the image carries GPU Base Lib v2 and the launch fails.]
Scheduler + Scale
• Global scheduling enhancements (YARN-5139):
• The YARN scheduler can allocate 3k+ containers per second ≈ 10 million allocations per hour.
• 10X throughput gains
• Scale:
• Microsoft: 52K nodes in a single cluster (RM federation)
• https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/
• Exabytes of data are processed daily. More than 15,000 developers use it across the company.
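The "≈ 10 million allocations per hour" figure follows directly from the per-second rate. A quick check, taking the slide's "3k+" as exactly 3,000 containers per second (an assumed lower bound):

```python
containers_per_second = 3_000  # "3k+" on the slide, taken as a lower bound
seconds_per_hour = 60 * 60

allocations_per_hour = containers_per_second * seconds_per_hour
print(f"{allocations_per_hour:,} allocations per hour")  # prints: 10,800,000 allocations per hour
```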
Scheduler: Placement constraints
• YARN can now support many more use cases:
• Co-locate the allocations of a job on the same rack (affinity)
• Spread allocations across machines (anti-affinity) to minimize resource interference
• Allow up to a specific number of allocations in a node group (cardinality)
• Placement constraints improve performance considerably.
[Figure: TensorFlow ML workflow with 1M iterations using 32 workers, with varying workers per node. Source: "Medea: Scheduling of Long Running Applications in Shared Production Clusters" (Panagiotis/Konstantinos, et al.)]
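The three constraint types (affinity, anti-affinity, cardinality) can be stated precisely as checks over a placement. The sketch below is a plain-Python illustration of their meaning, not YARN's constraint API: the function names, the `placement` mapping from allocation to node, and the node and rack names are all hypothetical.

```python
from collections import Counter


def satisfies_affinity(placement, rack_of):
    """Affinity: every allocation sits on a node of the same rack."""
    return len({rack_of[node] for node in placement.values()}) <= 1


def satisfies_anti_affinity(placement):
    """Anti-affinity: no two allocations share a node."""
    nodes = list(placement.values())
    return len(nodes) == len(set(nodes))


def satisfies_cardinality(placement, group_of, max_per_group):
    """Cardinality: at most max_per_group allocations in any node group."""
    counts = Counter(group_of[node] for node in placement.values())
    return all(count <= max_per_group for count in counts.values())


rack_of = {"node1": "rack1", "node2": "rack1", "node3": "rack2"}
placement = {"worker-0": "node1", "worker-1": "node2"}

print(satisfies_affinity(placement, rack_of))          # prints: True
print(satisfies_anti_affinity(placement))              # prints: True
print(satisfies_cardinality(placement, rack_of, 1))    # prints: False
```

Here the two workers share a rack (affinity holds) but run on different machines (anti-affinity holds); limiting each rack to one allocation would reject this placement.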
Finally, let's get it running on YARN
[Figure: cluster nodes with 128 G of memory each, some running LLAP and some equipped with GPUs.]
{Submarine}
Status & Case Study
Status & Plans
• The alpha solution is merged to trunk (part of the 3.2.0 release) and is still under active development/testing. Umbrella JIRA: YARN-8135.
• Submarine can run on Apache Hadoop 3.1+ releases (HDP 3.0+). It ships as a single jar.
• Supported runtime: YARN native service, training in Docker containers.
• Work is under way on an adaptor to make TonY a runtime of Submarine.
• TonY is open source: https://github.com/linkedin/TonY
Netease (NASDAQ: NTES) Case Study
• One of the largest online game/news/music providers in China.
• A YARN cluster of ~6k nodes in total.
• 100k jobs per day, 40% of which are Spark jobs.
• 1000 ML jobs per day.
• ML jobs run in a separate GPU K8s cluster (~500 nodes); all data comes from HDFS and is processed by Spark, etc.
• Existing problems:
• Low utilization (YARN tasks cannot leverage this cluster).
• High maintenance cost (the separate cluster must be managed).
• Working with the community to develop Submarine and verifying it on a 20-node GPU cluster.
• Plans to move all workloads to Submarine in the future.
Thanks!
• Source code / doc directory: https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine
• Umbrella JIRA: https://issues.apache.org/jira/browse/YARN-8135
• Try it and give us feedback!
• We need your contributions: please file sub-tickets under YARN-8135 and/or create a pull request at https://github.com/apache/hadoop.

The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
October 2016 HUG: Pulsar,  a highly scalable, low latency pub-sub messaging s...
 

Kürzlich hochgeladen

Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Kürzlich hochgeladen (20)

Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda Tan, Hortonworks

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Hadoop {Submarine} Project: Running deep learning workloads on YARN Wangda Tan (wangda@apache.org)
  • 2. About me • Wangda Tan • Engineering Manager of the YARN team at Hortonworks. • Apache Hadoop PMC member and committer, working on Hadoop since 2011. • Main areas of work: scheduler, deep learning on YARN, GPUs on YARN, etc.
  • 3. Agenda • Machine learning in production. • With the data scientist hat on: requirements. • {Submarine} project introduction with demo. • How other YARN features help. • Status, plan and case study.
  • 4. Machine Learning in Production Image courtesy of the NOAA Office of Ocean Exploration and Research, Gulf of Mexico 2018.
  • 5. Machine learning in a tutorial • Run: $ nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu • Then open http://localhost:8888/ in your browser.
  • 6. Machine Learning in a Unified Platform (source: “Hidden Technical Debt in Machine Learning Systems”, Google)
  • 7. Data pipelines for Machine Learning (Big Data) • ETL • Data exploration • Join / sampling / feature extraction • Split into training and test data sets, etc.
  • 8. Training Hierarchical Models • Word embedding model • Food picture classifier model • Ensemble model • Example review: "Burger is great. however onion rings were over cooked" (image/photo from Yelp)
  • 9. With Data Scientist Hat – Requirements Image courtesy of the NOAA Office of Ocean Exploration and Research, Gulf of Mexico 2018.
  • 10. Who are they? • Based on conversations with many machine learning engineers and data scientists: • What are they familiar with? • Linear algebra, statistics, machine learning algorithms and models, deep neural networks (DNN/CNN/RNN), basic programming skills, etc. • What are they not familiar with? • System environment and programming • Resource management and scheduling • Networking and storage, etc.
  • 11. What they use • Liblinear • LibFM • Scikit-learn • XGBoost/LightGBM • Spark MLlib • TensorFlow/PyTorch/MXNet
  • 12. How do they work? • Where are the training and test datasets? • HDFS / S3 • Shared between team members • Distributed preprocessing with MapReduce/Spark • How do they run experiments? • Sample from the full dataset • Choose state-of-the-art models and tune hyper-parameters with cross-validation • Single node with CPUs • Single node with GPUs • Train with the best parameters on the full dataset • Multi-node with CPUs and GPUs • Push the model into serving {Submarine}
  • 13. Hadoop {Submarine} Project Introduction Image courtesy of the NOAA Office of Ocean Exploration and Research, Gulf of Mexico 2018. The only machine that can take humans to the deep
  • 14. Things to do to support an easy-to-use machine learning platform: what the machine learning engineer sees vs. what the infra engineer sees
  • 15. {Submarine} • So ... what can Submarine do?
  • 16. {Submarine} - “Launch distributed TF job like hello world” • (Only prerequisite) Set up a YARN cluster (3.1.0+). • Run distributed TF training with one command: yarn jar hadoop-yarn-applications-submarine-<version>.jar job run --name tf-job-001 --docker_image <your docker image> --input_path hdfs://default/dataset/cifar-10-data --checkpoint_path hdfs://default/tmp/cifar-10-jobdir --num_workers 2 --worker_resources memory=8G,vcores=2,gpu=2 --worker_launch_cmd "cmd for worker ..." --num_ps 2 --ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cmd for ps"
  • 17. {Submarine} – “View your job history like a king/queen” • Run a service that monitors the training progress of all TF jobs in one TensorBoard dashboard, with one command: yarn jar hadoop-yarn-applications-submarine-<version>.jar job run --name tensorboard-service-001 --docker_image <your docker image> --tensorboard
  • 18. {Submarine} - “Cloud Notebook for Data Scientists” • Run a notebook (like Zeppelin) leveraging GPUs with one command: yarn jar hadoop-yarn-applications-submarine-<version>.jar job run --name zeppelin-notebook-001 --docker_image <your docker image> --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=4 --worker_launch_cmd "/zeppelin/bin/zeppelin.sh" --quicklink Zeppelin_Notebook=http://master-0:8080
  • 19. {Submarine} - “Same hello world examples for MXNet/PyTorch” • Run MXNet/PyTorch training with one command: yarn jar hadoop-yarn-applications-submarine-<version>.jar job run --name xyz-job-001 --docker_image <your docker image> --input_path hdfs://default/dataset/cifar-10-data --checkpoint_path hdfs://default/tmp/cifar-10-jobdir --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=2 --worker_launch_cmd "cmd for MXNet/PyTorch"
  • 20. {Submarine} Project Requirements • Run deep learning workloads on the same cluster as analytics, stream processing, etc. • Give jobs easy access to data/models in HDFS and other storage. • Support running distributed TensorFlow and similar jobs with simple configs. • Support running user-specified Docker images. • Support specifying GPU and other resources. • Support launching TensorBoard for training jobs if the user requests it.
  • 21. Demo
  • 22. Targeted features • Job management: start/stop standalone TF/MXNet/PyTorch; start/stop distributed TF, MXNet (WIP), PyTorch (WIP); stop; monitoring (TensorBoard / history) • Model management (WIP): checkpoint / saved model; model serving • Library dependency management: BYOD (bring your own Docker image); Python library dependencies (WIP) • Handled by YARN: logs; job monitoring; best job scheduler (SLA, quota, etc.) • Submarine
  • 23. Architecture
  • 24. How other YARN feature helps Image courtesy of the NOAA Office of Ocean Exploration and Research, Gulf of Mexico 2018.
  • 25. GPU support on YARN (Apache Hadoop 3.1.0) • Why is isolation needed? • Multiple processes sharing a single GPU will be: • Serialized. • Prone to OOM. • GPU isolation on YARN: • Granularity is per GPU device. • Uses cgroups / Docker to enforce the isolation.
  • 26. Docker + GPU support on YARN (Apache Hadoop 3.1.0) • Most machine learning platforms have Python/R/cuDNN/CUDA dependencies. • Docker solves messy dependency issues, • but it may introduce problems for GPU base libraries. • nvidia-docker-plugin mounts the Nvidia driver, etc., when the container is launched. • YARN supports Docker as well as nvidia-docker-plugin. [Figure: two container stacks (TensorFlow 1.2, CUDA Library 5.0, Ubuntu 14.04) on a host OS with GPU base lib v1; one volume-mounts the host's GPU base lib v1, the other bundles GPU base lib v2 and fails.]
  • 27. Scheduler + Scale • Global scheduling enhancements (YARN-5139): • The YARN scheduler can allocate 3k+ containers per second ≈ 10 mil allocations / hour! • 10X throughput gains • Scale: • Microsoft: 52K nodes in a single cluster (RM federation) • https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/ • Exabytes of data are processed daily. More than 15,000 developers use it across the company.
  • 28. Scheduler: Placement constraints • YARN can now support many more use cases: • Co-locate the allocations of a job on the same rack (affinity) • Spread allocations across machines (anti-affinity) to minimize resource interference • Allow up to a specific number of allocations in a node group (cardinality) • It improves performance a lot! [Figure: TensorFlow ML workflow with 1M iterations using 32 workers with varying workers per node. Source: Medea: Scheduling of Long Running Applications in Shared Production Clusters (Panagiotis/Konstantinos, et al)]
  • 29. Finally, let’s get it running on YARN [Figure: cluster nodes with 128 G of memory each, some running LLAP, alongside GPU nodes running {Submarine}.]
  • 30. Status & Case Study Image courtesy of the NOAA Office of Ocean Exploration and Research, Gulf of Mexico 2018.
  • 31. Status & Plans • The alpha solution is merged to trunk (part of the 3.2.0 release) and is still under active development/testing. Umbrella JIRA: YARN-8135. • Submarine can run on Apache Hadoop 3.1+ releases (HDP 3.0+) as a single jar. • Supported runtime: YARN native service, with training running in Docker containers. • Work is under way on an adaptor to make TonY a runtime of Submarine. • TonY is open source: https://github.com/linkedin/TonY
  • 32. Netease (NASDAQ: NTES) Case Study • One of the largest online game/news/music providers in China. • YARN cluster of ~6k nodes in total. • 100k jobs per day, 40% of them Spark jobs. • 1000 ML jobs per day. • These run in a separate GPU K8s cluster (~500 nodes); all data comes from HDFS and is processed by Spark, etc. • Existing problems: • Low utilization (YARN tasks cannot leverage this cluster). • High maintenance cost (the separate cluster has to be managed). • Working with the community to develop and verify Submarine on a 20-node GPU cluster. • Plans to move all workloads to Submarine in the future.
  • 33. Thanks! • Source code / doc directory: https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine • Umbrella JIRA: https://issues.apache.org/jira/browse/YARN-8135 • Try it and give us feedback! • We need your contributions: please file sub-tickets under YARN-8135 and/or create a pull request at https://github.com/apache/hadoop.
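The one-command submissions on slides 16 to 19 share the same shape, so a script can assemble them from a few parameters. The following is a minimal sketch, not part of Submarine: the helper name `submarine_job_run` and its parameter names are hypothetical, and it uses only flags that appear on the slides. The jar path is the one from the demo notes; `<your docker image>` is left as the placeholder the slides use.

```python
import shlex


def submarine_job_run(jar, name, docker_image, num_workers, worker_resources,
                      worker_launch_cmd, input_path=None, checkpoint_path=None,
                      num_ps=0, ps_resources=None, ps_launch_cmd=None,
                      tensorboard=False):
    """Build the argv list for a Submarine `job run` call (hypothetical helper)."""
    argv = ["yarn", "jar", jar, "job", "run",
            "--name", name,
            "--docker_image", docker_image]
    if input_path:
        argv += ["--input_path", input_path]
    if checkpoint_path:
        argv += ["--checkpoint_path", checkpoint_path]
    argv += ["--num_workers", str(num_workers),
             "--worker_resources", worker_resources,
             "--worker_launch_cmd", worker_launch_cmd]
    if num_ps > 0:
        # Parameter servers need their own resources and launch command.
        if ps_resources is None or ps_launch_cmd is None:
            raise ValueError("ps_resources and ps_launch_cmd are required when num_ps > 0")
        argv += ["--num_ps", str(num_ps),
                 "--ps_resources", ps_resources,
                 "--ps_launch_cmd", ps_launch_cmd]
    if tensorboard:
        argv.append("--tensorboard")
    return argv


# Values taken from slide 16 (distributed TF training).
argv = submarine_job_run(
    jar="/tmp/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar",
    name="tf-job-001",
    docker_image="<your docker image>",
    input_path="hdfs://default/dataset/cifar-10-data",
    checkpoint_path="hdfs://default/tmp/cifar-10-jobdir",
    num_workers=2,
    worker_resources="memory=8G,vcores=2,gpu=2",
    worker_launch_cmd="cmd for worker ...",
    num_ps=2,
    ps_resources="memory=4G,vcores=2,gpu=0",
    ps_launch_cmd="cmd for ps",
)

# Print a shell-quoted command line that can be pasted into a terminal.
print(shlex.join(argv))
```

Returning a list rather than a string keeps the launch commands, which contain spaces and `&&`, as single arguments; the list can be handed to `subprocess.run` unchanged or quoted with `shlex.join` for display.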

Editor's notes

  1. As the workflow shows, only a tiny fraction of the code is actually devoted to model learning. A machine learning workflow usually needs a lot of support from the big data platform, such as data collection from different data sources, feature extraction, feature transformation, and so on. Let’s find out, step by step, how big data infrastructure can help machine learning.
  2. To do: add Oozie/Azkaban to control the workflow.
  3. To do: add slides about how easy it is to use Submarine.
  4. Demo commands:
     1) Run a normal distributed job.
     yarn app -destroy tf-job-001; yarn jar /tmp/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job run --name tf-job-001 --verbose --docker_image wtan/tf-1.8.0-gpu:0.0.3 --input_path hdfs://default/dataset/cifar-10-data --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --env YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/docker_resolv.conf:/etc/resolv.conf:ro" --num_workers 2 --worker_resources memory=8G,vcores=2,gpu=1 --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" --ps_docker_image wtan/tf-1.8.0-cpu:0.0.3 --num_ps 1 --ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
     2) Run a standalone job.
     yarn app -destroy tf-job-001; yarn jar /tmp/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job run --name tf-job-001 --verbose --docker_image wtan/tf-1.8.0-gpu:0.0.3 --input_path hdfs://default/dataset/cifar-10-data --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --env YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/docker_resolv.conf:/etc/resolv.conf:ro" --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
     3) Run a TensorBoard service.
     yarn app -destroy tensorboard-service; yarn jar /tmp/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job run --name tensorboard-service --verbose --docker_image wtan/tf-1.8.0-cpu:0.0.3 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --env YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/docker_resolv.conf:/etc/resolv.conf:ro" --num_workers 0 --tensorboard
  5. TF provides options to use less GPU memory than the whole device offers, but we cannot enforce this from outside the process.
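The demo launch commands in the notes embed `%input_path%` and `%checkpoint_path%` tokens, which presumably stand for the values passed via `--input_path` and `--checkpoint_path`. The sketch below only illustrates that kind of token substitution; it is an assumption about the behavior, not Submarine's actual implementation, and `expand_placeholders` is a hypothetical name.

```python
import re


def expand_placeholders(launch_cmd, values):
    """Replace %key% tokens with values[key]; unknown tokens are left as-is."""
    def substitute(match):
        key = match.group(1)
        return values.get(key, match.group(0))
    return re.sub(r"%(\w+)%", substitute, launch_cmd)


# Parameter-server launch command from the demo notes (without the leading cd).
cmd = ("python cifar10_main.py --data-dir=%input_path% "
       "--job-dir=%checkpoint_path% --num-gpus=0")

expanded = expand_placeholders(cmd, {
    "input_path": "hdfs://default/dataset/cifar-10-data",
    "checkpoint_path": "hdfs://default/tmp/cifar-10-jobdir",
})
print(expanded)
```

Leaving unknown tokens untouched, rather than raising, makes a missing value visible in the expanded command instead of silently dropping it.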