The document discusses using Intel Analytics Zoo and Alluxio for ultra-fast deep learning in hybrid cloud environments. Analytics Zoo provides an end-to-end deep learning pipeline that lets users prototype on a laptop with sample data and experiment on clusters with historical data, while Alluxio enables zero-copy access to remote data for accelerated analytics. Performance tests showed Alluxio providing up to a 1.5x speedup in data loading compared to reading directly from cloud storage. Real-world customers use the combined Analytics Zoo and Alluxio solution for deep learning, recommendation systems, computer vision, and time series applications.
5. DATA ORCHESTRATION SUMMIT
Analytics Zoo: End-to-End DL Pipeline Made Easy for Big Data
▪Prototype on a laptop using sample data
▪Experiment on clusters with historical data
▪Deploy with production, distributed big data pipelines
https://github.com/intel-analytics/analytics-zoo
▪“Zero” code change from laptop to distributed cluster
▪Direct access to production big data (Hadoop/Hive/HBase)
▪Easy prototyping of the end-to-end pipeline
▪Seamless deployment on production big data clusters
6. Analytics Zoo: Unified Data Analytics and AI Platform
https://github.com/intel-analytics/analytics-zoo
▪End-to-end Pipelines: Distributed TensorFlow & PyTorch on Spark; Spark Dataframes & ML Pipelines for DL; RayOnSpark; InferenceModel
▪ML Workflow: AutoML; Automatic Cluster Serving
▪Models & Algorithms: Recommendation; Time Series; Computer Vision; NLP
▪Python Libraries (Numpy/Pandas/sklearn/…); DL Frameworks (TF/PyTorch/OpenVINO/…); Distributed Analytics (Spark/Flink/Ray/…)
▪Compute Environment: Laptop; Hadoop Cluster; K8s Cluster; Cloud
Powered by oneAPI
7. Distributed TensorFlow/PyTorch on Spark
#pyspark code
train_rdd = spark.hadoopFile(…).map(…)
dataset = TFDataset.from_rdd(train_rdd, …)

#tensorflow code
import tensorflow as tf
slim = tf.contrib.slim
images, labels = dataset.tensors
with slim.arg_scope(lenet.lenet_arg_scope()):
    logits, end_points = lenet.lenet(images, …)
loss = tf.reduce_mean(
    tf.losses.sparse_softmax_cross_entropy(
        logits=logits, labels=labels))

#distributed training on Spark
optimizer = TFOptimizer.from_loss(loss, Adam(…))
optimizer.optimize(end_trigger=MaxEpoch(5))
Write TensorFlow/PyTorch inline with Spark code (Analytics Zoo API in blue)
Example users:
▪Baidu/iQiyi: Recommendation (similar to Wide & Deep)
▪SK Telecom: Time series prediction
▪Midea: Object detection
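The TFOptimizer call above drives synchronous data-parallel training: each Spark partition computes a local gradient and the driver applies the averaged update. A minimal pure-Python sketch of that pattern (simulated partitions and a one-parameter least-squares model; none of these names are Analytics Zoo API):

```python
# Data-parallel SGD sketch: fit y = w*x across simulated partitions.
# Each partition computes a local gradient ("map" on executors); the
# driver averages them and updates w ("reduce"), mimicking the
# synchronous updates TFOptimizer performs each iteration.

def local_gradient(w, partition):
    # Gradient of mean squared error 0.5*(w*x - y)^2 with respect to w.
    return sum((w * x - y) * x for x, y in partition) / len(partition)

def distributed_sgd(partitions, lr=0.01, epochs=50):
    w = 0.0
    for _ in range(epochs):
        grads = [local_gradient(w, p) for p in partitions]
        w -= lr * sum(grads) / len(grads)
    return w

# Data generated from y = 3*x, split round-robin across 4 partitions.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
partitions = [data[i::4] for i in range(4)]
w = distributed_sgd(partitions)
print(round(w, 3))  # → 3.0
```

The averaged-gradient step is why "zero code change" holds: the per-partition model code is identical on a laptop (one partition) and on a cluster (many).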
8. RayOnSpark
sc = init_spark_on_yarn(...)
ray_ctx = RayContext(sc=sc, ...)
ray_ctx.init()

#Ray code
@ray.remote
class TestRay():
    def hostname(self):
        import socket
        return socket.gethostname()

actors = [TestRay.remote() for i in range(0, 100)]
print([ray.get(actor.hostname.remote()) for actor in actors])
ray_ctx.stop()
Run Ray programs directly on YARN/Spark/K8s cluster
“RayOnSpark: Running Emerging AI Applications on Big Data Clusters with Ray and Analytics Zoo”
https://medium.com/riselab/rayonspark-running-emerging-ai-applications-on-big-data-clusters-with-ray-and-analytics-zoo-923e0136ed6a
Example users:
▪NeuSoft/BMW: AutoML for time series
▪MorningStar: AutoML for time series
▪Tencent Cloud: AutoML for time series
9. Spark Dataframe & ML Pipeline for DL
#Spark dataframe code
parquetfile = spark.read.parquet(…)
train_df = parquetfile.withColumn(…)

#Keras API
model = Sequential() \
    .add(Convolution2D(32, 3, 3)) \
    .add(MaxPooling2D(pool_size=(2, 2))) \
    .add(Flatten()).add(Dense(10))

#Spark ML pipeline code
estimator = NNEstimator(model, CrossEntropyCriterion()) \
    .setMaxEpoch(5) \
    .setFeaturesCol("image")
nnModel = estimator.fit(train_df)
Example users:
▪WorldBank: Image classification
▪Cern: High-energy particle collision classification
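NNEstimator plugs into Spark ML Pipelines because it follows the standard Estimator/Transformer contract: fit() consumes a DataFrame and returns an immutable model whose transform() appends a prediction column. A toy pure-Python sketch of that contract (plain lists stand in for DataFrames; these classes are illustrative, not the Analytics Zoo ones):

```python
class SketchEstimator:
    """Estimator half of the contract: holds config, produces a model."""
    def __init__(self, max_epoch=5):
        self.max_epoch = max_epoch

    def fit(self, rows):
        # "Training" is just the label mean here, standing in for the
        # SGD epochs NNEstimator would run on the wrapped model.
        mean = sum(label for _, label in rows) / len(rows)
        return SketchModel(mean)

class SketchModel:
    """Transformer half: immutable result of fit()."""
    def __init__(self, prediction):
        self.prediction = prediction

    def transform(self, rows):
        # Append a prediction "column" to each row.
        return [(features, self.prediction) for features, _ in rows]

train = [("image_a", 1.0), ("image_b", 3.0)]
model = SketchEstimator(max_epoch=5).fit(train)
print(model.transform([("image_c", None)]))  # → [('image_c', 2.0)]
```

Because fit() returns a separate model object, the trained network composes with any other Spark ML pipeline stage.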
10. Scalable AutoML for Time Series Prediction
Automated feature selection, model selection and hyperparameter tuning using Ray
“Scalable AutoML for Time Series Prediction using Ray and Analytics Zoo”
https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139
tsp = TimeSequencePredictor(
    dt_col="datetime",
    target_col="value")
pipeline = tsp.fit(train_df, val_df,
    metric="mse", recipe=RandomRecipe())
pipeline.predict(test_df)

Example users:
▪NeuSoft/BMW: Datacenter AIOps
▪Airtel: Datacenter AIOps
▪Tencent Cloud: Time series analysis service
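RandomRecipe-style search samples hyperparameters at random, trains one candidate per sample, and keeps the best by the chosen metric; Analytics Zoo parallelizes the trials with Ray. A pure-Python sketch of that core loop with a toy exponential-smoothing forecaster (the model and search space are illustrative only):

```python
import random

def train_and_score(alpha, series):
    # Toy model: exponential smoothing; score = one-step-ahead MSE.
    s, sq_err = series[0], 0.0
    for x in series[1:]:
        sq_err += (x - s) ** 2
        s = alpha * x + (1 - alpha) * s
    return sq_err / (len(series) - 1)

def random_search(series, n_trials=20, seed=0):
    # Sample the search space at random; keep the best-scoring trial.
    rng = random.Random(seed)
    best_alpha, best_mse = None, float("inf")
    for _ in range(n_trials):
        alpha = rng.uniform(0.0, 1.0)
        mse = train_and_score(alpha, series)
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha, best_mse

series = [float(i % 7) for i in range(60)]  # repeating weekly-like pattern
alpha, mse = random_search(series)
print(0.0 <= alpha <= 1.0, mse >= 0.0)  # → True True
```

Since trials are independent, the loop body is trivially distributable, which is what makes Ray a good fit for scaling it out.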
11. Distributed Inference Made Easy with Cluster Serving
[Architecture diagram: a simple Python script enqueues requests (P1–P5) into the input queue; models running on a Hadoop/YARN/K8s cluster consume them over the network connection and write prediction results (R1–R5) to the output queue (or files/DB tables); the client side runs on a local node or Docker container.]
https://software.intel.com/en-us/articles/distributed-inference-made-easy-with-analytics-zoo-cluster-serving
#enqueue request
input = InputQueue()
img = cv2.imread(path)
img = cv2.resize(img, (224, 224))
input.enqueue_image(id, img)

#dequeue response
output = OutputQueue()
result = output.dequeue()
for k in result.keys():
    print(k + ": " + str(json.loads(result[k])))
√ Users freed from complex distributed inference solutions
√ Distributed, real-time inference automatically managed by Analytics Zoo
− TensorFlow, PyTorch, Caffe, BigDL, OpenVINO, …
− Spark Streaming, Flink, …
Example users:
▪Ping’an: Recommendation service
▪Inspur: Inference cluster
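The InputQueue/OutputQueue pair decouples clients from the inference cluster: clients only talk to queues, while serving workers consume inputs, run the model, and publish predictions. A single-process sketch of that pattern using Python's standard `queue` and a stand-in model (the real system uses a distributed queue and cluster-side workers; all names here are illustrative):

```python
import queue
import threading

input_q, output_q = queue.Queue(), queue.Queue()

def fake_model(image):
    # Stand-in for real inference: "classify" by payload size.
    return "cat" if len(image) % 2 == 0 else "dog"

def serving_worker():
    # Cluster-side loop: dequeue a request, predict, enqueue the result.
    while True:
        item = input_q.get()
        if item is None:          # shutdown signal
            break
        req_id, image = item
        output_q.put((req_id, fake_model(image)))

w = threading.Thread(target=serving_worker)
w.start()

# Client side: enqueue requests, then collect responses by request id.
for req_id, image in [("r1", b"\x00" * 4), ("r2", b"\x00" * 5)]:
    input_q.put((req_id, image))
input_q.put(None)
w.join()

results = dict(output_q.get() for _ in range(2))
print(results)  # → {'r1': 'cat', 'r2': 'dog'}
```

Because both sides see only queues, workers can be added or removed without any client change, which is the property Cluster Serving exploits for scaling.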
14. Big Data Journey & Innovation
▪Co-located: compute & HDFS on the same cluster (MR/Hive with HDFS)
▪Disaggregated: compute & HDFS on different clusters (Hive separate from HDFS)
▪Disaggregated: HDFS data in the cloud, public or private
▪HDFS for hybrid cloud: enable & accelerate access to big data across data centers; support analytics across datacenters
15. Challenge: Data Gets Increasingly Remote from Compute
▪ Challenging scenarios
▪ Data-driven initiatives in need of more compute
▪ Hadoop system on-prem, but it’s remote
▪ Object data growth in a cloud region, but it’s remote
▪ How to make remote data local to the compute without copies?
▪ Business benefits
▪ Faster data-driven insights: data immediately available for compute
▪ More elastic computing power to solve problems quicker
▪ Up to 80% lower egress costs
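The "up to 80% lower egress costs" claim follows directly from the cache hit ratio: with a cache such as Alluxio in front of remote storage, only misses cross the network boundary, so egress shrinks roughly in proportion to the hit ratio. A back-of-envelope sketch (data volume, hit ratio, and per-GB price are all hypothetical):

```python
def egress_cost(total_read_gb, hit_ratio, price_per_gb):
    # Only cache misses travel across the WAN and incur egress charges.
    return total_read_gb * (1 - hit_ratio) * price_per_gb

baseline = egress_cost(10_000, 0.0, 0.09)    # no cache: every read is remote
with_cache = egress_cost(10_000, 0.8, 0.09)  # 80% of reads hit the cache
saving = 1 - with_cache / baseline
print(round(baseline), round(with_cache), round(saving, 2))  # → 900 180 0.8
```

Repeated training epochs over the same dataset are exactly the access pattern that pushes the hit ratio high.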
16. Solution: “Zero-Copy” Bursting to Scale to the Cloud
▪Analytics Zoo + Alluxio: accelerate big data frameworks on the public cloud
▪Analytics Zoo + Alluxio: burst on-premise big data workloads in hybrid cloud environments
18. Environments for Performance Results
EC2 instance type: r5.8xlarge
Number of vCPUs per instance: 32
Memory per instance: 256 GB
Network speed: 10 Gbps
Disk space: 100 GB
Operating system: Ubuntu 18.04
Apache Spark version: 2.4.3
BigDL version: 0.10.0
Analytics Zoo version: 0.7.0
Alluxio version: 2.2.0
19. Environments for Performance Results
Used 6 r5.8xlarge instances, one Spark worker per instance, giving 6 executors.
Example used: Inception model on ImageNet
https://github.com/intel-analytics/analytics-zoo/tree/master/zoo/src/main/scala/com/intel/analytics/zoo/examples/inception
23. Conclusion
▪Deep learning/machine learning analytics in the hybrid cloud is becoming an industry trend
▪Analytics Zoo provides a unified data analytics and AI platform for big data
▪Remote access to big data poses a challenge for Analytics Zoo applications in the hybrid cloud
▪The Analytics Zoo + Alluxio hybrid cloud solution accelerates data loading in Analytics Zoo applications and deep learning analytics on big data systems