In this talk at MLConf 2017 in New York, I presented some enterprise requirements for AI and explained how we have been working on PaddlePaddle, the open-source deep learning platform, to address demand from non-Internet industries. This effort led to the development of a fault-recoverable deep learning framework running on general-purpose clusters managed by Kubernetes, an approach that differs philosophically and significantly from HPC.
3. A REAL REQUEST FOR AI
▸ How to control TV sets via voice
▸ AI Hub
▸ No. An Alexa in each room?
▸ AI API
▸ No. Business owners don’t want user behavior data to go to AI tech providers.
▸ AI on Cloud
▸ No. GPU instances are too expensive.
▸ AI on on-premises clusters
▸ Yes.
4.
5. CLOUD AND ON-PREMISE CLUSTERS
[Diagram: who runs where]

                    Internet              traditional
  big companies     on-premises cluster   on-premises cluster
  small companies   cloud                 on-premises cluster
6. THE SOLUTION - GENERAL PURPOSE CLUSTERS
[Diagram: the general-purpose cluster. GPU servers, multi-GPU servers, CPU servers, etc. run under Kubernetes, "a distributed operating system". On top of it run PaddlePaddle and Spark jobs: a speech model trainer, speech API servers, nginx, fluentd shipping logs to Kafka, online and offline data processing pipelines, and Hadoop HDFS holding labeled data and models. The cluster serves Internet clients: Web browsers, mobile apps, and IoT devices.]
7. CHALLENGES - GENERAL PURPOSE CLUSTERS
▸ group replicas of processes into jobs
▸ Web services, data processing pipelines, machine learning jobs.
▸ service isolation and multi-user
▸ online experiments require the real log data stream, so
▸ we run production jobs and experimental jobs on the same cluster.
▸ priority-based scheduling
▸ a high-priority (production) job can preempt low-priority (experiment) jobs.
▸ make full use of hardware
▸ e.g., schedule processes of a Hadoop job that requires network and disk bandwidth
and processes of a deep learning job that requires GPU on the same node.
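The priority-based preemption rule above can be sketched as a toy scheduler. This is purely illustrative: Kubernetes implements preemption through its own scheduler and PriorityClass objects, and the class and field names below are made up for the sketch.

```python
import heapq

class Process:
    def __init__(self, name, priority):
        self.name = name          # e.g. a production or an experiment job's process
        self.priority = priority  # higher number = higher priority

class Node:
    """A node with a fixed number of slots; higher-priority processes preempt lower ones."""
    def __init__(self, slots):
        self.slots = slots
        self.running = []  # min-heap keyed by priority, so the cheapest victim is at the top

    def schedule(self, proc):
        if len(self.running) < self.slots:
            heapq.heappush(self.running, (proc.priority, proc.name))
            return None  # scheduled; nobody evicted
        lowest_priority, victim = self.running[0]
        if proc.priority > lowest_priority:
            # Evict the lowest-priority process and take its slot.
            heapq.heapreplace(self.running, (proc.priority, proc.name))
            return victim
        return proc.name  # rejected: node is full of equal/higher-priority work

node = Node(slots=2)
node.schedule(Process("exp-1", priority=1))       # experiment job
node.schedule(Process("exp-2", priority=1))       # experiment job
evicted = node.schedule(Process("speech-api", priority=10))  # production preempts "exp-1"
```

The evicted experiment process is not lost: its job simply continues with fewer processes, which is exactly why the next slide requires jobs to be fault-tolerant.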
8. CHALLENGES - FAULT-TOLERANT JOBS
▸ auto-scaling
▸ there are often many active users during the day, so the cluster kills processes of
deep learning jobs and creates more Web service processes.
▸ at night, it kills some Web service processes to run more deep learning
processes.
▸ fault-recovery
▸ a job must tolerate a varying number of processes.
▸ speedup vs. fault-recovery
▸ speedup optimizes a job.
▸ speedup with fault tolerance optimizes the business.
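Why can a job tolerate a varying number of processes? Because work is pulled from a task queue rather than statically partitioned across trainers. A minimal sketch of that idea (the function and parameter names are invented for illustration, not PaddlePaddle API):

```python
from collections import deque

def run_pass(data_chunks, trainer_counts):
    """Process every chunk even though the number of trainers changes over time.

    trainer_counts[i] = how many trainer processes are alive at step i
    (the auto-scaler kills and creates them); returns chunks in processing order.
    """
    todo = deque(data_chunks)
    processed = []
    step = 0
    while todo:
        n = trainer_counts[min(step, len(trainer_counts) - 1)]
        # Each live trainer pulls one chunk from the shared queue this step.
        batch = [todo.popleft() for _ in range(min(n, len(todo)))]
        processed.extend(batch)
        step += 1
    return processed

chunks = list(range(10))
# Daytime: scaled down to 1 trainer; nighttime: scaled up to 3.
done = run_pass(chunks, trainer_counts=[4, 1, 1, 3, 3])
```

With static partitioning, killing a trainer would strand its share of the data; with a shared queue, scaling only changes throughput, not correctness.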
9. A PADDLEPADDLE JOB
[Diagram: a PaddlePaddle job. The global model is split into two shards: parameter server 1 holds global model shard 1/2 and parameter server 2 holds global model shard 2/2. Trainers 1, 2, and 3 each keep local copies of model shards 1/2 and 2/2 and exchange gradients/model with the parameter servers. A master dispatches tasks to the trainers.]
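The parameter-server architecture above can be sketched in a few lines. This is a toy model of the data flow, not PaddlePaddle's implementation: the class names are invented, the "model" is a plain vector split into two shards, and the loss is a trivial quadratic so the sketch stays self-contained.

```python
class ParameterServer:
    """Holds one shard of the global model and applies gradient updates (plain SGD)."""
    def __init__(self, shard, lr=0.1):
        self.shard = list(shard)
        self.lr = lr

    def push(self, grad):  # a trainer sends gradients for this shard
        self.shard = [w - self.lr * g for w, g in zip(self.shard, grad)]

    def pull(self):        # a trainer fetches the latest shard
        return list(self.shard)

# The global model is split into two shards, one per parameter server.
servers = [ParameterServer([0.0] * 4), ParameterServer([0.0] * 4)]

def trainer_step(servers, target):
    # A trainer pulls both shards into its local model copy, computes
    # gradients of the toy loss ||w - target||^2, and pushes them per shard.
    local = servers[0].pull() + servers[1].pull()
    grad = [2 * (w - t) for w, t in zip(local, target)]
    servers[0].push(grad[:4])
    servers[1].push(grad[4:])

target = [1.0] * 8
for _ in range(50):
    trainer_step(servers, target)
final = servers[0].pull() + servers[1].pull()
```

Sharding the model across parameter servers spreads both memory and network load: each trainer talks to every server, but each server only stores and updates its own fraction of the parameters.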
10. AUTO FAULT-RECOVERY
[Diagram: auto fault-recovery. The master of each job (job A, job B) keeps its task queues — todo, pending, and done — in etcd. Tasks move between queues: created → todo, dispatched → pending, completed → done, and on timeout a pending task returns to todo, so work lost with a crashed trainer gets re-dispatched.]
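The master's task state machine can be sketched as follows. This is an in-memory illustration with invented names; in the real design the queues live in etcd precisely so that a restarted master can recover them.

```python
import time
from collections import deque

class Master:
    """Task dispatcher for one job. States mirror the diagram:
    created -> todo, dispatched -> pending, completed -> done,
    and a pending task that times out goes back to todo, so a
    crashed trainer's task is retried by another trainer."""

    def __init__(self, tasks, timeout=5.0):
        self.todo = deque(tasks)   # created tasks wait here
        self.pending = {}          # task -> dispatch time
        self.done = []
        self.timeout = timeout

    def dispatch(self, now=None):
        now = time.monotonic() if now is None else now
        # Requeue tasks whose trainer died (no completion before the deadline).
        for task, t0 in list(self.pending.items()):
            if now - t0 > self.timeout:
                del self.pending[task]
                self.todo.append(task)
        if not self.todo:
            return None
        task = self.todo.popleft()
        self.pending[task] = now
        return task

    def complete(self, task):
        del self.pending[task]
        self.done.append(task)

# Deterministic walkthrough (explicit clocks instead of real time):
m = Master(["task1", "task2"], timeout=5.0)
t1 = m.dispatch(now=0.0)    # task1: todo -> pending
m.complete(t1)              # task1: pending -> done
t2 = m.dispatch(now=1.0)    # task2: todo -> pending
t3 = m.dispatch(now=10.0)   # task2 timed out, requeued, and re-dispatched
```

Because no task is ever marked done until a trainer reports completion, losing any trainer (or scaling the job down) only delays the job; it never loses work.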
11. KEEP OPEN
▸ Thanks to the Kubernetes community for their expertise in
distributed computing and their code reviews.
▸ We hope to see more traditional industries run their whole
business on their on-premises clusters.
▸ PaddlePaddle will stay open source.
▸ We are working on open-sourcing more AI technologies
based on PaddlePaddle.