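Kubernetes is currently the most popular open-source system for automating the deployment, scaling, and resource management of containerized applications. As Kubernetes has risen, more and more companies have chosen to deploy their business applications on it. Besides typical workloads such as web services and databases, distributed training for deep learning frameworks is also being deployed on Kubernetes.
However, submitting a deep learning training job on Kubernetes is not as straightforward as on a traditional high-performance computing MPI platform. Back in 2017, the Kubernetes community published the article Run Deep Learning with PaddlePaddle on Kubernetes, which analyzed the demands PaddlePaddle places on the underlying resources. Based on PaddlePaddle's requirements for fault tolerance, elastic scaling, and resource isolation, the article proposed running PaddlePaddle on Kubernetes as a best practice.
Since the release of Paddle Fluid 1.0, PaddlePaddle (飞桨) has made considerable progress in platform deployment and job scheduling management. With Kubernetes, PaddlePaddle can allocate CPU, GPU, and other hardware resources sensibly, scale training jobs elastically, and significantly improve the utilization of compute resources. There is still room for improvement, however, in creating and scheduling tasks in parallel, managing the life cycle of training jobs, affinity scheduling of compute resources, and optimizing scheduling policies. To improve the computing efficiency of the PaddlePaddle framework, the PaddlePaddle team and the Volcano team jointly released the PaddlePaddle on Volcano solution.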
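Volcano meets PaddlePaddle's basic requirements for resource creation and resource scheduling. Its batch creation and batch scheduling of compute tasks give PaddlePaddle jobs automated life cycle management. Its gang-scheduler policy satisfies the "all or nothing" scheduling constraint of PServers and Trainers. Its Queue and priority logic control the execution order of compute jobs in the cluster. Its fair-share and GPU affinity scheduling place tasks in closer agreement with the node resource and network topology requirements of PServers and Trainers, which improves computing efficiency.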
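The "all or nothing" constraint can be illustrated with a toy model. The sketch below is not Volcano's scheduler; it only assumes that every pod asks for a number of generic "slots" and that each node advertises its free slots. The names PodRequest and gang_schedule are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PodRequest:
    """A pod of the job and the number of slots it needs (hypothetical model)."""
    name: str
    slots: int


def gang_schedule(pods: List[PodRequest],
                  free_slots: Dict[str, int]) -> Optional[Dict[str, str]]:
    """Place every pod of the job or none of them.

    Returns a pod -> node mapping when the whole gang fits, otherwise None.
    """
    remaining = dict(free_slots)  # work on a copy so a failed attempt changes nothing
    placement: Dict[str, str] = {}
    # First-fit, largest requests first (a simple heuristic chosen for this sketch).
    for pod in sorted(pods, key=lambda p: p.slots, reverse=True):
        node = next((n for n, free in remaining.items() if free >= pod.slots), None)
        if node is None:
            return None  # one pod does not fit, so the whole job is rejected
        remaining[node] -= pod.slots
        placement[pod.name] = node
    return placement


job = [PodRequest("pserver-0", 1), PodRequest("pserver-1", 1),
       PodRequest("trainer-0", 1), PodRequest("trainer-1", 1)]

print(gang_schedule(job, {"node-a": 2, "node-b": 1}))  # None: only 3 slots for 4 pods
print(gang_schedule(job, {"node-a": 2, "node-b": 2}))  # all 4 pods are placed
```

Without this constraint, the first cluster could start three of the four pods, which would then hold resources while waiting for a peer that can never be scheduled.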
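Using the Kubernetes CRD mechanism, Volcano introduces a resource object with "apiVersion" set to "batch.volcano.sh/v1alpha1" and "kind" set to "Job" to describe a compute job. By configuring and creating a Volcano Job, you can use Volcano to create, manage, and schedule compute tasks. Volcano must first be installed in the Kubernetes cluster; for installation instructions, see the Volcano official website.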
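As a rough illustration of such a resource object, the sketch below assembles a minimal Volcano Job manifest as a Python dictionary and prints it as JSON, which kubectl also accepts. The job name, the image, and the helper volcano_job are assumptions made for this example and are not taken from the PaddlePaddle on Volcano solution. Setting minAvailable to the total pod count is one way to express the "all or nothing" requirement.

```python
import json


def volcano_job(name, image, pservers=2, trainers=2):
    """Build a minimal Volcano Job manifest (hypothetical helper for illustration)."""

    def task(role, replicas):
        # One task per role; each task carries an ordinary pod template.
        return {
            "name": role,
            "replicas": replicas,
            "template": {
                "spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{"name": role, "image": image}],
                }
            },
        }

    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            # Start only when every PServer and Trainer pod can be scheduled.
            "minAvailable": pservers + trainers,
            "tasks": [task("pserver", pservers), task("trainer", trainers)],
        },
    }


# The name and image below are placeholders, not values from the original setup.
print(json.dumps(volcano_job("ctr-volcano", "example.com/paddle-ctr:latest"), indent=2))
```

The printed document can be saved to a file and submitted with kubectl once Volcano is installed in the cluster.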
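First, run the distributed job in the mode recommended on the PaddlePaddle official website: create a Kubernetes ReplicaSet with 2 replicas to run the PServers, then create a Kubernetes Job with a parallelism of 2 to run the Trainers.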
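The two objects described above could look roughly like the following sketch, which builds both manifests as Python dictionaries. The object names and labels are taken from the output shown in this article; the image, the apiVersion values (the original run used the older extensions group for the ReplicaSet), and the completions count are assumptions for illustration.

```python
import json

IMAGE = "example.com/paddle-ctr:latest"  # placeholder image, not from the original setup


def pserver_replicaset(replicas=2):
    """ReplicaSet that keeps `replicas` PServer pods running."""
    labels = {"paddle-job-pserver": "fluid-ctr"}
    return {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "metadata": {"name": "fluid-ctr-pserver"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "pserver", "image": IMAGE}]},
            },
        },
    }


def trainer_job(parallelism=2):
    """Batch Job that runs `parallelism` Trainer pods at the same time."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "fluid-ctr-trainer"},
        "spec": {
            "parallelism": parallelism,
            "completions": parallelism,  # assumed: each trainer must finish once
            "template": {
                "metadata": {"labels": {"paddle-job": "fluid-ctr"}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "trainer", "image": IMAGE}],
                },
            },
        },
    }


for manifest in (pserver_replicaset(), trainer_job()):
    print(json.dumps(manifest, indent=2))
```

Note that the two objects are created and scheduled independently, so nothing guarantees that all PServer and Trainer pods start together.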
Create the PServer task
replicaset.extensions/fluid-ctr-pserver created
NAME DESIRED CURRENT READY AGE
fluid-ctr-pserver 2 2 2 5
fluid-ctr-pserver-b9w99 1/1 Running 0 9m18s
fluid-ctr-pserver-pb9vd 1/1 Running 0 9m18s
+ case "$1" in
+ start_fluid_process
+ pserver_label=paddle-job-pserver=fluid-ctr
+ trainer_label=paddle-job=fluid-ctr
+ hostname=c-rlnrdybm-muamumvq-1
+ task_index=
+ '[' PSERVER == TRAINER ']'
+ '[' PSERVER == PSERVER ']'
+ stdbuf -oL python /root/k8s_tools.py wait_pods_running paddle-job-pserver=fluid-ctr 2
label selector: paddle-job-pserver=fluid-ctr, desired: 2
current cnt: 0 sleep for 5 seconds...
+ '[' PSERVER == TRAINER ']'
+ '[' PSERVER == WORKER ']'
++ python /root/k8s_tools.py fetch_endpoints paddle-job-pserver=fluid-ctr 30236
+ export PADDLE_PSERVERS=192.168.48.24:30236,192.168.48.25:30237
+ PADDLE_PSERVERS=192.168.48.24:30236,192.168.48.25:30237
++ python /root/k8s_tools.py fetch_ips paddle-job=fluid-ctr
+ export PADDLE_TRAINER_IPS=
+ PADDLE_TRAINER_IPS=
+ '[' PSERVER == TRAINER ']'
+ '[' PSERVER == WORKER ']'
++ python /root/k8s_tools.py fetch_id paddle-job-pserver=fluid-ctr
+ task_index=0
+ export PADDLE_TRAINER_ID=0
+ PADDLE_TRAINER_ID=0
+ export PADDLE_PSERVER_ID=0
+ PADDLE_PSERVER_ID=0
+ stdbuf -oL sh -c 'cd /workspace/ctr && python train.py --is_local 0 --cloud_train 1'
2019-09-03 06:43:10,661 - INFO - run dist training
2019-09-03 06:43:10,715 - INFO - run pserver
get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
I0903 06:43:10.826609 41 grpc_server.cc:435] Server listening on 192.168.48.24:30236 selected port:
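The PADDLE_PSERVERS value in the trace above suggests that each PServer is assigned the base port plus its position in the pod list. The following sketch reproduces that observed output; it is a reconstruction from the log and not the actual code of k8s_tools.py, and the function name build_endpoints is hypothetical.

```python
def build_endpoints(ips, base_port):
    """Join ip:port pairs, giving the i-th pod the port base_port + i.

    This mirrors the value seen in the log; the real helper may use a
    different rule (for example, the same port for every pod).
    """
    return ",".join(f"{ip}:{base_port + i}" for i, ip in enumerate(ips))


pserver_ips = ["192.168.48.24", "192.168.48.25"]  # pod IPs taken from the log
print(build_endpoints(pserver_ips, 30236))
# 192.168.48.24:30236,192.168.48.25:30237
```

Both PServers and Trainers read this comma-separated list from the environment, so every process sees the same ordering of endpoints.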
job.batch/fluid-ctr-trainer created
fluid-ctr-pserver-b9w99 1/1 Running 0 87m
fluid-ctr-pserver-pb9vd 1/1 Running 0 87m
fluid-ctr-trainer-lg9n5 1/1 Running 0 12s
fluid-ctr-trainer-tvr99 1/1 Running 0 12s
+ case "$1" in
+ start_fluid_process
+ pserver_label=paddle-job-pserver=fluid-ctr
+ trainer_label=paddle-job=fluid-ctr
+ hostname=c-rlnrdybm-muamumvq-2
+ task_index=
+ '[' TRAINER == TRAINER ']'
+ stdbuf -oL python /root/k8s_tools.py wait_pods_running paddle-job-pserver=fluid-ctr 2
label selector: paddle-job-pserver=fluid-ctr, desired: 2
+ '[' TRAINER == TRAINER ']'
+ stdbuf -oL python /root/k8s_tools.py wait_pods_running paddle-job=fluid-ctr 2
label selector: paddle-job=fluid-ctr, desired: 2
++ python /root/k8s_tools.py fetch_endpoints paddle-job-pserver=fluid-ctr 30236
+ export PADDLE_PSERVERS=192.168.48.24:30236,192.168.48.25:30237
+ PADDLE_PSERVERS=192.168.48.24:30236,192.168.48.25:30237
++ python /root/k8s_tools.py fetch_ips paddle-job=fluid-ctr
+ export PADDLE_TRAINER_IPS=192.168.48.24,192.168.48.25
+ PADDLE_TRAINER_IPS=192.168.48.24,192.168.48.25
+ '[' TRAINER == TRAINER ']'
+ check_failed_cnt 1
+ max_failed=1
++ python /root/k8s_tools.py count_pods_by_phase paddle-job=fluid-ctr Failed
+ failed_count=0
+ '[' 0 -gt 1 ']'
++ python /root/k8s_tools.py fetch_id paddle-job=fluid-ctr
+ task_index=0
+ export PADDLE_TRAINER_ID=0
+ PADDLE_TRAINER_ID=0
+ export PADDLE_PSERVER_ID=0
+ PADDLE_PSERVER_ID=0
+ stdbuf -oL sh -c 'cd /workspace/ctr && python train.py --is_local 0 --cloud_train 1'
2019-09-03 08:10:20,888 - INFO - run dist training
2019-09-03 08:10:20,951 - INFO - download the training materials
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 433M 100 433M 0 0 70.9M 0 0:00:06 0:00:06 --:--:-- 97.0M
2019-09-03 08:11:04,522 - INFO - run trainer
2019-09-03 08:11:04,591 - WARNING -
I0903 08:11:04.594007 25 parallel_executor.cc:329] The number of CPUPlace, which is used in ParallelExecutor, is 2. And the Program will be copied 2 copies
I0903 08:11:04.875757 25 rpc_client.h:101] init rpc client with trainer_id 0
2019-09-03 08:11:38,625 - INFO - TRAIN --> pass: 0 batch: 0 loss: 0.697331115723 auc: 0.500826068453, batch_auc: 0.500826068453
2019-09-03 08:11:38,967 - INFO - TRAIN --> pass: 0 batch: 1 loss: 0.652093688965 auc: 0.48451329672, batch_auc: 0.48451329672
2019-09-03 08:11:39,242 - INFO - TRAIN --> pass: 0 batch: 2 loss: 0.629092956543 auc: 0.485173881519, batch_auc: 0.485173881519
2019-09-03 08:11:39,577 - INFO - TRAIN --> pass: 0 batch: 3 loss: 0.603850708008 auc: 0.482131778494, batch_auc: 0.482131778494
2019-09-03 08:11:39,874 - INFO - TRAIN --> pass: 0 batch: 4 loss: 0.591485412598 auc: 0.479737304993, batch_auc: 0.479737304993
2019-09-03 08:11:40,133 - INFO - TRAIN --> pass: 0 batch: 5 loss: 0.58376159668 auc: 0.478554220739, batch_auc: 0.478554220739
2019-09-03 08:11:40,385 - INFO - TRAIN --> pass: 0 batch: 6 loss: 0.561969116211 auc: 0.481465857424, batch_auc: 0.481465857424
2019-09-03 08:11:40,637 - INFO - TRAIN --> pass: 0 batch: 7 loss: 0.557065185547 auc: 0.486014931119, batch_auc: 0.486014931119
2019-09-03 08:11:40,890 - INFO - TRAIN --> pass: 0 batch: 8 loss: 0.562498413086 auc: 0.489651573333, batch_auc: 0.489651573333
2019-09-03 08:11:41,158 - INFO - TRAIN --> pass: 0 batch: 9 loss: 0.566428283691 auc: 0.489853260221, batch_auc: 0.49137884426
2019-09-03 08:11:41,452 - INFO - TRAIN --> pass: 0 batch: 10 loss: 0.564840087891 auc: 0.492880386228, batch_auc: 0.494013763938
2019-09-03 08:11:41,742 - INFO - TRAIN --> pass: 0 batch: 11 loss: 0.564809204102 auc: 0.493201528907, batch_auc: 0.498872381582
2019-09-03 08:11:42,056 - INFO - TRAIN --> pass: 0 batch: 12 loss: 0.584479736328 auc: 0.494151972036, batch_auc: 0.503926628391
2019-09-03 08:11:42,329 - INFO - TRAIN --> pass: 0 batch: 13 loss: 0.615677246094 auc: 0.49252557362, batch_auc: 0.5028352489
fluid-ctr-pserver-b9w99 1/1 Running 0 177m
fluid-ctr-pserver-pb9vd 1/1 Running 0 177m
fluid-ctr-trainer-lg9n5 0/1 Completed 0 90m
fluid-ctr-trainer-tvr99 0/1 Completed 0 90m