nvidia-container-runtime
To run GPU-enabled containers, first switch the container runtime to NVIDIA's custom runtime.
Since Kubernetes does not support the --gpus option with Docker yet, the nvidia runtime should be set up as the default container runtime for Docker on the GPU node. This can be done by adding the default-runtime line into the Docker daemon config file, which is usually located on the system at /etc/docker/daemon.json.
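As a minimal sketch of that change, the Python helper below adds the default-runtime entry to an existing daemon configuration. The helper name is hypothetical, and the runtimes entry with the path /usr/bin/nvidia-container-runtime is an assumption based on a typical installation, so adjust it to wherever the runtime binary lives on your node.

```python
import json


def with_nvidia_default_runtime(config: dict) -> dict:
    """Return a copy of a Docker daemon config with nvidia as the default runtime."""
    updated = dict(config)
    runtimes = dict(updated.get("runtimes", {}))
    # Assumption: the runtime binary is installed at this path on the GPU node.
    runtimes.setdefault(
        "nvidia",
        {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []},
    )
    updated["runtimes"] = runtimes
    # This is the default-runtime line the text refers to.
    updated["default-runtime"] = "nvidia"
    return updated


if __name__ == "__main__":
    # Starts from an empty config; on a real node you would load
    # /etc/docker/daemon.json first and write the result back.
    print(json.dumps(with_nvidia_default_runtime({}), indent=4))
```

The Docker daemon has to be restarted before a changed daemon.json takes effect.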
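What a k8s device plugin needs to do
- Discover and operate on the device resources of the host;
- Implement the gRPC methods required of a device plugin, i.e. control how resources are allocated.
container & host
The privileges a running container has with respect to the host are controlled mainly by four options: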
Option | Description |
---|---|
--cap-add | Add Linux capabilities |
--cap-drop | Drop Linux capabilities |
--privileged | Give extended privileges to this container |
--device=[] | Allows you to run devices inside the container without the --privileged flag. |
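By default a Docker container is unprivileged; for example, you cannot run a Docker daemon inside a Docker container. This is because a container cannot access any devices by default, whereas a privileged container can. Container access isolation is built on cgroups, and access to devices is likewise enforced through cgroups.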
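To make the cgroup part more concrete, here is a rough sketch of how a cgroup v1 device whitelist entry (the "type major:minor access" format from the kernel's devices.txt) can be read and checked. It is a simplified model and not the kernel's actual matching logic. The function names are hypothetical, and the major number 195 for NVIDIA devices is an assumption about a typical driver setup.

```python
from typing import NamedTuple


class DeviceRule(NamedTuple):
    dev_type: str  # 'a' (all), 'b' (block) or 'c' (char)
    major: str     # a number or '*'
    minor: str     # a number or '*'
    access: str    # any combination of 'r' (read), 'w' (write), 'm' (mknod)


def parse_rule(line: str) -> DeviceRule:
    """Parse one whitelist entry such as 'c 195:* rwm'."""
    dev_type, numbers, access = line.split()
    major, minor = numbers.split(":")
    return DeviceRule(dev_type, major, minor, access)


def is_allowed(rules, dev_type, major, minor, access):
    """Return True if some rule grants every requested access letter."""
    for rule in rules:
        type_ok = rule.dev_type in ("a", dev_type)
        major_ok = rule.major in ("*", str(major))
        minor_ok = rule.minor in ("*", str(minor))
        if type_ok and major_ok and minor_ok and set(access) <= set(rule.access):
            return True
    return False


if __name__ == "__main__":
    # An unprivileged container only sees a handful of devices such as /dev/null (1:3).
    unprivileged = [parse_rule("c 1:3 rwm"), parse_rule("c 1:5 rwm")]
    # A privileged container is allowed everything.
    privileged = [parse_rule("a *:* rwm")]
    # Assumption: 195 is the character-device major used by /dev/nvidia0.
    print(is_allowed(unprivileged, "c", 195, 0, "rw"))
    print(is_allowed(privileged, "c", 195, 0, "rw"))
```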
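container access to device resources
A container can access a device only if that device is specified when the container is started, or declared when the container image is built.
The figure above clearly shows the effect of --privileged.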
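The two ways of granting device access, a single device through --device or everything through --privileged, can be sketched as docker run argument lists. The helper below is hypothetical and only builds the command line. The ubuntu image and the /dev/nvidia0 device node are assumptions chosen for illustration.

```python
import shlex


def docker_run_args(image, command, devices=(), cap_add=(), cap_drop=(), privileged=False):
    """Build the argv for `docker run` from the four privilege-related options."""
    args = ["docker", "run", "--rm"]
    if privileged:
        args.append("--privileged")
    for cap in cap_add:
        args += ["--cap-add", cap]
    for cap in cap_drop:
        args += ["--cap-drop", cap]
    for dev in devices:
        args.append(f"--device={dev}")
    return args + [image] + list(command)


if __name__ == "__main__":
    # Narrow: expose one device node only (assumed to be /dev/nvidia0).
    narrow = docker_run_args("ubuntu", ["ls", "-l", "/dev/nvidia0"], devices=["/dev/nvidia0"])
    # Broad: a privileged container can see every device on the host.
    broad = docker_run_args("ubuntu", ["ls", "/dev"], privileged=True)
    print(shlex.join(narrow))
    print(shlex.join(broad))
```

The printed commands can be pasted into a shell, or the lists can be passed to subprocess.run on a host that has Docker installed.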
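container loading/unloading kernel modules
Loading a kernel module from inside a container requires --privileged together with the SYS_MODULE capability.
As the figure above shows, running rmmod inside the Docker container does affect the host.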
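Whether a process actually holds SYS_MODULE can be checked from inside the container by reading the CapEff bitmask in /proc/self/status. The following is a small sketch of that check. The helper names are hypothetical, and the bit index 16 for CAP_SYS_MODULE is taken from linux/capability.h.

```python
import os

CAP_SYS_MODULE = 16  # bit index of CAP_SYS_MODULE in linux/capability.h


def effective_caps(status_text: str) -> int:
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_capability(mask: int, cap: int) -> bool:
    """Return True if bit number `cap` is set in the capability mask."""
    return bool((mask >> cap) & 1)


if __name__ == "__main__":
    status_path = "/proc/self/status"  # only present on Linux
    if os.path.exists(status_path):
        with open(status_path) as f:
            mask = effective_caps(f.read())
        print("CAP_SYS_MODULE:", has_capability(mask, CAP_SYS_MODULE))
```

Docker's default capability set does not include SYS_MODULE, so in an unprivileged container this would be expected to report False.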
nvml
NVIDIA monitors and manages GPU devices through NVML (the NVIDIA Management Library). NVIDIA describes it as follows:
A C-based API for monitoring and managing various states of the NVIDIA GPU devices. It provides a direct access to the queries and commands exposed via nvidia-smi. The runtime version of NVML ships with the NVIDIA display driver, and the SDK provides the appropriate header, stub libraries and sample applications. Each new version of NVML is backwards compatible and is intended to be a platform for building 3rd party applications.
The NVIDIA device plugin also uses NVML to monitor and manage GPUs.
Once a container can access the devices and read their state, the first step is complete.
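A quick way to confirm that a container can read device state is to query nvidia-smi, which exposes the same information as NVML. The sketch below shells out to nvidia-smi instead of calling NVML directly, so it is a stand-in for what the device plugin does rather than the plugin's own code. The function names and the chosen query fields are assumptions for illustration.

```python
import csv
import io
import subprocess

# Query a few fields as header-less CSV without unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpus(output: str):
    """Turn the CSV text printed by nvidia-smi into a list of dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(output), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, name, memory_mib = (field.strip() for field in row)
        gpus.append({"index": int(index), "name": name, "memory_mib": int(memory_mib)})
    return gpus


def list_gpus():
    """Run nvidia-smi (needs the NVIDIA driver) and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpus(result.stdout)
```

Calling list_gpus() inside a container that was started with the nvidia runtime should return one entry per visible GPU; without access to the devices, nvidia-smi fails and subprocess.run raises an error.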
device plugin implementation
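- Initialization: perform device-specific setup to make sure the devices are ready;
- Start a gRPC service, creating a Unix socket under the /var/lib/kubelet/device-plugins directory;
- Register the plugin with the kubelet through /var/lib/kubelet/device-plugins/kubelet.sock;
- After registration, the device plugin enters serving mode: it keeps monitoring device state and reports any change it detects to the kubelet.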
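The steps above can be modelled in a few lines of plain Python. This is an in-memory sketch only: it does not open a socket or speak gRPC (step 2 is reduced to computing the socket path), and the class name, the probe callable, and the nvidia.sock file name are hypothetical. The register_request fields follow the v1beta1 device plugin API.

```python
import os

DEVICE_PLUGIN_DIR = "/var/lib/kubelet/device-plugins"
KUBELET_SOCKET = os.path.join(DEVICE_PLUGIN_DIR, "kubelet.sock")


class PluginLifecycleSketch:
    def __init__(self, resource_name, socket_name, probe):
        self.resource_name = resource_name
        # Step 2: the plugin's own socket lives next to kubelet.sock.
        self.socket_path = os.path.join(DEVICE_PLUGIN_DIR, socket_name)
        self.probe = probe  # hypothetical callable returning {device_id: healthy}
        self.last_state = None

    def initialize(self):
        """Step 1: probe the devices once to make sure they are ready."""
        self.last_state = self.probe()
        return self.last_state

    def register_request(self):
        """Step 3: the data sent to the kubelet over KUBELET_SOCKET."""
        return {
            "version": "v1beta1",
            "endpoint": os.path.basename(self.socket_path),
            "resource_name": self.resource_name,
        }

    def poll(self):
        """Step 4: return the new device state if it changed, else None."""
        state = self.probe()
        if state != self.last_state:
            self.last_state = state
            return state
        return None


if __name__ == "__main__":
    plugin = PluginLifecycleSketch("nvidia.com/gpu", "nvidia.sock", lambda: {"gpu-0": True})
    plugin.initialize()
    print(plugin.register_request())
```

In a real plugin, poll() would run in a loop and every non-None result would be pushed to the kubelet.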
gRPC service
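A device plugin needs to implement the following gRPC methods: GetDevicePluginOptions, ListAndWatch, Allocate, GetPreferredAllocation, and PreStartContainer.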
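As a rough illustration of that service, the sketch below mirrors the methods as a plain Python class without any gRPC machinery. The class names and the dict-shaped messages are hypothetical. Returning NVIDIA_VISIBLE_DEVICES from Allocate is an assumption about how a GPU plugin might hand devices to the nvidia runtime, not a statement of what the NVIDIA plugin does in every configuration.

```python
from abc import ABC, abstractmethod


class DevicePluginService(ABC):
    """Python mirror of the methods a device plugin serves over gRPC."""

    @abstractmethod
    def GetDevicePluginOptions(self):
        """Report which optional methods are available."""

    @abstractmethod
    def ListAndWatch(self):
        """Yield the device list, then again whenever it changes."""

    @abstractmethod
    def Allocate(self, device_ids):
        """Return what a container needs in order to use the devices."""

    # Optional methods: no useful implementation is required.
    def GetPreferredAllocation(self, available, size):
        raise NotImplementedError

    def PreStartContainer(self, device_ids):
        raise NotImplementedError


class FakeGpuPlugin(DevicePluginService):
    def __init__(self, device_ids):
        self.device_ids = list(device_ids)

    def GetDevicePluginOptions(self):
        return {"pre_start_required": False, "get_preferred_allocation_available": False}

    def ListAndWatch(self):
        # A real plugin keeps the stream open; this sketch yields a single snapshot.
        yield [{"id": d, "health": "Healthy"} for d in self.device_ids]

    def Allocate(self, device_ids):
        unknown = set(device_ids) - set(self.device_ids)
        if unknown:
            raise ValueError(f"unknown devices: {sorted(unknown)}")
        # Assumption: devices are passed to the runtime through this variable.
        return {"envs": {"NVIDIA_VISIBLE_DEVICES": ",".join(device_ids)}}


if __name__ == "__main__":
    plugin = FakeGpuPlugin(["GPU-0", "GPU-1"])
    print(next(plugin.ListAndWatch()))
    print(plugin.Allocate(["GPU-1"]))
```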
Note: Plugins are not required to provide useful implementations for GetPreferredAllocation() or PreStartContainer(). Flags indicating which (if any) of these calls are available should be set in the DevicePluginOptions message sent back by a call to GetDevicePluginOptions(). The kubelet will always call GetDevicePluginOptions() to see which optional functions are available, before calling any of them directly.
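The caller's side of that note can be sketched as follows: read the options first, then decide which optional calls may be made. The function name is hypothetical and the options are modelled as a plain dict. The two flag names match the fields of the DevicePluginOptions message in the v1beta1 API.

```python
def optional_calls(options: dict) -> list:
    """Return the optional RPCs a kubelet-like caller may invoke."""
    calls = []
    if options.get("get_preferred_allocation_available", False):
        calls.append("GetPreferredAllocation")
    if options.get("pre_start_required", False):
        calls.append("PreStartContainer")
    return calls


if __name__ == "__main__":
    # A plugin that only implements PreStartContainer.
    print(optional_calls({"pre_start_required": True}))
```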
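P.S. The NVIDIA device plugin exists to allocate GPU resources in a k8s cluster; if you want every container to be able to access the GPUs, it is better not to install the NVIDIA device plugin.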
Reference
https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities
https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt
https://developer.nvidia.com/nvidia-management-library-nvml